Development of Part of Speech Tagging and Syntactic
Analysis Software for Chinese Text
by
Daniel Tianhang Hu
Submitted to the Department of Electrical Engineering and Computer Science in
Partial Fulfillment of the Requirements of the Degrees of
Bachelor of Science in Electrical Engineering and Computer Science
and
Master of Engineering in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
May 2001
©Daniel T. Hu. All rights reserved.
The author hereby grants to MIT the permission to distribute publicly paper and
electronic copies of this thesis and to grant others the right to do so.
Author
Department of Electrical Engineering and Computer Science
May 29th, 2001
Certified by
Dr. Amar Gupta
Co-Director, PROFIT Initiative
Thesis Supervisor
May 29th, 2001
Accepted by
Arthur C. Smith
Chairman, Departmental Committee on Graduate Theses
Development of Part of Speech Tagging and Syntactic Analysis Software for Chinese Text
5/29/01
Development of Part of Speech Tagging and Syntactic Analysis Software for
Chinese Text
By
Daniel Tianhang Hu
Submitted to the Department of Electrical Engineering and Computer Science in Partial
Fulfillment of the Requirements of the Degrees of
Bachelor of Science in Electrical Engineering and Computer Science
and
Master of Engineering in Electrical Engineering and Computer Science
Abstract
The goal of this project is to complete initial development of Chinese Automatic
Content Extraction software. This Chinese processing system is integrated with the
Alembic system (MITRE's in-house NLP software for text processing), and has complete
segmentation, part of speech tagging, and named entity detection capabilities.
First phase development involves establishing segmentation and part of speech
tagging capabilities. Three different segmentation algorithms are implemented and
integrated with the Alembic system, and one off-the-shelf package is tested and
described. Different segmentation and part of speech guidelines are explored. Efforts
include compiling different part of speech wordlists, segmentation wordlists, and
supplementary wordlists to aid in segmentation, part of speech tagging, and machine
training. Results are on par with industry standards. We found that replacement of the
wordlists and resources within the existing segmentation/POS system can result in
significant improvements. State-of-the-art software from external sources is shown to
work well with the Alembic environment. Should we choose not to develop our own
segmenter, a license for OTS software can be readily obtained.
Second phase development involves named entity detection. Training is done with
pre-tagged, cleaned, and re-formatted Chinese newswire data. While training shows
promise, higher performance is quickly achieved through manual rule generation based
on analysis of entity POS formations and additional specialized wordlists.
Throughout the development effort, guidelines are tailored to better meet the needs of
Entity Detection and Tracking, and Relations Detection and Characterization.
Thesis Supervisor:
Dr. Amar Gupta
Co-Director PROFIT Initiative
Massachusetts Institute of Technology
Author: Daniel Hu
MIT-EECS
Page 1
Acknowledgement
I would like to thank Dr. Amar Gupta for providing this opportunity to work with
the MITRE Corporation. His support has been unwavering for the past year. Dr. Doug
Norman extended me a helping hand at the most critical of all moments, and I am
forever grateful to him for his belief in me.
I would also like to thank Dr. David Day and Dr. Marc Vilain for guidance from day
one of this project. Their juggling act between numerous projects, meetings, and
hours with me over the computer is unbelievable. I also want to thank Dr. Qian Hu for
helping me in every detail of the project and the thesis, and for being such a wonderful
friend. I have to thank Ben Wellner for his wonderful tutorials, which have made
impossible tasks simple. I want to extend my gratitude to all of the NLP Group; I have
found a helping hand whenever I needed it.
My dear friends: I couldn't have survived these five years if it were not for
your understanding and support.
Mom and Dad: everything I have successfully accomplished in life can be credited
to your absolute love and positive influence.
The woods are lovely, dark and deep,
But I have promises to keep,
And miles to go before I sleep,
And miles to go before I sleep.
-Robert Frost (1874-1963)
Contents
1 Organization of the Thesis ............................................................ P6
2 Introduction ............................................................................. P7
2.2 Building Chinese text processing system based on Alembic system ... P8
2.3 Three phases ......................................................................... P9
2.4 Previous studies ..................................................................... P10
2.5 Resource challenges ................................................................ P13
3 Background .............................................................................. P15
3.2 Alembic system ..................................................................... P15
3.3 Chinese processing system ........................................................ P18
3.4 Segmentation problem in Chinese ............................................... P21
3.5 Guideline for tagging documents ................................................ P25
3.6 Part of speech tagging problem in Chinese .................................... P26
3.7 Phrasing problem ................................................................... P29
3.8 EDT and RDC ....................................................................... P33
3.9 Resources: Treebank and PKU People's Daily data ......................... P34
4 Methods .................................................................................. P37
4.2 Sentence tagging .................................................................... P37
4.3 Segmentation ........................................................................ P38
4.3.1 MMS ........................................................................ P38
4.3.2 Personal Names ........................................................... P42
4.3.3 Heuristic Rule Based system ........................................... P44
4.3.4 Unigram frequency model ............................................... P46
4.3.5 PKU OTS software ....................................................... P47
4.3.6 Rule based Alembic segmentation module ........................... P49
4.3.7 LISP Part of Speech Tagging ........................................... P50
4.4 Phrasing system in Alembic and OTS software ............................... P52
4.4.1 Training .................................................................... P53
4.4.2 Testing and scoring and error tracking ................................ P54
4.4.3 Compiling the frequency data .......................................... P57
4.4.4 Generating wordlists to aid tagging sequences ....................... P59
4.4.5 Manually Generated rules ............................................... P60
5 Results ................................................................................... P62
5.2 Results from phase 1 ............................................................... P62
5.2.1 Sentence tagging .......................................................... P62
5.2.2 Segmentation .............................................................. P62
5.2.3 Part of speech tagging ................................................... P63
5.3 Results from phase 2 ............................................................... P66
5.3.1 Machine Learned Rules for NE Task .................................. P66
5.3.2 Improvements with Addition of Custom Wordlists
and Manually Generated Rules for NE Task ......................... P68
6 Discussion ............................................................................... P70
6.2 Separation of Phase 1 and Phase 2 .............................................. P70
6.3 Segmentation ........................................................................ P70
6.4 Post-process for OTS module .................................................... P71
7 Conclusion ............................................................................... P72
8 Bibliography, references, and resources ............................................ P73
8.2 Web based resources ............................................................... P73
8.3 Academic papers .................................................................... P73
8.4 MITRE reports ...................................................................... P75
8.5 Operating Manuals and others .................................................... P75
Appendix A ................................................................................. P77
A1 Sample sentence tagged output .................................................... P77
A2 Sample segmentation output from MMS ........................................ P78
A3 Sample segmentation output from three word model ......................... P79
A4 Sample segmentation output from Unigram model ............................ P81
A5 Sample segmentation output from PKU segmenter ............................ P82
Appendix B ................................................................................. P84
B1 Sample spec file ..................................................................... P84
B2 Sample learned rules file for phrasing ........................................... P84
B3 Manually Written Rules for Organization Tagging ........................... P87
B4 MMS Segmenter ..................................................................... P89
B5 Translating wordlists from Web based dictionary ............................. P95
Appendix C ................................................................................. P97
C1 Evaluation output of the trained phraser ........................................ P97
C2 Evaluation output of the manual phraser ........................................ P99
Appendix D ................................................................................. P100
D1 Surname list ......................................................................... P100
D2 Foreign phonetic translation characters ......................................... P101
D3 Organization and location suffixes ............................................... P102
Appendix E ................................................................................. P103
E1 University of Pennsylvania Chinese Treebank project ........................ P103
E2 Beijing University Institute of Computational Linguistics
People's Daily project .............................................................. P104
E3 MET 2 and BBN data ............................................................... P105
E4 Frequency data from PKU People's Daily project ............................. P107
E5 Frequency data from U. of Penn. Treebank project ........................... P108
E6 Sample organization part-of-speech frequency results from
People's Daily ....................................................................... P109
E7 Phrase Tagged People's Daily Data (using Manually generated rules) ... P109
E8 Correctly tagged People's Daily data ............................................ P110
1. Organization of Thesis
This thesis covers a broad range of topics. The introduction gives an overview of the
system that we plan to develop. The background section sets up the problems that we
face when building the system. This includes a more detailed description of the software
system, the guidelines for the task that we are trying to perform, and the unique problems
when dealing with Chinese. After the problems are presented, the methods section will
provide the solutions we have envisioned. It will include information on how we utilize
different resources, and different ways we can solve the same problem. The results
section will present a summary of outputs from the various processing schemes described
in the methods section. The discussion section that follows will take up some of the
unresolved issues in our system. Finally, we will conclude with a brief summary of our
findings.
2. Introduction
The accumulation of information in this electronic age is astonishing. Yet we have
few intelligent tools to help individuals manage this goliath and make critical
decisions. Natural language processing research communities are looking closely at this
problem and are attempting to build intelligent software that can "read" and
"understand" as a person does. This research is not limited to English, however. As
information in other languages multiplies at an even higher rate on the web, more
attention has been placed on building multi-lingual information-processing systems.
One such language of interest is Chinese. Used by a fifth of the world's population,
the Chinese language is playing an ever-expanding role on the World Wide Web for
information exchange. Therefore, Chinese language processing is of great interest to
government sponsors and to the research community. Past efforts at processing
Chinese have been undertaken by numerous research institutions and companies. The
primary foci of those efforts have been, separately, segmentation, part of speech
tagging, transliteration, and parallel corpus computing. That research delved deeply
into the finer points of the Chinese language and addressed many important linguistic
and computational problems. This project has a different goal. We aim to build a
complete end-to-end system that is capable of doing Entity Detection and Tracking, and
Relation Detection and Characterization, while utilizing the most advanced methods and
results from other independent Chinese NLP efforts. The project is named ACEC, which
stands for Automatic Content Extraction for Chinese.
In this introductory chapter, we address four general topics. First, we introduce
the Alembic processing environment. Second, we describe the three phases of the
project. Third, we briefly survey the existing studies and language resources for the
Chinese language. Finally, we discuss our resource limitations.
2.2 Building Chinese Text Processing System Based on Alembic System
This end-to-end Chinese processing system will be based on an English processing
system called Alembic.
"Alembic software project for English arose out of the congruence of a need and of a
technology. As information systems become increasingly ubiquitous in the worlds of
work, government, and even the home, so grows the need for improving the acuity of
information systems.
The principal value of Alembic system is that it provides fast and accurate models of
language to identify aspects of a text that are of general interest to information system
users. For instance, Alembic can identify proper names and classify them as to whether
they designate people, places, organizations, and others. In addition, Alembic can be
tailored to search for particular classes of events and to report them in a database format."1
The technical approach underlying Alembic is a new class of trainable language
processing systems, rule sequence processors. These systems can either be configured to
a new task by hand, or they can be configured automatically by providing a learning
procedure with a collection of examples. In the latter case, the learning procedure
automatically generalizes the examples into the same kind of linguistic model a human
engineer would create by hand. This allows for unprecedented ease of development, since
Alembic can be configured to an entirely new task by non-expert users who only need to
create the collection of examples. This collection can be readily and rapidly authored
with the Alembic Workbench2.
1 Description of the Alembic System Used for MUC-6, MITRE 97
2 Summarized from http://www.mitre.org/resources/centers/it/g063/alembic.html
In addition to trainability, the rule sequence processors underlying Alembic are
extremely fast, and confer on the system one of the highest processing throughput rates in
the literature.3
The technical benefits of Alembic stated above can be adapted to a multilingual
system. In fact, the rule-based design and training capability make it easy to adapt the
system to other languages quickly and at low cost. With minimal adjustment to the
existing modules and the addition of new ones, languages such as Spanish and Chinese
can be added to the system's language capability.
The end goal is to have a Chinese text processing system that can achieve similar
results as the English processing system. More details about Alembic processing system
and how it is modified will be presented in the background and the methods section.
2.3 Three Phases of the Project
There are roughly three phases to the overall ACEC project. The first phase is
exploration of text parsing and part-of-speech tagging on Chinese. The second phase is
exploration of named entity tagging, and nominal entity phrasing. The third phase is
Entity Detection and Tracking, and Relations Detection and Characterization. This thesis
primarily focuses on the first and second phases of this process.
The Alembic system is a highly adaptable, rule-based learning machine. But to
achieve high quality results, a preprocessing module needs to be added first to address
the fundamental differences between Chinese and English. First, Chinese does not have
overt word boundaries as English does. Secondly, Chinese does not use inflection to mark
verb tense or part of speech. Thirdly, some Chinese expressions have no direct
counterparts in English, making direct mapping of English rules to Chinese impossible.
These differences can be addressed through an initial module that performs three
functions: (1) segmentation, (2) parsing of sentences into meaningful word chunks, and (3)
3 Quoted from http://www.mitre.org/resources/centers/it/g063/alembic.html
importation of a new set of tags that can address differences in semantics. This initial
module can then adapt Chinese text to a familiar English-like format, and the rest of the
processing may be done with more familiar methods. More details about the different
segmentation and part-of-speech systems will be presented in the background and
methods sections.
The named entity tagging and nominal entity tagging performed in the second phase
of the project will use much of the pre-constructed machinery in the Alembic language
processing environment, namely the training capabilities for NE phrasing and the
flexible rules environment for constructing manual syntactic rules. We will also be able
to add wordlists and lexicons to this array of processing tools.
This thesis will not deal closely with EDT and RDC, which constitute the third phase
of the ACEC project. However, the design considerations for the earlier systems are
closely tied to the requirements of EDT and RDC. Such design considerations are
explicitly noted. The discussion section of this paper will also talk briefly about initial
explorations of EDT and RDC, but we do not have results to back up our theories or
assertions.
Phase I: Sentence parsing; Segmentation; Part of Speech tagging
Phase II: Named Entity Tagging; Lexicon resource generation; Template building
Phase III*: EDT and RDC; Inference
Table 1: Phase I and Phase II are the focus of this thesis project. Phase III is explored but will not be
part of the results.
2.4 Previous Studies
Chinese is such a commonly used language that prior study of it is extensive. We
have done extensive research into the areas relevant to this project. The sponsor's
guidelines explicitly encouraged the use of off-the-shelf (OTS) software whenever it is
available and meets the performance specifications. Thus, a big part of this project is
identifying such resources, evaluating their performance, integrating acceptable outside
resources into our system, and putting together the separate
pieces that are the state of the art in their respective fields. The most important criteria
for evaluating and accepting outside resources are the accuracy of segmentation and part
of speech tagging, and lexicon coverage. Named entity tagging has not been a major
focus of prior research for Chinese; thus the technology for the NE task stems from the
NE tagging capabilities of the Alembic English processing system.
A wealth of information exists on tokenizing Chinese: chunking an otherwise
continuous string of characters, with no white space in between, into meaningful units of
words. We have found good segmentation software from the University of New
Mexico, the University of Illinois, and some online sites4. These packages include Erik
Peterson's implementation of the Maximum Matching String algorithm5 and Chih-Hao
Tsai's implementation of a three-word look-ahead model6. There are also articles on
statistical methods7, heuristic rule-based methods8, Chinese name identification
methods9, foreign name identification methods10, and segmentation methods that use
part of speech information and the Viterbi algorithm11.
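To make the maximum matching idea concrete, the following is a minimal greedy longest-match sketch (the toy dictionary and the four-character window are illustrative only; the implementations cited above add Chinese and foreign name identification and other refinements):

```python
# Greedy maximum matching segmentation sketch: at each position take the
# longest dictionary word that matches, falling back to a single character.
# The dictionary below is a toy example, not a real segmentation wordlist.

def max_match(text, dictionary, max_len=4):
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            # accept a dictionary word, or a lone character as a last resort
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# The longest match wins: "北京大学" (Beijing University) is preferred over
# segmenting "北京" (Beijing) and "大学" (university) separately.
dictionary = {"北京", "大学", "北京大学", "生活"}
print(max_match("北京大学生活", dictionary))   # ['北京大学', '生活']
```

Greedy matching is fast but can be fooled by overlap ambiguities, which is one reason the heuristic and statistical methods above were also explored.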
The first phase of the project involves putting together the wealth of resources
already produced by others to complete the segmentation and part of speech tasks.
We attempt to emulate some of the methods described in the research articles and
adapt downloadable code into our system. The code that is reused directly is
segmentation code. Some of the important methods we explored are maximum
matching, three-word look-ahead, and the unigram frequency model. Adapted into this
code are important Chinese and foreign name identification models that included
4 See references for web based materials
5 http://www.MandarinTools.com
6 http://www.geocities.com/hao510/mmseg/ is the location of the code. Algorithm from the 1992 paper by Chen and Liu.
7 Richard Sproat and Nancy Chang, 1994
8 Chen and Liu, 1992 COLING
9 Large Corpus Based Methods for Chinese Personal Name Recognition, Journal of Chinese Information Processing, 6(3):7-15
10 Large Corpus Based Methods for Chinese Personal Name Recognition, Journal of Chinese Information Processing, 6(3):7-15
11 David D. Palmer and John D. Burger, Chinese Word Segmentation and Information Retrieval, 1997
valuable comprehensive wordlists of Chinese surnames and the like. These wordlists
were built from past research efforts and improved during the current pass.
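As an illustration of how a unigram frequency model chooses among competing segmentations, here is a small dynamic-programming sketch that picks the segmentation maximizing the product of word frequencies (the frequency counts below are invented for illustration; the real model derives them from corpus data):

```python
# Unigram frequency segmentation sketch: among all segmentations allowed by
# the wordlist, choose the one maximizing the product of word frequencies
# (computed as a sum of log probabilities). Counts below are invented.
import math

def unigram_segment(text, freq, max_len=4):
    total = sum(freq.values())
    best = {0: (0.0, [])}              # best[i] = (log score, words) for text[:i]
    for i in range(1, len(text) + 1):
        candidates = []
        for j in range(max(0, i - max_len), i):
            word = text[j:i]
            if j in best and (word in freq or i - j == 1):
                count = freq.get(word, 0.5)    # small floor for unseen characters
                score = best[j][0] + math.log(count / total)
                candidates.append((score, best[j][1] + [word]))
        if candidates:
            best[i] = max(candidates)
    return best[len(text)][1]

# "研究生命" can split as 研究生/命 (graduate student / fate) or
# 研究/生命 (research / life); the frequencies decide.
freq = {"研究": 50, "生命": 20, "研究生": 10, "命": 5}
print(unigram_segment("研究生命", freq))   # ['研究', '生命']
```

The same dynamic program extends naturally to higher-order n-gram models and to the Viterbi-based methods cited above.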
Aside from the segmentation resources we collected or developed, there is also hand-annotated
data critical to this whole project. As stressed earlier, one of the chief benefits
of the Alembic system is its learning capability. Machine training requires a large amount
of manually corrected data. Due to limited time and human resources, manually tagging
our own data is impossible. The next best thing is to use data from outside sources. Some
of the data used in this project come from the University of Pennsylvania Treebank
project12 and Beijing University ICL's People's Daily data13. Their segmentation
guidelines are not quite the same, nor are their part of speech tag sets14. More details
about these two corpora and some other segmented data sets will be provided in the
background section. We have also attached sample files from these important corpora in
the Appendix, along with tables that define the different tag sets.
MITRE has also previously studied processing the Chinese language15. This was done
without the support of staff with a Chinese language background; thus, all
performance levels were achieved by machine learning. Resources developed in
conjunction with that previous effort include derived wordlists of national names,
geographical locations, foreign names, and world capitals16. Those wordlists are of
limited use, because editing was not possible due to the lack of personnel with Chinese
knowledge. During this second initiative at processing Chinese, those resources were
cleaned up and reorganized. In addition, English wordlists were used to build a parallel
corpus in Chinese with web mining software developed for this project.
A large number of unnamed resources went into this project. They are mentioned
either in the thesis or listed in the bibliography. But even with careful
accounting, there are going to be missing pieces. We will attempt to do a thorough job of
12 Appendix E1
13 Appendix E2
14 Table 4
15 David D. Palmer and John D. Burger, Chinese Word Segmentation and Information Retrieval, 1997
16 Figure 11
explaining the most important parts of our system and how each of those parts plays an
irreplaceable role. However, we apologize in advance for anything missing or
misrepresented.
2.5 Resource Challenges
Finally, we would like to highlight some of the limitations of this project before we
finish the introduction and go into further detail. The first limitation is time. This project
was initiated in February 2001; this thesis is being written in May 2001. In
three months, we completed initial exploration of phase I and phase II of the end-to-end
system, and are delving into phase III. The results are good, but are still far from the
numbers of the parallel English system. Better results are almost guaranteed with better
wordlists, more training, and more detailed rules. But the hours needed for those
activities simply do not exist.
Secondly, funding for this project has not been at 100%. Some obviously
superior resources have not materialized due to stalls in licensing. We
can only complete analysis on parts where we can generate output, and simulate the input
into the later modules that require the missing link. This is unfortunate because the
numbers are bound to change when real modules come to replace the imaginary ones.
We will take that into account when writing the results and provide further projections
in the discussion section.
Lastly, we are working with limited linguistic resources. In a perfect world, we
would hand-generate training data with the best part-of-speech tagging and segmentation.
This would optimize the capabilities of the named entity detection rules and provide the
best foundation for future EDT and RDC. But this is a time-consuming and frustrating
process. Thus, a compromise is reached by using less-than-perfect resources from other
institutions and piecing them together into a patchwork of training data and derived
wordlists.
Even with the limitations mentioned above, our exploration of the first and second
phases of building a Chinese text processing system has been thorough and detailed.
This report will be useful to others carrying out the rest of the ACE Chinese project,
both as a reference to all the available resources and as a general guideline.
3. Background
The background section defines the problems we face in building a Chinese
text processing system. The first section of the background describes in more detail
the Alembic system mentioned in the introduction. The second section follows with the
proposed design for the Chinese processing system. The third section defines the
Chinese segmentation problem in more detail and highlights the different guidelines
for this problem. The fourth section talks about the tagging structure used. The fifth
section describes the part-of-speech tagging problem and its ambiguities; we will also
provide some examples of how Chinese differs from English, along with an important
table of different part of speech tag sets. Closely related to part-of-speech tagging is
the phrasing problem, described in the sixth section; there we will also provide
guidelines for named entity tagging. The seventh section provides guidelines for EDT
and RDC as defined by the ACE groups, and details some of the processing steps needed
to do EDT and RDC in the English system. In the final section, more will be said about
two of the important annotated corpora: the Penn Treebank and PKU People's Daily.
3.2 Alembic Processing System17
The Alembic system divides roughly into three general sections: a C and UNIX based
pre-processor, also called pre-alembic; a LISP-based phraser and syntactic analyzer;
and a LISP-based inference system. But as we shall see, the boundaries are not as clear
once we enter the domain of the LISP system, as data flows and accumulates across
all levels and cannot be modeled as a straight series of one-input, one-output
modules.
For the English language, the pre-processor, also called pre-alembic, is a
patchwork of code that dates back to the earliest text processing efforts at MITRE. The
preprocessor divides into two units. The first stop in pre-alembic is called zoning. Often the
17 Description of the Alembic System Used for MUC-6, 1997
input into language processing software is not clean: it comes from various
sources, with various tags and errors. Zoning is the process that strips the
document of symbols that might cause errors. The document is then passed to the
sentence/lexicon detection module, which identifies sentence boundaries and
places <LEX></LEX> tags around words and end-of-sentence punctuation. The LEX
tags include a code for part of speech information about each word, such as "noun",
"verb", or "adjective". The output of the pre-alembic system is a clean, marked-up
document, ready to be further processed by the LISP system.
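The shape of that marked-up output can be sketched as follows (the pos attribute name and tag values here are illustrative; the actual pre-alembic markup conventions may differ):

```python
# Sketch of the pre-processor's final step: wrap each detected token in
# <LEX> tags carrying part of speech information. Attribute and tag names
# are illustrative, not the exact pre-alembic conventions.

def mark_up(tagged_tokens):
    """tagged_tokens: list of (word, pos) pairs for one detected sentence."""
    return " ".join(f'<LEX pos="{pos}">{word}</LEX>' for word, pos in tagged_tokens)

sentence = [("The", "determiner"), ("report", "noun"), ("arrived", "verb"), (".", "punct")]
print(mark_up(sentence))
```

The LISP phraser then operates on this markup rather than on raw text, which is what makes the downstream rule modules language-independent to a first approximation.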
The building blocks of the LISP processing environment are lists of rules. Each rule
implicitly encodes a simple linguistic property. For example, we might have a rule that
says: "if a noun follows an adjective, that adjective modifies the noun." This rule
might apply 90% of the time; for the 10% of exceptions, other simple, straightforward
rules will attempt to patch up the errors. The list of rules for each module is applied
serially to perform the tagging required of that list, and every rule is applied only
once. A complete phrasing module will typically contain one to two hundred rules. The
tagging specified by the most recently applied rule overwrites whatever tagging has been
done before, so the general trend in writing rules is to go from broad to narrow. The
benefits of this rule-based system include speed, flexibility, and machine learning
capabilities: because each rule is almost like a simple observation, a computer with a
large training corpus can generate and re-arrange the rules until it achieves very high
scores on that corpus.
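The serial, overwrite-as-you-go rule application described above can be sketched as follows (the rule format here is invented for illustration and is not Alembic's actual syntax):

```python
# Minimal sketch of serial rule application: each rule fires once over the
# whole token list, and later (narrower) rules overwrite the tags assigned
# by earlier (broader) ones.
def apply_rules(tokens, rules):
    """tokens: list of [word, tag] pairs; rules: (condition, new_tag) pairs."""
    for condition, new_tag in rules:          # rules run serially, once each
        for i, (word, tag) in enumerate(tokens):
            if condition(tokens, i):
                tokens[i][1] = new_tag        # most recently applied rule wins
    return tokens

rules = [
    (lambda toks, i: True, "n"),                         # broad: default everything to noun
    (lambda toks, i: toks[i][0].endswith("ly"), "adv"),  # narrower: override for -ly words
]
print(apply_rules([["run", "?"], ["quickly", "?"]], rules))  # [['run', 'n'], ['quickly', 'adv']]
```

Writing the broad rule first and the narrow rule second is what makes the overwrite ordering work.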
The modules for the LISP phraser start with the named entity tagger. This system is
responsible for tagging all personal names, locations, organizations, geo-social-political
groups (GSP), and governments. The tagged data is then passed to the phraser module,
which is responsible for expanding the nominal entities into nominal phrases. For
example, the phrase "Senator Kerry of Massachusetts" would first be tagged as
"<post>Senator</post> <name>Kerry</name> of <location>Massachusetts</location>" by the
named entity tagger; this sequence would then be expanded into
"<person><post>Senator</post> <name>Kerry</name> of <location>Massachusetts
</location></person>" by the phraser module. More details about phrasing will be
provided in a later section of this chapter.

Figure 1 Schematic of the Alembic pipeline: input (webpages, newswires, etc.) passes
through Pre-Alembic (zoning/cleanup, then sentence/lexicon detection, backed by the
LEXICON) into the LISP-based inference system (part-of-speech tagging, named entity
tagging with wordlists, chunking/phrase finding, grammatical relations, and coreference),
which feeds EDT, RDC, and DSE. The proposed Chinese processing module combines
segmentation, named entity tagging, part-of-speech tagging, and phrasing.
The tagged data is then passed on to the grammatical relations module. As the name
implies, this module is responsible for establishing the relationships between the
nominal and verbal phrases.

The final part of the LISP processing system is the coreference module. After the
grammatical relationships have been established in each sentence, this module attempts
to connect coreferring nominal phrases.

All of the main LISP modules feed into a temporary data bank similar to a template. As
more information is gathered, more tokens and ties are added to the data lists.
Higher-level processing, such as Entity Detection and Tracking and summarization, will
use this database to complete the job.
3.3 Chinese Processing System
A different schematic than the one for the English system is needed for processing
Chinese. The problems unique to Chinese that cannot be resolved using the existing
framework are segmentation and part-of-speech tagging. More details about these problems
are provided in sections 3.4 and 3.5. In this section, we discuss some possible system
solutions and introduce the resources that we are exploring. To summarize our needs: a
component must be created that takes the input from the zoning module and outputs
segmented, part-of-speech tagged, and sentence-tagged text. This data is then passed on
to the LISP phraser and inference modules built into Alembic, and the rest is similar to
the English system. This data flow is illustrated in figure 2 and in the example below.
Input: [Chinese sentence] (Our nation is currently moving from subsistence farming to
commercial farming)
Output: the same sentence segmented into words, each wrapped in a part-of-speech tag,
e.g. <S><t>...</t><n>...</n><d>...</d><ppos>...</ppos><vn>...</vn><n>...</n><v>...</v><n>...</n><w>...</w></S>
Figure 2 Data flow for the Chinese system: output of the zoning module passes through a
black box that performs (I) sentence tagging, (II) segmentation, and (III) part-of-speech
tagging, before reaching the phrasing, inference, and other modules.
In this input/output box, we can treat segmentation and part-of-speech tagging
separately, or we can treat them together. The argument for treating them separately is
increased processing speed and ease of development. On the other hand, part-of-speech
information may help resolve segmentation ambiguities. The ideal solution would be to
build and test both systems, but there are resource limitations: to build a combined
system from scratch, we would need a very good part-of-speech dictionary, which is
unavailable. So by default, the systems developed and tested in-house all treat
part-of-speech tagging and segmentation separately. The first stage is segmentation;
the output from the segmenter is then fed into a LISP-based part-of-speech tagger using
a very limited 50,000-word part-of-speech lexicon.
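The second stage described above, assigning tags to already-segmented words by lexicon lookup, might be sketched as follows (the three-entry lexicon is a tiny stand-in for the 50,000-word lexicon, and the default-to-noun fallback is an illustrative assumption):

```python
# Toy sketch of lexicon-based part-of-speech tagging over segmented words.
# POS codes follow the PKU-style abbreviations used later in this chapter.
POS_LEXICON = {"中国": "ns", "访问": "v", "经济": "n"}

def tag_segments(segments, default="n"):
    """Look each segment up in the lexicon; fall back to a default tag."""
    return [(seg, POS_LEXICON.get(seg, default)) for seg in segments]

print(tag_segments(["中国", "经济", "发展"]))  # [('中国', 'ns'), ('经济', 'n'), ('发展', 'n')]
```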
Segmentation of Chinese sentences is generally treated in one of two ways: there is the
purely statistical approach, and there is the lexicon approach18. Past research has
generally shown that the lexical approach is more effective and requires less processing
power. Thus, all segmentation paths that we are exploring fall under the lexical
approach category. Chinese is read from left to right, and the majority of the segments
can be determined by a technique called maximum matching string (MMS)19. Existing code
for this method, developed by Peter Erikson, can be downloaded from
www.MandarinTools.com. This is our baseline. Other techniques that go beyond this simple
algorithm are maximum matching string with 3-word look-ahead (MMS3), developed by Tsai20
based on Chen and Liu's 1992 paper21, and a 3-word look-ahead model modified from the
MMS model that takes advantage of frequency data. The following figure illustrates the
data flow.
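A minimal sketch of the baseline greedy MMS algorithm: scan left to right, always taking the longest match in the lexicon (the four-entry lexicon here is purely illustrative, not the dictionary used in this project):

```python
# Greedy maximum matching string (MMS) segmentation sketch.
def mms_segment(text, lexicon, max_word_len=4):
    """Scan left to right, always taking the longest lexicon match;
    fall back to a single character when nothing matches."""
    segments = []
    i = 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                segments.append(candidate)
                i += length
                break
    return segments

lexicon = {"中国", "北京", "大学", "北京大学"}
print(mms_segment("中国北京大学", lexicon))  # ['中国', '北京大学']
```

Note how the greedy longest-match choice picks 北京大学 over 北京 + 大学; MMS3's 3-word look-ahead exists precisely to catch the cases where this greedy choice is wrong.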
18 Richard Sproat and Nancy Chang, 1994
19 http://www.MandarinTools.com
20 http://www.geocities.com/hao510/mmseg/
21 Chen and Liu, 1992
Figure 3 The three different modules that can perform segmentation (MMS in Perl, MMS3
in C, or a unigram frequency model in Perl, with supporting code in C and Perl),
combined with a LISP-based sentence segmenter and a LISP-based part-of-speech tagging
module driven by POS wordlists.
The alternative to treating segmentation and part-of-speech tagging separately is to
build them into a single module. This method is effectively exploited by off-the-shelf
(OTS) software from the Beijing University Institute of Computational Linguistics (PKU
ICL)22. The output of this system is not in a format that we can use in the Alembic
processing environment, so post-processing tools are added to the data path to expand
the tags and create the desired data format.
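As an illustration of such a post-processing step, the sketch below assumes the OTS software emits slash-delimited word/tag tokens (an assumption about its output format, not a documented fact) and rewrites them into the <LEX POS="..."> markup the Alembic pipeline expects:

```python
# Hypothetical post-processing sketch: convert slash-delimited "word/tag"
# output (assumed format) into <LEX POS="..."> markup wrapped in <S></S>.
def pku_to_lex(tagged_sentence):
    lex_tokens = []
    for token in tagged_sentence.split():
        word, _, pos = token.rpartition("/")   # split on the LAST slash
        lex_tokens.append(f'<LEX POS="{pos}">{word}</LEX>')
    return "<S>" + "".join(lex_tokens) + "</S>"

print(pku_to_lex("迈向/v 希望/n"))
# <S><LEX POS="v">迈向</LEX><LEX POS="n">希望</LEX></S>
```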
Figure 4 Installing the Beijing University segmentation and POS software into Alembic,
followed by sentence tagging, conversion, and correction stages.
With the correct segmentation and part-of-speech information, the data is then fed into
the Alembic rule-based modules. The first pass is the named entity tagger, just like in
the English system. The data is then passed on to the phraser, which expands the named
entity tags into nominal phrases. All the while, information is fed into the template.
The exploration in this project is limited to finding the optimal solution to
segmentation, part-of-speech tagging, and the named entity task.
22 http://www.icl.pku.edu.cn/nlp-tools/segtagtest.htm
3.4 Segmentation Problem in Chinese
Before discussing processing techniques for Chinese, we first need to establish the
basics of Chinese grammar and the notion of a word. While Chinese has a rich tradition
of phonology, grammar studies are fairly recent on the historical scale. The original
studies of Chinese grammar were done as a result of Western language influences. The
concept of a word is intuitive to the native speaker, but an effort to establish a basic
guideline for singling out a word and giving it a grammatical role was not undertaken
until the late 19th century, after the introduction of Western linguistic concepts. So
the grammatical concepts of Chinese are not completely alien to those of the Romance
languages. Yet there are a great many instances of ambiguity when we attempt to dissect
the Chinese language using the basic linguistic rules provided for English. Thus arose
the need to import a different set of guidelines for handling Chinese.
The first difficulty in analyzing Chinese is segmentation and the definition of a word.
The most basic unit of the Chinese language is the morpheme23. A morpheme is one Chinese
character. On a computer, a Chinese morpheme can be expressed in the GB and BIG5
encodings, where each character is made up of two bytes; in Unicode, a character may be
encoded in 16 or 32 bits, depending on the version of Unicode. There are bound morphemes
and free morphemes. Free morphemes are characters that retain an independent meaning and
can appear on their own. Bound morphemes are those that do not appear independently. An
example of a free morpheme is 狗 (gou - dog). While this character often appears on its
own, it can also enter into polysyllabic combinations with other morphemes to form
words, such as 母狗 (mugou - bitch) or 狗熊 (gouxiong - bear). Absolutely bound
morphemes are much rarer in Chinese. An example of a bound morpheme is 蒺藜 (jili -
caltrop, a type of plant), where the first character 蒺 (ji) has no meaning independent
of the second character. A sentence in Chinese is formed of a string of morphemes
without spaces to separate them. The reader supplies
23 Jerry Norman, Chinese, 1998
intuitive boundaries for the words and constructs them into logical sequences. Classical
Chinese literature mainly used the single morpheme as the basic word unit, as in the
writings of the traditional courtiers of imperial China. But around the beginning of the
Republican era, Chinese scholars and writers began to adopt baihua, a written form of
the spoken language. Baihua contained many more polysyllabic words. A slight variation
on this historical language development survives today as the official language of
China, called Mandarin. Mandarin possesses many polysyllabic words, yet the written form
of Mandarin follows the tradition of no space boundaries between words24. This poses a
special problem for linguistic computation, which bases much of its algorithms on the
fundamental unit of a word.
The most encompassing law for segmentation is that words are "strings of characters
which are the smallest unit of meaningful sequences that plays the same role in or out
of context". We will attempt to apply this rule wherever we can to the problem of
segmenting Chinese, but there are many cases where the rule is applied with a lot of
ambiguity. For instance, the sentence 中国联想公司的副总经理访问浦东特区 (zhong guo
lian xiang gong si de fu zong jing li fang wen pu dong te qu - the deputy general
manager of China's Lianxiang company visits the Pudong special economic region) contains
副总经理 (fu zong jing li - deputy general manager) and 特区 (te qu - special economic
region). If we apply the strictest sense of the smallest-unit-of-meaning rule, we will
get 副|总经理 (deputy | general manager) and 特|区 (special | region). Yet most
linguists would agree with the first case of segmentation, but would disagree with the
second case of segmentation. This is because of a second law, which states: "bind
morphemes together when, if treated independently, their meanings are highly adaptable".
In this example, 特区 (te qu) is treated as one word even though both characters are
free morphemes, because 区 (qu) is almost always preceded by some kind of modifier and
特 (te) easily forms polysyllabic words. In other words, both morphemes are very
versatile25. But this introduces a lot of confusion, because the versatility of a
morpheme is a highly subjective topic, and different linguists will have different ideas
of how to segment the same sentence. Often, two annotators will have 10% disagreement
over the same corpus.

24 Jerry Norman, Chinese, 1998
For the purposes of our project, we have not adopted any specific guideline. The rule of
thumb remains "smallest unit of meaning". In any case, the specifics of segmentation
have more to do with the ease of information extraction in the downstream processing
modules. The word unit has to be small enough that (1) accurate part-of-speech
information can be assigned and (2) we will not have to take words apart in later
modules to do EDT and RDC. An example would be comparing 副校长 (fuxiaozhang - vice
chancellor) with 校长 (xiaozhang - chancellor). We would most definitely want 副校长
(vice chancellor) segmented as 副|校长 (vice | chancellor), because it would then be
easier to establish the relationship between vice chancellor and chancellor in
downstream processing, instead of treating them as completely independent positions.
The next immediate problem after determining what constitutes a word segment is how to
actually extract this information from a sentence. A native reader of Chinese determines
the correct segmentation from meaning and context.
The corpora that we will primarily be working with are official Chinese newswires and
certain legal documents. These texts are written in putonghua, or "common language".
Aside from some special articles and summary reports, the language used in these
articles follows a journalistic tone, giving us ample opportunity to come up with common
rules. In addition, the polysyllabic words are easily found in the dictionary.
Two sources of journalistic text are explored. One is the Treebank project annotation of
Xinhua News Agency articles, and the other is the PKU ICL annotation of People's Daily
articles26. The frequency of word lengths is presented in the following table.

25 Jerry Norman, Chinese, 1998
              Xinhua News Agency     People's Daily
Total words   82,148                 907,017
1 morpheme    27,187   (33.1%)       337,724  (37.2%)
2 morphemes   47,216   (57.5%)       492,879  (54.3%)
3 morphemes    6,330    (7.7%)        51,116   (5.6%)
4 morphemes      857    (1.0%)        19,375   (2.1%)
5 morphemes      468    (0.6%)         5,059   (0.6%)

Table 2 Word length distribution of Xinhua News Agency and People's Daily text:
comparison of segmentation results from the U. of Penn. Treebank project and the PKU
ICL annotation.
26 See Appendix E1, E2
3.5 Guideline for Tagging Documents
Before introducing the part-of-speech tagging and phrasing problems, we will introduce
the SGML tagging guidelines. To start, every article needs to have a correct set of HTML
tags: every processing module looks for <DOC> and <TXT> tags to begin processing.
Instead of putting everything into words, the table below illustrates all the tagging
needs.
Module output: Newswire data
Tags: (dirty) HTML, missing or mis-labeled tags
Example: <DOC> <TEXT> p<> ... </DOB>

Module output: Zoning
Tags: (clean) HTML
Example: <DOC> <TEXT> <P>[sentence]</P> </TEXT> </DOC>

Module output: Sentence segmentation
Tags: <S></S>
Example: <DOC> <TEXT> <P><S>[sentence]</S></P> </TEXT> </DOC>

Module output: Segmentation and P.O.S. tagging
Tags: <LEX POS="v"></LEX>
Example: <DOC> <TEXT> <P><S><LEX POS="nr">[name]</LEX><LEX POS="v">[verb]</LEX><LEX POS="nt">[organization]</LEX><LEX POS="w">.</LEX></S></P> </TEXT> </DOC>

Module output: Named Entity Task
Tags: <ENAMEX TYPE="ORG"></ENAMEX>
Example: <DOC> <TEXT> <P><S><ENAMEX TYPE="PERSON"><LEX POS="nr">[name]</LEX></ENAMEX><LEX POS="v">[verb]</LEX><ENAMEX TYPE="LOCATION"><LEX POS="nt">[location]</LEX></ENAMEX><LEX POS="w">.</LEX></S></P> </TEXT> </DOC>

Table 3
3.6 Part of Speech Tagging Problem for Chinese
Grammar is a direct result of the identification of the word. Sequences of words placed
together in a coherent and consistent way must be linked through the concept of part of
speech, because each word either plays a role in gluing the other words together or
takes an active part in the meaning of the sentence. This concept is as Western to the
Chinese language as the concept of a word, but it too has been adapted to fit the
framework of the Chinese language. In today's linguistic research community, the major
word classes (parts of speech) enjoy a fairly high degree of agreement. The scheme
outlined below is close to that found in several widely used grammars.
There are six full word categories: nouns, verbs, adjectives, numerals, measures, and
pronouns27. A noun is something that can be directly counted with a numeral: 三辆车 (san
liang che - three cars). A verb is a word that can be negated by 不 (bu - not): 不去 (bu
qu - not going). Adjectives are generally considered to be relatives of the verb; they
can also be negated by 不 (bu - not) and can function independently as predicates.
Numerals include the classes of Chinese numbers and estimations of quantities, in
addition to Roman numerals. Measures are words that indicate the unit for each noun. In
the previous noun example, 三辆车 (san liang che - three cars) is actually formed of 三
(san - three), 辆 (liang - measure word for vehicles), and 车 (che - car). Measures have
a similar function to "sheets" in "three sheets of paper" in English, but they are much
more prevalent. Pronouns are substitutes for persons or things. Syntactically they
generally behave like nouns, but they do not admit modification. The third person
pronoun is gender-specific in certain writing and unspecific in others: 他, 她, 它 (ta -
he, she, it). Other pronouns, such as demonstrative pronouns, include 这 (zhe - this)
and 那 (na - that).
27 Jerry Norman, Chinese, 1998
Other words that are commonly included with the pronouns are 什么 (shenme - what), 自己
(ziji - self), and 每 (mei - every).
The functional word classes are adverbs, prepositions, conjunctions, particles,
interjections, and onomatopoeia28. Adverbs modify verbs and adjectives, and are usually
bound. Chinese prepositions originate from verbs and are used to show various
grammatical relationships. Conjunctions form different kinds of connections: 和 (he -
and), 而且 (erqie - moreover), 因为 (yinwei - because), 所以 (suoyi - therefore).
Particles are generally monosyllabic words used to show different grammatical
relationships or subjective and modal overtones: 的 (de), 把 (ba). Interjections
function as asyntactic forms used to express warnings; phonologically they contain
irregularities, such as elements that are not part of the regular phonemic inventory.
Onomatopoeia are words which mimic the sounds of things; syntactically they behave much
like adverbs or nouns.
As in English, there is considerable overlap between the different classes. One word or
morpheme may belong to several categories. An example is 批评 (piping - criticize),
which in this case is a verb, but we can use the same 批评 as a noun meaning criticism.
Below are two sentences that use the same word with different parts of speech:

[Chinese sentence] (We should criticize this kind of behavior)
[Chinese sentence] (Your criticism is justified)
This kind of overlap is much more prominent in classical Chinese literature, but in
today's written language it is no more prevalent than in English. Thinking back to the
segmentation and part-of-speech tagging diagrams, we will have to build post-processing
modules for any imported software to make the necessary corrections. In the example
above, we can write a rule stating that any segment tagged as a verb has its tag changed
to noun if it comes after the character 的 (de).

28 Jerry Norman, Chinese, 1998
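That correction rule could be sketched as follows (an illustrative stand-in for the system's actual rule syntax, operating on (word, tag) pairs):

```python
# Illustrative post-processing rule: retag a verb as a noun when it
# immediately follows the particle 的 (de).
def retag_verb_after_de(tokens):
    """tokens is a list of (word, pos) pairs."""
    fixed = []
    for i, (word, pos) in enumerate(tokens):
        if pos == "v" and i > 0 and tokens[i - 1][0] == "的":
            pos = "n"  # a verb following 的 is acting as a noun
        fixed.append((word, pos))
    return fixed

# 批评 (criticism) mis-tagged as a verb gets corrected; 是 keeps its verb tag.
tokens = [("你", "r"), ("的", "u"), ("批评", "v"), ("是", "v"), ("对", "a")]
print(retag_verb_after_de(tokens))
```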
This is just a brief summary of the possible part-of-speech designations in Chinese.
There are almost always subdivisions, and new categories are created by different
linguistic camps. The two main groups we have looked at are the University of
Pennsylvania's Chinese studies department and Beijing University's ICL. The University
of Pennsylvania developed their part-of-speech tag set based on the requirements of the
Chinese Treebank project, an effort to completely parse Chinese sentences. PKU ICL
developed their part-of-speech tag set mainly to aid computational linguistics. The tags
from Beijing University are presented below. These are more important to our system
because the majority of the training was done on corpora with these POS tags.
PKU POS   Definition
na        Noun used as adjective
Mg        Numeral used as a noun
vd        Verb used as adverb
nr        Personal name
a         Adjective
ns        Nominal location
nt        Nominal organization
b         Negative
c         Conjunction
d         Adverb
vn        Verb that can also be a noun
e         Another form of interjection
ad        Adverbial adjective
f         Location
h         Head
i         Idiom
l         Transient word
n         Noun
an        Adjective that can also be a noun
o         Onomatopoeia
p         Prepositional
q         Quantity measure
r         Pronoun
s         Space
t         Time
u         Auxiliary
v         Verb
y         Tone

Figure 5 A partial list of PKU POS tags.
3.7 Phrasing Problem29
Traditionally, the phrasing problem for Chinese is viewed as two separate tasks. The
first pass is the Named Entity Detection task as defined under the Multilingual Entity
Task guidelines. The second pass is Entity Detection as defined under the EDT and RDC
guidelines. The difference between the two tasks is coverage. The NE task has a narrower
focus, aimed at identifying unambiguous entities in any corpus while ignoring unspecific
mentions. The EDT and RDC task guidelines have a broader focus: the end goal of the EDT
and RDC entity task is to find specific mentions (as specified in the NE task) and
expand their association to unspecific mentions. So the EDT and RDC phrasing guidelines
cover specific mentions as well as some reference mentions. There are two different
scoring processes for the NE task and EDT. Both take machine-generated phrase tags,
compare them to hand-annotated phrase tags, and calculate a percentage called the
f-measure (more about the f-measure is presented in the methods section, 4.5). The
hand-annotated phrases need to abide by the different task guidelines depending on which
task we are trying to accomplish.

29 EDT Metonymy Annotation Guidelines Version 2.1 (2001), Multilingual Entity Task (MET)
Definition for Chinese (1996)
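For concreteness, the balanced f-measure is the harmonic mean of precision and recall; the counts below are from a hypothetical tagging run, not a result from this project:

```python
# Standard balanced f-measure over phrase-tag counts (details in section 4.5).
def f_measure(correct, guessed, actual):
    precision = correct / guessed   # fraction of produced tags that are right
    recall = correct / actual       # fraction of true tags that were found
    return 2 * precision * recall / (precision + recall)

# Hypothetical run: 100 tags produced, 80 correct, 90 tags in the answer key.
print(f_measure(correct=80, guessed=100, actual=90))  # ≈ 0.842
```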
Under the sponsor's contract, our end goal is EDT and RDC; the traditional NE task is of
less importance to the whole process. So our first phrasing pass is a hybrid of the
traditional NE task and the complete EDT named entity detection task. In this design, a
more inclusive second pass will make corrections on the first pass and complete the EDT
named entity phrasing task. This hybrid NE task is similar to the traditional NE task in
that it only allows a single NE tag around any phrase. The markup is also based on the
SGML tag set used for NE. The traditional NE task can be broken down into three
subtasks, responsible for capturing entity names, temporal expressions, and numeral
expressions, respectively. Entity names are mentions of organizations, persons, and
locations. We will present some of the entity tagging in more detail in the following
chapters, but first we will take a closer look at the simpler tasks of marking up
temporal and numeral expressions.
Temporal expressions cover "absolute" and "relative" time expressions30. In the 6th
Message Understanding Conference, the NE tagging guidelines for English only specified
the tagging of absolute time expressions. This standard is relaxed and expanded for
Chinese to include relative time expressions31. Absolute time expressions include any
specific mentions of time, date, and duration that are independent of the document.
Relative time expressions include all other specific mentions of time, such as
"yesterday", "last month", or "in a few hours". Just as in English, time expressions in
Chinese have exclusive units that can be anchored to achieve high performance relatively
easily32. The tagging process is presented briefly in the methods section, but it will
not be part of the results, because detailed analysis of this relatively simple task is
not needed. Numeral expressions are mentions of any monetary amount or percentage.
Similar to the temporal expression task, there are clear anchors to be used when dealing
with these phrases. Rules used to tag these expressions are dealt with briefly in the
methods section.

30 Multilingual Entity Task (MET) Definition for Chinese, 1996
31 Multilingual Entity Task (MET) Definition for Chinese, 1996
32 http://www.MandarinTools.com
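A toy sketch of anchoring on exclusive time units such as 年 (year), 月 (month), and 日 (day); the patterns and the TIMEX markup are illustrative assumptions, not the rules actually used in the system:

```python
# Illustrative anchoring sketch: a number run followed by an exclusive time
# unit (年 / 月 / 日) is wrapped in a time-expression tag.
import re

DIGITS = "0-9一二三四五六七八九十〇"
TIME_PATTERN = re.compile(rf"[{DIGITS}]+年|[{DIGITS}]+月|[{DIGITS}]+日")

def tag_times(sentence):
    return TIME_PATTERN.sub(lambda m: f"<TIMEX>{m.group(0)}</TIMEX>", sentence)

print(tag_times("一九九七年七月一日香港回归"))
# <TIMEX>一九九七年</TIMEX><TIMEX>七月</TIMEX><TIMEX>一日</TIMEX>香港回归
```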
The more difficult task is the entity task. The uniqueness of Chinese means that many of
the concepts used to handle the English entity detection task cannot be applied; Chinese
entity types have to be examined independently of their English counterparts. The three
subcategories for entities in Chinese are persons, locations, and organizations. The
hybrid technique that we have adopted for the first phrasing pass is also responsible
for tagging some of a fourth category, namely Geo-Political Entity (referred to as GPE
from now on). We will examine the problems with each of these categories separately.

As the reader knows, Chinese does not have "capitalized" characters, so personal names
cannot be picked out like their English counterparts. On top of that, this is also a
problem for segmentation, because Chinese sentences contain no spatial separation.
Political entities and locations might be resolved at the segmentation level with a good
inclusive list, but personal names are a completely different story. There is an
unlimited number of possible personal names, including all Chinese names as well as
translations of English names. So the recognition of Chinese and English personal names
has to be solved at the segmentation level. There are rules for forming a Chinese name,
though, and a detailed description of how these rules are exploited is provided in the
methods section. To give a light overview: the method takes advantage of the fact that
Chinese surnames come from a finite list, and a given name is usually no more than two
characters long.
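The surname-anchored heuristic just described might look like this in outline (the surname list is a tiny illustrative subset of the finite list, and the input is assumed to be already character-split):

```python
# Sketch of the surname heuristic: a known surname followed by a one- or
# two-character given name is guessed to be a personal name.
SURNAMES = {"王", "李", "张", "刘", "陈"}  # tiny illustrative subset

def find_chinese_names(tokens):
    names = []
    for i, tok in enumerate(tokens):
        if tok in SURNAMES and i + 1 < len(tokens) and 1 <= len(tokens[i + 1]) <= 2:
            names.append(tok + tokens[i + 1])
    return names

print(find_chinese_names(["张", "伟", "访问", "上海"]))  # ['张伟']
```

A real system needs more context than this, since surname characters also occur as ordinary words, which is why the thesis resolves names at the segmentation level rather than in isolation.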
This method does not work for foreign names. Fortunately, foreign names are usually
sound translations that derive their phonetics from a finite set of Chinese characters,
so we have methods that work with this characteristic to tag foreign names. In summary, a
majority of personal names will be resolved at the segmentation level, before the LISP
system. The tags assigned to names at the segmentation level will be LEX tags whose
"part-of-speech" attribute designates a personal name. Based on this "part-of-speech"
information, phrase tags are then added around the personal names. At this stage of
tagging, titles will not be included as part of the person, nor will other entity
modifiers.

Phrasing of locations and organizations is done in the LISP-based rule modules. Unlike
personal names, the word components of locations and organizations are already
segmented by the time they reach the phraser. The phraser needs to identify all the
components that make up the organization or location and put them together into a
phrase.
The general rule of thumb is to put together things that are not separated by possessive
particles. 美国国防部 (meiguo guofangbu - American defense department) should be treated
as one organization, whereas 美国的国防部 (meiguo de guofangbu - America's defense
department) would be tagged separately as a location and an organization33. Or, if a
superior organization or location precedes the immediate mention of a subordinate or
representative organization, and they are not separated by a possessive, they should be
tagged as one34. An example of that would be 中国驻美领事馆 (Chinese consulate stationed
in the US): China (中国) is separated from the consulate (领事馆) by the verb 驻 (zhu -
stationed), but since this is not a possessive, the primary organization is combined
with the consulate to form a single organization. Similar rules apply for locations:
中国北京 (zhongguo beijing - Beijing, China) is tagged as one location because the two
are not separated by a possessive.
33 Multilingual Entity Task (MET) Definition for Chinese, 1996
34 Multilingual Entity Task (MET) Definition for Chinese, 1996
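The possessive rule could be sketched as follows (tag names are invented for illustration; a location token merges into a following organization token unless the possessive 的 intervenes):

```python
# Toy sketch of the possessive rule: adjacent location + organization tokens
# merge into one organization phrase unless separated by 的.
def merge_entities(tokens):
    """tokens: list of (word, tag) with tag in {'loc', 'org', 'de', 'other'}."""
    merged = []
    for word, tag in tokens:
        if merged and tag == "org" and merged[-1][1] == "loc":
            merged[-1] = (merged[-1][0] + word, "org")  # 美国 + 国防部 -> one org
        else:
            merged.append((word, tag))
    return merged

print(merge_entities([("美国", "loc"), ("国防部", "org")]))
print(merge_entities([("美国", "loc"), ("的", "de"), ("国防部", "org")]))
```

In the second call, the intervening 的 breaks the adjacency, so the location and organization stay separate, matching the 美国的国防部 example above.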
Generic mentions are in most cases not tagged. Generic mentions constitute mentions of
organizations and locations that are not identifiable out of context. Examples of such
mentions are 反对党 (fanduidang - opposition party) and 办公室 (bangongshi - office).
The MET2 guidelines for NE tagging in Chinese specify that "US government" is treated as
a generic mention and not tagged. This rule is relaxed in our hybrid NE system: any
location linked with a government office is tagged as an organization. The MET2
guidelines further specify that each word may only have one NE tag. This rule is adhered
to. Our tag set includes only ORGANIZATION, LOCATION, and PERSON; any word will be
wrapped with at most one of these tags. Under EDT guidelines, locations that make
reference to an organization, or broad organization mentions, are tagged as a GPE. The
US Government is one example of a GPE. Another example of a GPE is 白宫 (baigong - the
White House). In our NE task solution, we handled part of the GPE tagging by tagging
白宫 as a location and the US Government as an organization. This is far from complete,
however, and much more needs to be done before the EDT entity detection task is
satisfied.
3.8 EDT and RDC
We will now talk briefly about the EDT and RDC task for the benefit of discussion.
There are no results to show for EDT and RDC, but it is important to understand the final
goal of the project in order to appreciate the intermediate steps. EDT and RDC stand for
Entity Detection and Tracking, and Relations Detection and Characterization. The goal
of EDT is to identify a specific entity and track its activities or characteristic through
documents. The EDT task is actually a complex of different tasks. 3 5
1.
The detection of entities. This is the bases on which all other tasks are built. This
is for the detection of specific entities. The named entity phrasing module
35 EDT Metonymy Annotation Guidelines Version 2.1, 2001
Author: Daniel Hu
MIT-EECS
Page 33
Development of Part of Speech Tagging and Syntactic Analysis Software for Chinese Text
5/29/01
completes part of the task, although another processing layer need to be added to
clarify between location, organization, and GPE.
2.
The recognition of entity attributes. This is partly integrated with the first part.
Distinctions need to be made as to the type of named entity. The tag set expand
slightly as more specific labels are assigned.
3. The detection of entity mentions. This measures the system's ability to associate a
set of mentions with a specific underlying entity. This includes pronominal
mentions.
4. The recognition of mention extent.36
The RDC uses information from EDT and advances it one step further by finding
relationships between the different entities that are supported by explicit or implicit textual
references. The target relations are Role, AT, and Part, which respectively link a person
with an organization, an organization or person with a location, and a person or organization
with a person or organization. All three relations contain fields for temporal
attributes. In addition, the Role relation also carries a Role attribute to classify the
connection between two entities.37
To satisfy the EDT and RDC requirements, phrasing capability would have to be
expanded to include titles, pronominal mentions, some generic mentions, and all GPEs. To
this point, we have completed only part of the GPE tagging beyond the traditional NE
task. But we believe that the techniques developed for the NE task may be easily adapted to
build more sophisticated rules to tag the other mentions.
3.9 Treebank and PKU Data
University of Pennsylvania and Beijing University (PKU) have independently hand-annotated
data available as a resource to computational linguists. This is a valuable tool
for machine training over a wide range of issues. University of Pennsylvania's effort was
36 EDT Metonymy Annotation Guidelines Version 2.1, 2001
37 Preliminary Guidelines for Relation Detection and Characterization (RDC), Draft Version 2.2, 2001
aimed at creating a 100,000-word corpus of Mandarin Chinese text with syntactic
bracketing. Their effort consisted of two stages: the first stage was word segmentation and
part-of-speech tagging, and the second stage was syntactic bracketing. Beijing University's
annotation effort focused only on segmentation and part-of-speech tagging, although
later efforts added basic phrase identification. The size of their corpus is much larger, at
around 1 million words. Both research facilities used Chinese news data. University of
Penn annotated Xinhua News Agency articles; PKU annotated one month of People's
Daily. The tone of both sources is very similar. For our purposes, segmentation and
part-of-speech tagging were enough, so the University of Penn data was stripped of the
syntactic bracketing before being used for POS training and evaluation. The
segmentation guidelines for both annotation efforts are similar, but there are small
differences, and this is reflected in the percentage of words of different lengths (see table
2). Part-of-speech tagging is also different. Since University of Pennsylvania extended
their effort to include syntactic analysis, their part-of-speech tags carry less detail and rely
more on syntactic labels. PKU relied exclusively on part of speech and some limited
nominal differentiation to carry all of the information, and therefore developed a larger
tag set. Table 4 presents their POS tags in detail. The following is some sample data.
Larger samples can be viewed in Appendix E.
University of Pennsylvania Treebank Data:
<S>
((IP (NP-SBJ (NP-PN (NR …))
             (NP (NN …) (NN …) (NN …)))
     (VP (VP (VV …) (AS 了))
         (NP-OBJ (NP-APP (NN …) (PU ，) (NN …) (PU ，) (NN …))
                 (QP (CD …) (CLP (M …)))
                 (NP (NN …)))
         (PU ，)
         (VP (NP-TMP (NT …)) (ADVP (AD …)) (VP (VV …))))
     (PU 。)))
</S>
Input into the Alembic system would appear as:

<S><LEX POS="NR">…</LEX><LEX POS="NN">…</LEX><LEX POS="NN">…</LEX><LEX
POS="NN">…</LEX><LEX POS="VV">…</LEX><LEX POS="AS">了</LEX><LEX
POS="NN">…</LEX><LEX POS="PU">，</LEX><LEX POS="NN">…</LEX><LEX
POS="PU">，</LEX><LEX POS="NN">…</LEX><LEX POS="CD">…</LEX><LEX
POS="M">…</LEX><LEX POS="NN">…</LEX><LEX POS="PU">，</LEX><LEX
POS="NT">…</LEX><LEX POS="AD">…</LEX><LEX POS="VV">…</LEX><LEX
POS="PU">。</LEX></S>
Beijing University People's Daily Data:
19980110-01-001-023/m …/v …/v …/v …/n …/vn ，/w …/vn ，/w …/n 。/w
Input into Alembic system would appear as:
<S><LEX POS="v">…</LEX><LEX POS="v">…</LEX><LEX POS="v">…</LEX><LEX
POS="n">…</LEX><LEX POS="vn">…</LEX><LEX POS="w">，</LEX><LEX
POS="vn">…</LEX><LEX POS="w">，</LEX><LEX POS="n">…</LEX><LEX
POS="w">。</LEX></S>
4. Method
The methods section provides the reader with details on how we developed
solutions to the problems and questions framed in the background section. We
will be using tools developed by the author of this paper or adaptations of resources available
from other research labs. The organization of the methods section roughly corresponds to
the flow of data through the system. The first section deals with sentence tagging and
paragraph tagging, a prerequisite for anything passing through to the Alembic processing
environment. Then we describe our segmentation efforts, going into detail about how we
integrated three different segmenters into the system, along with a description of the OTS
software from Beijing University. Embedded in the segmentation problem is also the
problem of name identification. The segmentation section ends with a conclusion and a
summary of the pros and cons of the different methods we have examined. The next section
provides a solution to part-of-speech tagging. Throughout the segmentation and POS
sections, we mention using various derived wordlists. The next section gives details
about how those wordlists are derived and merged. There is also an inventory of extra
wordlists that will help the NE and EDT tasks. The next large step in this process, after
the Chinese text has been segmented and POS tagged, is NE tagging. We divided the
phrasing task into machine training, scoring, generating frequency data, manual wordlist
development, and finally manual rule writing. Our efforts up to this point in time are
satisfactorily described by the above summary.
4.2 Sentence tagging
Sentence tagging is a relatively simple task for Chinese. In English, we run into
problems with ambiguous punctuation such as "Mr." or "P.O.W." In Chinese, such
mid-sentence abbreviations never use terminal punctuation such as the Chinese
period (。), the exclamation mark, or the question mark. The baseline for sentence
tagging is a simple algorithm that reads down the string and, whenever a period, an
exclamation mark, or a question mark is found, assumes that the previous sentence has
ended and a new sentence has begun. This works well until we run into spoken text and
casual writing such as email. There are two approaches we have adopted in our code to
overcome these problems. For the problem of quoted speech endings, we have created a long list
of possible punctuation strings. For the second problem, we listed a small group of
terminal punctuations; if only those appear in a string of punctuations, we call it the
end of a sentence. We also run into the problem of headlines. Some texts have headlines
within the <TEXT></TEXT> region, and those rarely have end-of-sentence
punctuation. In order to minimize confusion to the segmenter, if a sentence appears on
a line of its own and the $insentence token is false, we tag the whole line as a
sentence. There may be exceptions to the rule, but this works very well for all newswire
text. To see sample output from the sentence tagging module, please refer to table 3 and
the sample text in the appendix.
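A minimal sketch of this baseline in Python (the production module is written in PERL, and the punctuation set here is a small illustrative sample, not the full lists we use):

```python
# Illustrative terminal punctuation; the real module uses much longer lists.
TERMINAL = set("。！？")  # Chinese period, exclamation mark, question mark

def split_sentences(text):
    """Greedy left-to-right scan: close a sentence at each terminal mark.
    A trailing fragment with no terminal punctuation (e.g. a headline)
    is tagged as a sentence of its own."""
    sentences, current = [], []
    for ch in text:
        current.append(ch)
        if ch in TERMINAL:
            sentences.append("".join(current))
            current = []
    if current:
        sentences.append("".join(current))
    return sentences
```

Note that this baseline still splits inside quoted speech; the production version consults the longer list of punctuation strings before closing a sentence.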
4.3 Segmentation

4.3.2 MMS
The baseline for Chinese segmentation is the maximum matching string (MMS)
model, which is implemented in Erik Peterson's code downloaded from the web. Chinese
is read from left to right, so all of the multi-character words can be matched by scanning
the text from left to right and applying a greedy algorithm. All possible word matches
are compared with the words in a lexicon. On each loop of the algorithm, a morpheme is
appended to the right of the existing string; if the new string exists in the
dictionary, the character is added to the string. If the newly formed string makes no
sense, then the old string is segmented out as one word, and the new character is
treated as the leftmost character of a potential new word.
Generic sentence:
Our test sentence: Chinese agriculture is developing quickly.
中 (zhong - central)
中国 (zhongguo - China)
中国农 (zhongguonong - makes no sense, so nong is separated)
中国_农业 (*_nongye - *_agriculture)
中国_农业发 (*_nongyefa - makes no sense, so fa is separated)
中国_农业_发展 (*_fazhan - *_development)
中国_农业_发展迅 (*_fazhanxun - makes no sense, so xun is separated)
Figure 6 Basic application of the Maximum Matching String algorithm
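The greedy matching just illustrated can be sketched as follows (Python for brevity; the code we actually modified is Erik Peterson's PERL script, and the toy lexicon below is an assumption for illustration):

```python
def mms_segment(text, lexicon):
    """Greedy maximum matching: at each position, emit the longest
    string found in the lexicon, falling back to a single character."""
    words, i = [], 0
    while i < len(text):
        best = i + 1                      # single-character fallback
        for end in range(i + 1, len(text) + 1):
            if text[i:end] in lexicon:
                best = end                # remember the longest match so far
        words.append(text[i:best])
        i = best
    return words
```

With the toy lexicon {中, 中国, 农业, 发展, 迅速}, the sentence from Figure 6 segments as 中国_农业_发展_迅速.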
This method works fairly well for short strings of one or two bound morphemes.
But the most obvious problem arises with words that do not progress in a
one-morpheme-at-a-time fashion. Idioms are common in Chinese, and the majority of idioms are
composed of four or five characters. Many times the first one or two characters might
make sense as a word, but how do we take the leap from a one- or two-character string to a
four- or five-character string? The key is building differentiation into the wordlist. When
a lexical dictionary is initialized for computation purposes, we create two states for
each string. One is "this string exists as an independent word". The second is "this
string may exist as an independent word, but only if you add more to it". We also
add a temporary variable that stores the characters which are uncertain as to whether
they will be added to the previous segment. So when a string is looked up in the lexicon,
we get one of three possible states: "doesn't exist", "add more", or "this is a word". The
figure below illustrates the application of this algorithm.
Sentence with an idiom:
这是一个一箭双雕的计划。 (This is a plan to kill two birds with one stone)
The idiom: 一箭双雕 (yijianshuangdiao - killing two vultures with one arrow)

Segment                                              State        Temp Variable
这_是_一个_一 (zhe-shi-yige-yi - this is a ...)        add more     一 (yi)
这_是_一个_一箭 (zhe-shi-yige-yijian - this is a ...)   add more     一箭 (yijian)
这_是_一个_一箭双 (zhe-shi-yige-yijianshuang)           add more     一箭双 (yijianshuang)
这_是_一个_一箭双雕 (zhe-shi-yige-yijianshuangdiao -    word         *clear temp var
this is a killing two birds with one arrow ...)
这_是_一个_一箭双雕_的 (...yijianshuangdiao-de)          don't exist  的 (de) *clear temp var
这_是_一个_一箭双雕_的_计 (...yijianshuangdiao-de-ji)    don't exist  计 (ji) *clear temp var
Figure 7 Working with idioms in the MMS algorithm
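The three-state lookup of Figure 7 amounts to a prefix table over the wordlist; a sketch under our own naming (the tiny word set is illustrative):

```python
def build_states(words):
    """Map every proper prefix of a word to 'add more' and every full
    word to 'word'; anything absent means 'doesn't exist'."""
    states = {}
    for w in words:
        for k in range(1, len(w)):
            states.setdefault(w[:k], "add more")
    for w in words:
        states[w] = "word"
    return states

def segment(text, states):
    """Extend the current string while the state allows it, remember the
    last position where the state was 'word', and emit that match once
    extension hits 'doesn't exist'."""
    words, start = [], 0
    while start < len(text):
        best = start + 1                  # single-character fallback
        for end in range(start + 1, len(text) + 1):
            state = states.get(text[start:end])
            if state is None:             # "doesn't exist": stop extending
                break
            if state == "word":
                best = end
        words.append(text[start:best])
        start = best
    return words
```

On the idiom sentence of Figure 7, the lookup buffers 一, 一箭, and 一箭双 under "add more" before committing to the full idiom 一箭双雕.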
This segmentation algorithm forms our baseline performance. But it runs into
problems when the greedy algorithm cuts characters from other words, or binds free
morphemes. An example can be found in the figure below.
Original sentence
三十人参加了会议 (Thirty people attended the meeting)
Greedy MMS output
三十_人参_加_了_会议 (Thirty ginseng... the sentence does not make sense)
Correct segmentation
三十_人_参加_了_会议
Figure 8 An example of a common mistake that the greedy segmenter will always make; the region
of ambiguity is 人参 (ginseng) versus 人_参加 (people attended).
Proposed solutions to this common problem include incorporating a frequency model
(in this case, "attended" will have a higher frequency of usage than "ginseng"),
utilizing part-of-speech information with a Hidden Markov Model, and applying heuristic
linguistic rules. We will examine some of these proposed solutions and their
implementations in later sections.
Numerous MMS implementations for segmenting Chinese are freely available. The
version of the code that we downloaded and modified is from MandarinTools.com. The
source code is written in PERL. The original wordlist consists of 23,000 words. We
downloaded alternative wordlists that ranged from 50,000 to 130,000 words. Wordlists
are a critical part of the lexicon-based segmentation process. The coverage of the
wordlist, the definition of a word underlying the wordlist, and the character encoding all
play a determining role in the performance and the granularity of the segmentation. But
when we use a particular wordlist, we also have to keep in mind the part-of-speech
tagging that we do in a later module, so the wordlist used for segmentation has to
agree with the part-of-speech wordlist. The ideal is to have a POS dictionary with very
good coverage, so that the POS dictionary can be used directly during our segmentation process.
MMS has built-in numeral, time, and name identifiers. The numeral and time
segmentation simply utilizes lists of time expressions, measure units, and Chinese
numbers. In the greedy algorithm, if a string is completely formed from elements of
one of the above lists, it is tagged as a time or numeral expression, respectively. During
implementation and modification, we discovered numerous missing characters
on these lists, and so we augmented the lists with our own. Personal name tagging is
different: there is no easy solution to tagging personal names, and the whole topic
deserves a section of its own.
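The list-membership test for numerals and times can be sketched as follows (both character sets here are partial samples we assume for illustration, not the augmented lists we actually use):

```python
NUMERAL_CHARS = set("〇一二三四五六七八九十百千万亿")  # partial sample
TIME_UNITS = set("年月日时分秒")                      # partial sample

def is_numeral(segment):
    """Tag a segment as a numeral if every character is on the numeral list."""
    return bool(segment) and all(ch in NUMERAL_CHARS for ch in segment)

def is_time(segment):
    """Tag a segment as a time expression if every character is a numeral
    or a time unit, and at least one time unit is present."""
    return (bool(segment)
            and any(ch in TIME_UNITS for ch in segment)
            and all(ch in NUMERAL_CHARS or ch in TIME_UNITS for ch in segment))
```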
4.3.3 Personal Names
There are two categories of personal names: Chinese names and phonetic
translations of foreign names. The techniques used in our segmentation tools to detect them
are different, so they are presented in two separate paragraphs.

The identification of Chinese names comes after the recursive loop that does the
initial segmentation. The wordlists do not contain any Chinese names, so presumably
all the Chinese names are presented as unbound morphemes. The process that follows
attempts to identify which of the unbound morphemes look most like a name
sequence, and then merges them. Chinese names have a finite list of surnames to choose
from, and this list of surnames has changed little over recent history. The characters that
can make up given names are not limited, except for some particles that absolutely cannot
be part of a name. But the length of the given name is limited to two morphemes, and the
most frequent length of a whole name is three. Any lexicon-based segmentation program
would first look for a single surname character that is not bound to larger strings. Then,
if it is followed by a title (looked up on a list of possible titles), or followed by one or two
unbound morphemes, or preceded by a name predicate, the string is most likely a
person. This simple algorithm is coded into Erik Peterson's MMS script. As an extra
layer of precaution, the name identification subroutine also looks at a list of characters
that frequently appear unbound but rarely appear in names. This list includes a
lot of particles, interjections, and prepositions.
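A sketch of the surname-driven merge described above (both lists are tiny samples standing in for the real wordlists, and the title-list and name-predicate checks are omitted for brevity):

```python
SURNAMES = {"王", "李", "张", "刘", "陈"}      # sample of the surname list
NEVER_IN_NAMES = {"的", "了", "和", "在"}      # particles that cannot be in names

def merge_names(tokens):
    """After initial segmentation, merge an unbound surname with the
    one or two unbound single characters that follow it into one
    personal-name token."""
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in SURNAMES:
            given, j = [], i + 1
            while (j < len(tokens) and len(given) < 2
                   and len(tokens[j]) == 1
                   and tokens[j] not in NEVER_IN_NAMES):
                given.append(tokens[j])
                j += 1
            if given:                      # surname + 1-2 unbound characters
                out.append(tok + "".join(given))
                i = j
                continue
        out.append(tok)
        i += 1
    return out
```

A production version would also consult the title list and name predicates mentioned above before committing to the merge.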
The segment-then-merge approach to tagging names runs into problems when any of
the name characters can be bound to neighboring morphemes. An example is embedded
in the following sentence:

二中的王立学校长发言 (erzhong de wanglixue xiaozhang fayan - chancellor Wang
Lixue of No. 2 Middle School speaks)

王立学 (wanglixue) is the name of the chancellor. But when the name is put together
with the title 校长 (xiaozhang - chancellor), we can form the new word 学校 (xuexiao -
school). A literature search revealed extensive study of this problem. One proposed
technique uses frequency models.38 When an unbound surname morpheme is found (the
study did not mention the case where the surname is bound), the two morphemes following
the surname are examined. A list of character frequencies built from a very large bank of
Chinese names is used to determine whether these two morphemes are likely to be part of the
given name. The likelihood of the whole name string is balanced against the probability of
the morphemes appearing alone (this includes the surname) and the probability of the
leftover morpheme appearing unbound. The general conclusion of the study is that
over 90% of the time, given-name characters are unbound. In the ambiguous cases,
the second morpheme is likely to be separated from the polysyllabic word and put
together with the surname, while the third morpheme is likely to remain in the polysyllabic
word. Another problem with name identification is abbreviation. In some literature, the
given name frequently appears on its own. This is not the case for newswire,
however, as most Chinese names are written out in full. In some editorials, the full name
may be used once, and then only the given name is used later in the article as a sign of
familiarity. Segmentation tools have a very hard time catching this. The post-segmenter
processing module for Alembic is capable of remembering earlier mentions of the
full name, deciding whether later strings contain characters that
suspiciously match those of earlier mentions, and tagging them as personal names.
Handling foreign names is different from handling Chinese names. Foreign names have
no finite surname list and no length restriction. Fortunately, foreign names are usually
translations that derive their phonetics from a finite set of Chinese characters, ranging
from widely used characters to characters that specifically mimic a sound. The
baseline approach to tagging a foreign name is to find a string of these foreign-phonetic
characters that cannot be placed in polysyllabic words with neighboring characters, and to
put them together to form a foreign name.39 This is done in a way similar to the algorithm
used to tag numerals and times. There is still the problem of telling a foreign person name
from a foreign location name; this is currently an unresolved problem. However, the majority of
38 Large Corpus Based Methods for Chinese Personal Name Recognition, Journal of Chinese Information
Processing, 6(3):7-15
39 Erik Peterson, 1996
the locations are followed by a location identifier, so the LISP post-processor is able to
capture false tags and convert them into the correct ones.
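The run-collection step for foreign names, analogous to the numeral and time tagger, might look like this (the phonetic character set is a tiny illustrative sample of the real list):

```python
FOREIGN_PHONETIC = set("克林顿斯巴尔姆伦敦")   # tiny sample of the phonetic list

def merge_foreign_names(tokens):
    """Merge maximal runs of two or more unbound single characters that
    all come from the foreign-phonetic list into one candidate name."""
    out, i = [], 0
    while i < len(tokens):
        j = i
        while (j < len(tokens) and len(tokens[j]) == 1
               and tokens[j] in FOREIGN_PHONETIC):
            j += 1
        if j - i >= 2:                     # at least two phonetic characters
            out.append("".join(tokens[i:j]))
            i = j
        else:
            out.append(tokens[i])
            i += 1
    return out
```

Whether the merged run is a person or a location is left to the LISP post-processor, as described above.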
Foreign phonetics    Surname    Non-name    Numerals    Classifiers
(sample characters for each supporting list)
Table 4 Sample list of the support wordlists used for segmentation. These are universal to the MMS,
Unigram, and Rule based systems.
4.3.4 Heuristic Rule Based System
The previous sections on the maximum matching algorithm described the simplest
implementation of a Chinese segmenter. But towards the end of 4.3.2, we gave
an example of an ambiguous case where MMS will fail every time. Without a look-ahead
model, those ambiguities cannot be resolved. Prior research with look-ahead models has
ranged from a simple two-word look-ahead to hidden Markov models over the whole
sentence. In the end, the cost associated with building a very complex system, namely
speed, becomes prohibitive with only a very small gain in performance. Once
performance has reached a certain level, the solution provided by a complex system for
one type of ambiguity will usually create another. So what is the balance point between
performance and speed? A number of research facilities have settled on three-word
look-ahead models. A paper published by Chen and Liu in 1992 detailed one such
segmentation algorithm. It is simple enough that processing time is not dramatically
more than that of simple MMS. It incorporates six rules to determine what is the best
segmentation. Five of those rules were applied in our processing module. The rules are
listed below; their rank of importance goes from A to E.

A. The most plausible segmentation is the three-word sequence with the maximal length.
With a simple look-ahead, find the three word segments that, together, achieve
the longest length.
ABC_D_EFG (correct)
A_BC_DEF (incorrect)

B. When there is more than one way to choose three word segments of equal
length, pick the chunk with the smallest standard deviation of word length.
AB_CD_EF (correct)
AB_C_DEF (incorrect)

C. Normal words get higher priority than determinative-measure compounds. So if
AB_CD_E and AB_C_DE are the choices, and C is a measure compound, then
pick AB_CD_E.

D. Pick the chunk with the high-frequency monosyllabic words.
ABC_DE_F (F is low frequency)
A_BCD_EF (A is high frequency, so pick this one)

E. Pick the chunk with the highest probability based on its part-of-speech
information.40
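Rules A and B reduce to a two-part sort key over the candidate three-word chunks; a sketch (candidate generation is omitted, and the function name is ours):

```python
import statistics

def pick_chunk(candidates):
    """Rule A: maximize total chunk length; Rule B: break ties with the
    smallest population standard deviation of the three word lengths."""
    def key(chunk):
        lengths = [len(w) for w in chunk]
        return (-sum(lengths), statistics.pstdev(lengths))
    return min(candidates, key=key)
```

pick_chunk([("ABC", "D", "EFG"), ("AB", "CD", "EF")]) chooses the first chunk by rule A; pick_chunk([("AB", "CD", "EF"), ("AB", "C", "DEF")]) chooses the first by rule B.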
There is an implementation of this method available off the web. The code was
developed by Tsai for his Master's degree at the University of Illinois. It was written in C
and used BIG5 input and output, whereas our system uses GB character encoding. So a PERL
script was written to take GB input, convert it to BIG5, pass it to the segmenter,
and convert the BIG5 output back into GB. Tsai's code was tested against simple
maximum string matching and, surprisingly, did not perform much better. While it
resolved some ambiguities, it also created new ones. It also lacked good time,
numeral, and name algorithms. This problem was overcome by building time, numeral,
and name subroutines, similar to the ones found in MMS, into the PERL shell script.
This time, the performance was better than that of MMS, but the output still contained a
fair amount of errors. Observing the two preliminary results led to a proposal to add a
statistical model on top of rule A. We proposed that rule A combined with a simple
statistical model might be sufficient to overcome some of the problems, so that we need
only implement rules B and C on top of it to get a significant performance improvement.
40 Chen and Liu, 1992 COLING
4.3.5 Unigram Frequency Model
The third segmenter fields a statistical rule in place of the maximum-length rule of the
second segmenter. The rest of the rules are similar. This modification is aimed at
resolving some errors made by the first rule, specifically with uncommon words
that might have a longer length than common words.

A. We still find all the possible maximum string matchings for three-word
segments. For example, for the string ABCDEFG, we might find the following
ways to segment it into three-word character sets:
1. AB_CD_EF
2. A_BCD_EF
3. AB_CDE_FG
Under the original rule, the third segmentation string would be picked because it is the
longest. In the new method, we observe the number corresponding to the
frequency of occurrence of each word; call that number α. We also
substitute a number r for the length of the word; r is a constant for each word
length that increases non-linearly with the word length. Each Chinese word is then
assigned a weight combining r and α, where a set factor φ determines how much we
wish to weigh the frequency factor. For each possible three-word string, the
weights of the words are added together and compared. If there are significant
differences, whether due to frequency of occurrence or due to length of strings,
we pick the string with the highest weight. If the difference is not significant, we
resolve the conflict by picking the maximum string. If the difference in
weight is not significant and there are equal-length strings, we resort to the lower-level
rules.
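The exact weighting formula did not survive in our copy, so the sketch below assumes an additive form r(length) + φ·log(1 + α); the r table, φ, and the significance margin are all assumptions made for illustration:

```python
import math

R = {1: 1.0, 2: 3.0, 3: 5.5, 4: 8.5}   # assumed non-linear length constants
PHI = 0.5                               # assumed frequency weighting factor

def weight(word, freq):
    """Weight grows non-linearly with length and with log frequency."""
    return R.get(len(word), 10.0) + PHI * math.log(1 + freq)

def pick_chunk(candidates, freqs, margin=1.0):
    """Pick the three-word chunk whose summed weight clearly wins;
    otherwise fall back to the maximum-length rule."""
    scored = sorted(
        ((sum(weight(w, freqs.get(w, 0)) for w in c), c) for c in candidates),
        reverse=True)
    if len(scored) == 1 or scored[0][0] - scored[1][0] >= margin:
        return scored[0][1]
    return max(candidates, key=lambda c: sum(len(w) for w in c))
```

On the ginseng example of Figure 8, the much higher frequency of 参加 (attend) outweighs the longer match 人参 (ginseng).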
The frequency data for words comes from a list downloaded from the web.41 There are
huge flaws in this design. The first flaw is the way the frequency data was obtained: the
original heuristic-rules segmenter was used to scan 17 million characters, and the
resulting segments were stored and counted to obtain the frequencies. So the frequency
counts are biased towards the mistakes that the rules segmenter systematically
makes, the same mistakes that we are trying to correct with the segmenter that incorporates
frequency data. Secondly, the wordlist used in the original rules segmenter does not
agree with our segmentation guidelines. As a consequence, the wordlist with frequency
data does not match the smaller wordlists that we would like to use for our
segmentation. We feel that the wordlist with frequency data does not follow the "smallest
unit of meaning". Thirdly, the original rules segmenter does not have a personal name
41 http://www.chinesecomputing.com/
identification algorithm, so we suspect that the frequency data contains frequencies
of characters from personal names as well, shifting the balance of segmentation towards
monosyllabic "words". In the end, this trial at building a segmentation script is flawed,
and it will ultimately be discarded for other software with higher performance.
4.3.6 PKU OTS Software
Up till now we have explored different segmentation codes, made modifications, and
implemented our own with combined resources off the web. But these efforts focus only
on segmentation, and in the end we have to do part-of-speech tagging as well. Both
capabilities are found in the OTS software from Beijing University. We do not have the
source code for the PKU segmenter, so we can only speculate on how their segmenter is
implemented and give some examples of where their software performs well, and some
examples of where it fails.

The PKU segmenter takes in an article or sentence without any extraneous tags and
outputs segmented and part-of-speech tagged data. In some cases it performs
rudimentary phrase tagging. We suspect that their segmentation and part of speech are
implemented on two different levels, just like our system: segmentation on the first level,
and part of speech on the second level. With the most basic testing and deduction, we
know their segmentation algorithm incorporates a look-ahead model. In cases where we
fed it sentences with ambiguities, it almost always came back with the more common word
as the segmentation result, so we suspect that they use a frequency model to decide
between conflicting words. They also have a fairly powerful personal name tagger.
Behind this name tagger is the same surname-given name algorithm we discussed in
section 4.3.3, but it is more complex because there may also be a frequency
model built into it. The following table illustrates some of the sample data that led us
to these conclusions.
Figure 9 Input and output of the PKU segmentation & POS processor. The input/output pairs
illustrate four deductions: there is look-ahead capability; frequency of occurrence is used when
disambiguating conflicts; there are extensive name tagging capabilities; and a frequency model is
built into the name tagging system.
The inner workings of the part-of-speech tagging in the PKU software are harder to
deduce from the output. However, we believe that it is also lexicon based and has a
frequency model. The part-of-speech assignments make systematic mistakes, such as
turning nouns into verbs or verbs into nouns, which we plan to correct using a LISP
rule-based module.
The PKU software also attempts basic-level named entity tagging, although the
named entity tags in the output are more of a distraction than anything we can use to
satisfy the named entity task. The way they do it is very rudimentary: a large list of
organizations and locations taken from the manually tagged People's Daily data is checked
against all the input. For example, the software would be able to catch 美国国防部
(meiguoguofangbu - US Department of Defense) and tag it as one organization, but
would tag 澳大利亚国防部 (Australian Department of Defense) separately as a location
and a noun. We build a LISP-level NE task system based only on part-of-speech
information and segmentation, and will not rely on their phraser.
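A toy illustration of a POS-driven NE rule of this kind (the keyword list and the merge condition are our own simplifications, not the actual LISP rules):

```python
ORG_KEYWORDS = {"国防部", "政府", "银行"}   # sample organization keywords

def tag_organizations(tagged):
    """tagged: list of (word, pos) pairs. A place name (ns) directly
    followed by an organization keyword merges into one ORGANIZATION
    phrase, regardless of whether the pair appears on any list."""
    out, i = [], 0
    while i < len(tagged):
        word, pos = tagged[i]
        if (pos == "ns" and i + 1 < len(tagged)
                and tagged[i + 1][0] in ORG_KEYWORDS):
            out.append((word + tagged[i + 1][0], "ORGANIZATION"))
            i += 2
        else:
            out.append((word, pos))
            i += 1
    return out
```

This is exactly the pattern that the PKU list lookup misses for 澳大利亚国防部 above.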
As a final note in describing the PKU software, it has different input/output
requirements than the Alembic system. In order to integrate this software into our system, a
shell program written in PERL was built. This script takes sentence-tagged articles,
passes each individual sentence to the PKU segmenter, then takes the output from the PKU
segmenter and formats it into the LEX form that Alembic can read and use. Currently
such a system is implemented only for their web-based demo, where we pass individual
sentences to their demo site and retrieve the information we want from the resulting HTML
page. We have not yet obtained their software license, so we don't have their source
code or the actual program running on our computer; thus our testing has flaws. One of
the biggest problems is the word limitation built into their web demo: we cannot input more
than 50 characters at a time. This breaks up some longer sentences at the
50-character mark, which can potentially throw off their segmentation process. The main
reason we have tested their software extensively is that it uses the same part-of-speech
tags as their freely available tagged and segmented People's Daily corpus.
This large corpus provides valuable training for the phrasing task, the LISP segmentation
module, and the POS module. Since the NE phrasing task is done with the assumption that
the part-of-speech information will follow the same format as the output of the PKU
segmenter, we are very keen on obtaining their software and making it a critical or
optional part of our system.
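The formatting half of that shell script can be sketched as follows (sketched in Python for brevity; the real script is PERL and also handles the web round trip and the 50-character limit, and the sample tokens are illustrative):

```python
def pku_to_lex(pku_line):
    """Convert a PKU-style line of word/pos tokens into the <LEX> form
    that Alembic reads, wrapped in a sentence tag."""
    lexes = []
    for token in pku_line.split():
        word, _, pos = token.rpartition("/")
        lexes.append('<LEX POS="%s">%s</LEX>' % (pos, word))
    return "<S>%s</S>" % "".join(lexes)
```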
4.3.7 Rule Based Alembic Segmentation Software
One more avenue that we have explored briefly is a rule-based segmentation module
based on LISP. It is peculiar because in this system we are trying to merge characters
rather than separate them. All of the sentences are parsed into character sequences,
where each morpheme is treated like a word, before being passed on to the segmentation
module. The segmentation module then uses a large wordlist to merge characters
together to form words. This process was seriously explored only as a machine-trained rule
sequence. Training took place with 200,000 words from People's Daily, and after
learning a sequence of 160 rules, the performance level was 83% on the training corpus and
80% on a smaller testing corpus. Because the performance level was so low, we did not
explore it further. But with the licensing of a larger and more comprehensive wordlist
with POS information, we may be able to go back to this original concept and attempt it
again. More will be presented in the discussion section.
4.3.8 LISP Part of Speech Tagging
We presented in sections 4.3.2 through 4.3.5 how we approached the segmentation
problem independently of the part of speech problem. In this section, the LISP-based
POS tagging solution is presented in finer detail. We decided to use the Alembic rules
environment to complete the POS tagging because: (1) it is a flexible environment, in
which we can easily change, add, and delete wordlists without changing much of the hard
code; and (2) it is trainable, so we can automatically generate rules by giving the
machine all the resources that a human rule-writer would have and letting it run over
the large training corpus.
For the benefit of training, we adopted the same set of tags as those provided by
Beijing University. The first step is to develop a POS wordlist to aid training and
tagging. This is done by collecting all the words from the People's Daily data and
separating them into files categorized by their part of speech. One word may appear in
several files, since it may be used in several POS positions. The part of speech
information derived from the People's Daily data is presented in the following figure:
[Table data: for each PKU POS tag — Ag, Mg, vd, nr, a, ns, nt, b, c, d, vn, e, ad, nx,
f, h, nz, i, Bg, j, k, Dg, l, n, an, o, p, q, r, s, t, Ng, u, v, y, Rg, z, Tg, Vg —
the total word count and the breakdown by word length (1, 2, 3, and 4 characters, and
other).]
Figure 10 PKU POS tags and the number of words associated with each POS. The character
columns represent the number of characters in each word.
There are a total of 50,000 words. This covers most of the common words used in
Chinese. However, it is not the complete list; the complete list would have close to
double the word coverage. Licensing for the complete list is still being negotiated.
In the meantime, the derived partial list is used to train the POS tagger.
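The wordlist derivation described above can be sketched as follows, assuming the PKU word/POS token format (e.g. 学生/n); the sample corpus lines are invented, and the real script writes each POS set out to its own file.

```python
# Sketch of deriving per-POS wordlists from a segmented, tagged corpus in
# the PKU "word/pos" format. A word may land in several lists, since it
# may be used in several POS positions.
from collections import defaultdict

def build_pos_wordlists(lines):
    wordlists = defaultdict(set)          # POS tag -> set of words
    for line in lines:
        for token in line.split():
            word, _, pos = token.rpartition("/")
            if word:                      # skip malformed tokens
                wordlists[pos].add(word)
    return wordlists

corpus = ["我们/r 是/v 学生/n", "学生/n 学习/v"]  # invented sample lines
lists = build_pos_wordlists(corpus)
print(sorted(lists["n"]))  # words seen with the noun tag
```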
In training, we used the People's Daily data. Purists would argue that we should not
use the same data from which the wordlists are derived. From another point of view, we
are trying to tailor the rules for the situation where we have a complete list instead
of just a partial one. Theoretically, the complete lists should have very good
coverage, so using the derived list to aid in tagging the corpus from which it was
derived is a time-saving measure. When we obtain the license for a good Chinese part
of speech dictionary, we will not have to learn the rules over again, because we will
have much better coverage over a broad corpus selection. More about the part of speech
results will be presented in the results section.
4.4
Phrasing system in Alembic and OTS software
Our efforts for segmentation have been comprehensive. We have researched and tested
several systems that we might consider as part of the final product delivery for
ACE Chinese. However, we are still evaluating our options, and have built a phrasing
system fairly independent of the segmentation process. Our segmentation and part of
speech tagging efforts largely utilized Beijing University's hand-annotated People's
Daily data [42]. Thus, the output from the segmentation and part-of-speech effort uses
the same set of part of speech tags as Beijing University [43], and follows the same
segmentation guidelines. Our phrase training will be done with the same hand-annotated
data, and be based on the PKU POS tag set.
Ideally, we would have completed the segmentation and part-of-speech exploration before
going on to phrasing. That way, the results from the phrasing module would present the
complete system results up to that step of the process. But time and resource
constraints meant that we had to pursue phrasing and segmentation/POS tagging in
parallel, so the phrase training and results are all based on a perfectly segmented and
tagged corpus. To get a sense of the final results up to the NE task, we would have to
multiply the final score of
[42] See Appendix E1
[43] See Figure 5
the segmenter and POS tagger by the final score from the phrase task. There are also
benefits to building a system this way. As we explored NE phrasing, we developed a
clearer picture of the patterns underlying name, organization, and location phrases,
unclouded by systematic mistakes generated by the segmenter and POS tagger.
4.4.2
Training
We took the People's Daily data (one large file), converted it into a format the
Alembic phrase rule learner can read, and saved it into a database. The total database
is around 3,000 files, or 3,000 articles. Each file was named by the date of the
article, the section number to which it belonged, and the article number within the
section. Then we concatenated the 3,000 files into 10 large files and 10 small files,
used for training and testing, respectively. To run the trainer, the Alembic system is
first loaded into a LISP interface. Then we load all the resources that the Alembic
system is asked to use when learning, such as wordlists and lists of tags. Finally, we
invoke the learning process at the LISP interface.
The first training pass was cold training: all the system was given was the
part-of-speech tagged corpus. The machine had nothing else to work with. In general,
the rules it developed looked at the part-of-speech tags and the suffixes of Chinese
words as indicators of names, organizations, and locations. In the second pass of
training, we gave the machine wordlists to work with, including surnames, POS
wordlists, foreign characters, location names, and other lists. This time the system
performed better. However, it clearly did not know how to use the full potential of
the wordlists provided to it. In the third pass, we separated personal names,
organizations, and locations, and trained the machine on each one individually, to see
which area needs the most work to achieve better overall scores. Results from each of
these passes and the section conclusions can be found in the results section.
Supplementary wordlists | Description | Origin
Capitals.txt | List of national capitals | Translation of English wordlist
Classifiers.txt | List of Chinese classifiers | Manually compiled by linguist
foreign1.txt & foreign2.txt | List of Chinese characters used for phonetic translation | Compiled from MMS, MET2, and BBN data
Locations.txt | List of Chinese and foreign locations | Compiled from MET2 and BBN data
Nations.txt | List of nations | Translations of English wordlist
Organizations.txt | List of organizations | Compiled from MET2 and BBN data
Surname.txt | List of surnames | Compiled from MMS and People's Daily
Title.txt and Title2.txt | List of titles in Chinese | Manually compiled by linguist
usstates.txt | List of US States | Translations of English wordlist
zh-provinces.txt | List of Chinese provinces | Manually compiled by linguist
Figure 11 List of the supplementary wordlists used for machine training. See Appendix
B5 for translation code.
4.4.3
Testing, scoring, and error tracking
The Alembic system works with rules. Each rule is applied over the whole corpus, and
no rule is ever repeated. The general trend is to go from broad rules to narrower
rules. The rule learner functions much the same way: it has a list of patterns that it
looks for, and for each rule it generates, it attempts to achieve the maximum gain in
performance out of the list of rules it knows how to generate. It has no
ability to anticipate future rules. For example, sometimes it is more beneficial to tag the
specific things first, so that less general rules can be written later to avoid
over-tagging. The training system almost always produces a smooth curve going from
broad rules to specific rules. Eventually it reaches a quasi-plateau, where the rules
it generates are so specific to the training corpus that they cannot advance
performance on the testing corpus any more. We have graphed one training process below
to show this curve. An example of the learned rules can be found in Appendix B2.
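The greedy, no-lookahead behavior of the rule learner can be sketched as follows. The rule representation and the scoring callables are simplified assumptions; the real learner works over Alembic pattern rules.

```python
# A minimal sketch of the greedy rule-learning loop: at each step the
# learner scores every candidate rule on the current corpus state, keeps
# the single rule with the best immediate gain, and repeats. It has no
# lookahead, which is why training eventually plateaus on overly specific
# rules. (Rules, apply_rule, and score are simplified assumptions.)

def learn_rule_sequence(candidates, apply_rule, score, corpus, min_gain=1):
    learned = []
    current = score(corpus)
    while True:
        best_rule, best_corpus, best_score = None, None, current
        for rule in candidates:
            trial = apply_rule(corpus, rule)   # apply_rule must be pure
            s = score(trial)
            if s > best_score:
                best_rule, best_corpus, best_score = rule, trial, s
        if best_rule is None or best_score - current < min_gain:
            break                              # quasi-plateau: stop learning
        learned.append(best_rule)
        corpus, current = best_corpus, best_score
    return learned

# Toy example: "rules" are numbers added to a set, scored by their sum.
print(learn_rule_sequence([1, 2, 3], lambda c, r: c | {r}, sum, frozenset()))
# → [3, 2, 1]: the biggest immediate gain is always taken first
```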
[Graph: sample training results for the NE-organization task; f-measure (0 to 70)
plotted against rule number.]
Figure 12 This shows an example of the training for organization tagging. Upper line represents
performance on training data. Lower line represents performance on testing data.
The measurement we used for scoring is a unit called the f-measure. Here, the
f-measure is the average of precision and recall. Recall is the percentage of correct
phrases that we tagged out of all the correct phrases. Precision is the percentage of
correctly tagged phrases out of all the tags we used. We can have a very high recall
because we tagged almost all the phrases, but still get a very low f-measure because
the phraser did a lot of extra flagging and the precision is down. The goal is to
achieve both high precision and high recall: produce a lot of tags and make every tag
count. The f-measure is what the machine looks at when learning.
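The scoring arithmetic can be written out directly from the definitions above (recall = Corr/Key, precision = Corr/Ans, f-measure = (R+P)/2), using one row of counts from the sample training log:

```python
# Recall, precision, and the f-measure as defined in this thesis: the
# f-measure here is the simple average (R + P) / 2, not the harmonic mean.

def score(key, ans, corr):
    recall = 100.0 * corr / key if key else 0.0
    precision = 100.0 * corr / ans if ans else 0.0
    return recall, precision, (recall + precision) / 2

# Counts from one row of the sample training log: Key=3665, Ans=2429, Corr=1440
r, p, f = score(3665, 2429, 1440)
print(round(r, 2), round(p, 2), round(f, 2))  # → 39.29 59.28 49.29
```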
There are several scoring programs that we are using. The one built into Alembic is
the same one that the machine learner uses. The second is an external script called
scorify. The difference between the two scorers is the way they treat under-aligned or
over-aligned tags. In general, scorify is more lenient with mis-aligned tags: it will
count some of them as correct phrases if the tags caught most of the words in the
phrase. The Alembic scorer is strict with mis-aligned tags and will take a few
percentage points off the f-measure when compared with scorify. Scorify also has the
advantage of keeping track of all the phrases, including the missed and spurious tags,
which it prints into a valuable report. The following are three samples from our
training and scoring task. Figure 13 is a log file generated by the training system;
it uses the Alembic scorer to generate the data. Figure 14 shows the score results
generated by scorify; this particular sample is only concerned with organization tags.
Figure 15 is a sample report from the scorify program.
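The difference between the two scorers can be sketched as follows; spans are (start, end) word offsets, and the 50% overlap threshold for the lenient scorer is an assumption for illustration.

```python
# Strict vs. lenient phrase scoring: the strict (Alembic-style) count
# requires an exact span match, while the lenient (scorify-style) count
# also credits a tag that covers most of a key phrase. (The >50% overlap
# threshold is an illustrative assumption.)

def overlap(a, b):
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def count_correct(key_spans, answer_spans, lenient=False):
    correct = 0
    for k in key_spans:
        if k in answer_spans:
            correct += 1
        elif lenient and any(
            overlap(k, a) > (k[1] - k[0]) / 2 for a in answer_spans
        ):
            correct += 1        # mis-aligned, but covers most of the phrase
    return correct

key = [(0, 4), (10, 13)]
ans = [(0, 4), (10, 12)]        # second tag is under-aligned
print(count_correct(key, ans))                # strict: 1
print(count_correct(key, ans, lenient=True))  # lenient: 2
```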
Applying learned rules to TRAINING SET sentences.

[Log table: for each rule number 0 through 10, the counts Key, Ans, Corr, Miss, and
Sprr, with recall, precision, f-measure, and the number of times each rule fired; the
f-measure climbs from 0.0 at rule 0 to roughly 49 by rule 10.]

No.: Rule number
Key: Total number of phrases in the corpus
Ans: Total number of tags generated by the rules
Corr: Number of correctly tagged phrases
Miss: Number of phrases missed out of all the phrases
Sprr: Number of spurious tags
R(%): Recall - Corr/Key
P(%): Precision - Corr/Ans
F-measure: (R+P)/2
Fired: Number of times the rule is applied
Figure 13 Sample log file from training. It shows the increase in performance, the
number of tags applied per rule, the number of correctly tagged phrases, recall, and
precision. This information is very valuable when used with the rule sequence to trace
what each rule did.
SUMMARY SCORES

[Summary-score table: for each subtask (enamex: organization, person, location, other),
the counts POS, ACT, COR, PAR, INC, MIS, SPU, and NON, and the percentages REC, PRE,
UND, OVG, SUB, and ERR, followed by an ALL SLOTS total row.]

ALL SLOTS F-MEASURES: P&R 74.47, 2P&R 71.19, P&2R 78.06
Under the subtask scores for enamex, organization:
POS: Number of real phrases in the corpus
ACT: Number of tags that the rules generated
COR: Number of correctly tagged phrases
MIS: Number of missed phrases
SPU: Number of spurious phrases
Figure 14 Sample output from the scorify program: *.scores file output.
Document 1998010103009

[Report excerpt: each line lists an ENAMEX ORGANIZATION tag with its TAG and RSPTYPE
status (cor = correct, mis = missed, spu = spurious), together with the KEYTYPE,
KEYTEXT, and RSPTEXT fields; the Chinese key and response text is omitted.]
Figure 15 Sample output from the scorify program: *.report-summary file output.
The information in figure 15 is very valuable in helping us generate manual rules.
More about the process of manually generating rules and testing them is presented below.
4.4.4
Compiling the frequency data
A detailed list of the results for the machine-learned phrase tagging system will be
presented in a later chapter. To summarize, the machine was unable to achieve very
high performance when tagging organizations, so we had to experiment with manually
writing rules for them. A sample rule sequence is presented in Appendix B3.
The systematic approach to writing good rules for tagging organizations is to
find the most frequent patterns and write rules based on those frequently occurring
patterns. A script was written to collect all organizations and their part-of-speech
information. Then the part-of-speech patterns that had more than 50 hits were recorded.
The sample results are presented below.
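The frequency-collection script can be sketched as follows; organizations are represented as lists of (word, POS) pairs, and the sample data is invented while the 50-hit threshold follows the description above.

```python
# Collect the POS-tag pattern of every annotated organization and keep the
# patterns with more than 50 occurrences, in the "ns_n_"-style notation
# used in Figure 17.
from collections import Counter

def frequent_patterns(organizations, min_hits=50):
    counts = Counter(
        "_".join(pos for _, pos in org) + "_" for org in organizations
    )
    return {p: n for p, n in counts.items() if n > min_hits}

# Invented sample: 60 city+noun organizations, 10 single-word ones
orgs = [[("北京", "ns"), ("大学", "n")]] * 60 + [[("国务院", "nt")]] * 10
print(frequent_patterns(orgs))  # → {'ns_n_': 60}
```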
[Sample entries: each organization is recorded as a sequence of <LEX POS="...">...</LEX>
tokens, with POS tags such as n, nt, ns, nz, j, vn, and spos; the Chinese text inside
the tags is omitted.]
Figure 16 Sample of the organizations collected by scanning through the People's Daily data.
[Frequency table: the part-of-speech patterns for organizations with more than 50 hits
each (e.g. nz_n_, n_n_, ns_n_n_n_, ns_ns_n_, ns_j_, n_n_n_, nt_j_, ns_n_vn_n_, ns_n_,
nt_n_, ns_nt_, ns_vn_n_, ns_nz_n_n_, nt_, ns_nz_n_, ns_n_n_, n_j_, nz_n_n_), with
counts ranging from 57 to 2487.]
Figure 17 The most frequently occurring patterns for organizations.
4.4.5
Generating wordlists to aid tagging sequences
In the first and second passes of machine training, we did not achieve the desired
results for the NE task; the machine learning sequence was still geared towards
English. However, even though the machine could not achieve the results we were aiming
for, it was already noticing some particular traits of Chinese, such as the suffixes
that denote organizations. This "knowledge" is not applied in a generalized way, so it
is unable to go above a certain level of performance. We sought to take advantage of
these "islands" and expand upon them to tag organizations, and to achieve higher-level
results with locations. This effort resulted in a couple of key lists that are used
for manually generated rules. These are the base suffix lists for organizations and
locations, collected from numerous manually tagged corpora such as MET2 [44] and PKU's
People's Daily, and a list of frequent words that contain the organization suffix but
are actually not organizations. The addition of these wordlists resulted in marked
improvements for the manual rules. In fact, each additional entry into the wordlists
results in a slight but noticeable improvement at this early stage. Below are sample
entries from the lists.
Locations (translation, example): Harbor (Hong Kong); Island (Hainan Island); River
(Yangtze River); Lake (Kunming Lake); River (Panama Canal); Stream; Village; Gorge
(Three Gorges)

Organizations (translation): Company; Department; Central; Agency; Organization;
Group; Department; Government

(Chinese forms omitted.)
Figure 18 Sample list of Organization and Location names.
[44] See Appendix E3
Non-location (Chinese forms omitted) | Translation: Officer; Society; Meeting

Figure 19 A list of frequently occurring words that include the organization suffix
but are actually not organizations.

4.4.6
Manually Generated rules
Our manual rule generation focused on the poorest performing part of the NE task. It
seeks to utilize all the information and resources that we have collected in the most
effective way. Writing rules is an iterative process. First we write the rules. Then
the rules are run over our test corpus. Next, we score the test corpus and check both
our results and the report-summary, with its missed and mis-aligned tag report. The
report is carefully read through, and new items are added to the wordlists and new
rules are written.
There are two ways to approach tagging. One way is to identify islands within the
phrases, and then expand left and right accordingly. The other way is to find the POS
tags or words around the organization or location phrase that provide good hints. We
achieved our current results by mainly operating under the island principle. Future work
would probably require using both approaches.
The first rule for tagging organizations captures all the words that end with an
organization suffix. This provides the "islands" off of which phrases can be expanded.
Then we write a rule to exclude those words that definitely cannot be organizations but
were tagged because they contain the organization morpheme. Next we use the frequency
data with part of speech information, going from the most specific POS tag patterns to
the most general. Finally we remove all the "islands" that did not fit any of the
organization POS patterns. The rules written for tagging organizations are attached in
Appendix B3.
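The island-based rule sequence can be sketched as follows. The suffix list, exception list, and allowed left-context POS tags are invented samples; the real rules are Alembic phrase rules (Appendix B3).

```python
# Island-based organization tagging: (1) mark every word ending in an
# organization suffix as an island, (2) drop islands on the exception
# list, (3) expand each surviving island leftward while the preceding POS
# tags fit a pattern seen in organization names. All lists here are
# invented illustrative samples.

ORG_SUFFIXES = ("公司", "大学", "部")
NOT_ORGS = {"司令部"}                 # has the suffix but is not an org
LEFT_POS = {"ns", "nz", "n", "j"}     # POS tags allowed inside the phrase

def tag_organizations(tokens):
    """tokens: list of (word, pos); returns (start, end) org spans."""
    spans = []
    for i, (word, _) in enumerate(tokens):
        if word.endswith(ORG_SUFFIXES) and word not in NOT_ORGS:
            start = i
            while start > 0 and tokens[start - 1][1] in LEFT_POS:
                start -= 1            # expand the island leftward
            spans.append((start, i + 1))
    return spans

sent = [("昨天", "t"), ("北京", "ns"), ("石油", "n"), ("公司", "n")]
print(tag_organizations(sent))  # → [(1, 4)]
```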
Running the rules required developing a specification file to call on the rules and to
load the necessary wordlists. A sample spec file is attached in Appendix B1.
[Diagram: the specification file connects wordlists, rules, and parameters to the
Alembic interface.]
5. Results
There are two sections of results. The first section is a qualitative analysis of the
segmentation and POS tagging phase. The second section is a more quantitative analysis
of the NE task. The results from phase 1 and 2 are separate from each other.
5.2
Results from phase 1
5.2.1
Sentence tagging
We achieved close to 100% tagging capability for sentences from newswire. A sample of
this output can be found in Appendix A1. One of the reasons for the high performance
is that the People's Daily and Xinhua News Agency data are highly predictable and
structured. The sentence tagging program has not been tested against data with
spurious carriage returns or bad punctuation. Another reason we had such good results
is that we did not make the distinction between headlines and regular sentences. In a
more formal evaluation, headlines would have to be picked out from the rest of the
sentences. This task presents less of a challenge in English, because headlines are
simply capitalized sentences without a terminating punctuation mark. For Chinese,
which has no capitalization, this task is a project by itself. Since we were seeking
to build an end-to-end NE system, not much time could be spent on distinguishing
headlines from regular sentences; thus they are all treated like regular sentences.
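A sentence tagger of the kind described can be sketched as follows, assuming sentences are delimited by the Chinese terminal punctuation marks, so that a headline (which carries no such mark) simply falls through as one more "sentence".

```python
# Split newswire text on Chinese terminal punctuation and wrap each piece
# in a sentence tag. A headline has no terminal punctuation, so it is
# emitted as one more "sentence" at the end. (The tag format and the
# punctuation set are illustrative assumptions.)
import re

TERMINATORS = set("。！？")  # full stop, exclamation, question mark

def tag_sentences(text):
    pieces = re.split("([。！？])", text)   # keep the delimiters
    sentences, buf = [], ""
    for piece in pieces:
        buf += piece
        if piece in TERMINATORS:
            sentences.append("<S>%s</S>" % buf)
            buf = ""
    if buf.strip():                         # trailing headline / fragment
        sentences.append("<S>%s</S>" % buf.strip())
    return sentences

print(tag_sentences("今天天气好。明天呢？"))
# → ['<S>今天天气好。</S>', '<S>明天呢？</S>']
```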
5.2.2
Segmentation
The output from the MMS, heuristic rules, and unigram segmenters can be found in
Appendices A2, A3, and A4, respectively. The simple MMS segmenter downloaded from
MandarinTools.com was converted to output text forms that the Alembic system can read.
It was also outfitted with better handling of numerals, surnames, foreign name
characters, and measure words. The adaptation of Peterson's MMS segmenter performed
surprisingly well,
especially when compared with the preliminary results from the heuristic rules
segmenter with the look-ahead model from Tsai. The biggest drawback of the rules
segmenter is that it did not have a good numeral or personal name identification
algorithm. Many of the mistakes assigned to the heuristic rules segmenter are solved
by running a script on the output of the segmenter to merge names and numerals, with an
algorithm similar to the one found in the modified MMS segmenter. However, even with
the script, the rules segmenter makes some surprising mistakes. Overall, aside from
personal names and numbers, it was able to solve more problems than it created.
Sample segmentation result for: (loose cargo container ships)
Correct segmentation output: (Chinese text omitted)
MMS segmentation output: incorrect, because the match was greedy
Heuristic rules segmentation output: correct, because of the minimum variance rule
Figure 20 Comparison result for the MMS and heuristic rule segmenters. The heuristic
rule segmenter made the correct segmentation judgment.
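The greedy maximum-matching strategy can be sketched as follows, using the standard 研究生命起源 ("study the origin of life") ambiguity rather than the figure's example; the tiny dictionary is an invented sample.

```python
# Greedy maximum matching (MMS): at each position, take the longest
# dictionary word starting there. Greediness is exactly what produces
# wrong cuts like the one in Figure 20.

DICT = {"研究", "研究生", "生命", "命", "起源"}  # invented sample dictionary
MAX_LEN = 3

def mms_segment(text):
    out, i = [], 0
    while i < len(text):
        for n in range(min(MAX_LEN, len(text) - i), 0, -1):
            word = text[i:i + n]
            if n == 1 or word in DICT:   # single chars always fall through
                out.append(word)
                i += n
                break
    return out

print(mms_segment("研究生命起源"))
# greedy pick of "研究生" first → ['研究生', '命', '起源']
# (the intended reading is 研究 / 生命 / 起源)
```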
The implementation of the unigram model proved to be very difficult. The reason is the
frequency list that we are using. Some of the potential problems have already been
discussed in the methods section, mainly the poor match between the wordlist that
contains the frequency data and the wordlist that we are using to build the segmenter.
The result is that the unigram model for segmentation functions just like the heuristic
rules segmenter: the frequencies of the word segments did not vary enough to make a
real difference in most of the segmentation ambiguities. However, we will soon acquire
a much better frequency wordlist from the University of Pennsylvania. This wordlist is
machine compiled and manually edited by linguistic researchers, and mostly agrees with
our segmentation guidelines. Integrating it into the unigram frequency model segmenter
should produce results more in line with the expected improvement.
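The unigram frequency model can be sketched as follows: score each candidate segmentation by the total log-frequency of its words and keep the best. The frequency counts are invented samples; with a poorly matched frequency list the scores barely differ, which is the degeneration described above.

```python
# Unigram scoring of candidate segmentations: prefer the segmentation
# whose words have the highest total log-probability. Counts are invented
# samples; unseen words get a floor count of 1.
import math

FREQ = {"研究": 500, "研究生": 120, "生命": 400, "命": 50, "起源": 80}

def unigram_score(words, total=sum(FREQ.values())):
    # product of unigram probabilities, summed in log space
    return sum(math.log(FREQ.get(w, 1) / total) for w in words)

a = ["研究", "生命", "起源"]
b = ["研究生", "命", "起源"]
best = max([a, b], key=unigram_score)
print(best)  # → ['研究', '生命', '起源']
```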
Even though the theories behind the segmenters we built or modified represent the best
ideas in the field right now, our segmentation results still lag behind those of Beijing
University's segmentation software [45]. The OTS software seems to have a better
lexicon and a more sophisticated way to disambiguate conflicts. It also seems to have
better frequency wordlists, which help most of the time but throw the segmenter off in
some rare instances (see Figure 9 for a specific example). But unless we make
significant improvements to our lexicon and add a good LISP-level correction module
(see discussion), the software of choice is still PKU's segmentation/POS program. This
view is reinforced when we take a closer look at the results from training in part of
speech.
5.2.3
Part of speech tagging
Segmentation cannot be viewed independently of part of speech. Without a solid part of
speech tagging module, our segmentation effort is fruitless once we reach the phrasing
level. We trained a LISP part of speech tagging module over a segmented corpus of the
People's Daily data. The training took around 3 days, and the learner generated close
to 100 rules. The final result was an 83.7% f-measure on the training data and a 79.2%
f-measure on the testing data. This is significant because, at this level of
performance from machine-generated rules, we believe manual correction or manually
written rules would significantly boost performance.
We do not have a quantitative measurement of the performance of the OTS software. From
observation, it made some systematic mistakes. Most of the mistakes are not the result
of mis-segmentation; they result from conflicts between part of speech assignments. An
example of such a mistake is presented in the figure below.
POS tagging result for the OTS software (sentence: "Your criticism is correct")
Correct POS tags: .../r .../u .../vn .../d .../a .../w
Mis-tagged as:    .../r .../u .../v  .../d .../a .../w
(Chinese text omitted; the error is tagging the verbal noun, vn, as a verb, v.)
Figure 21 POS mistakes made by OTS software.
[45] See Appendix A5
Such examples of mistakes are few, though, and are usually due to a missing word in the
part of speech lexicon. The general performance is higher than that of the
machine-learned rules. This will continue to be the case until we have better
resources on which to build better rules.
In conclusion, the results showed that we can build a rudimentary segmentation and part
of speech tagging system based on the resources that we have derived or obtained from
free web sites. The performance of this system is on par with capable software in the
industry right now. Using the same framework, we can boost this performance
substantially if we can obtain better wordlists and manually generate rules for the
part-of-speech tagging module. Despite this confidence, we also acknowledge that the
current system underperforms compared to Beijing University's OTS segmentation and POS
software. Thus, the Beijing University software should be licensed. There will be two
main benefits. First, we will already have state-of-the-art software to complete phase
1 of this project. Second, we will have a very complete part of speech dictionary and
a finely segmented wordlist, which will allow a more complete exploration of our own
approaches to phase 1.
5.3 Results from phase 2
5.3.1
Machine Learned Rules for NE Task
Tags | No. Rules | F-measure | Resources
NR, NT, NS | 50 | 65.27% | None; no wordlists were provided (cold train session)
NR, NT, NS | 80 | 84.91% | Part of speech wordlists derived from People's Daily
NR, NT, NS | 150 | 87.58% | POS wordlists
NR, NT, NS | 70 | 86.98% | POS wordlists, supplement lists, suffix lists
NR | 10 | 99.91% | POS wordlists, surname lists, foreign phonetic characters
NT | 10 | 41.50% | POS wordlists
NT | 50 | 53.80% | POS wordlists
NT | 50 | 57.24% | POS wordlists, org morpheme list, loc morpheme list
NS | 40 | 88.90% | POS wordlists
NS | 50 | 90.50% | POS wordlists
Table 5 The training results for NE task.
The above table shows the main results from the training logs for the NE task. The
first row shows an f-measure of 65.27%; this is training without any resources. All of
the later training runs take advantage of the wordlists derived from the PKU data. The
f-measures shown are in all cases results for the testing data, not the training data.
Except for the first learning session, the training data always has a 3 to 4 percent
advantage over the test data. This is attributed to the fact that the wordlists are
derived from the training data, and that the rules are tailored specifically to boost
performance on the training data.
There are some clear trends in the numbers in the table. We will first focus on the
runs that learned rules for tagging all the named entities at once. In general,
performance goes up as the number of rules increases, even when no additional resources
are added. We can see this in the improvement in f-measure from row 2 to row 3: there
is a 2.67% increase in performance as the number of rules is almost doubled, from 80 to
150. As the machine reaches the limits of its intelligence, each additional rule
results in only a very small increase in performance, and sometimes none at all on the
testing data. Thus we see the clear plateau.

With the addition of new resources such as wordlists, performance goes up even if the
number of rules does not. This is illustrated in the 2.07% increase in performance
from row 2 to row 4: the number of rules decreases from 80 to 70, but this is more than
offset by the addition of useful new resources.
After reviewing the tagged results from the first sequences of learning, we started
training the machine on one tag at a time. Immediately, it became clear that the
encouraging results we had were mostly due to the correct part of speech information
from segmentation. Personal names account for the majority of the NE tags (70%). An
independent training run on personal names achieved a 99.91% f-measure within 10 rules.
But we know personal names are already pre-tagged as NR; the surname and the given name
are separate, but a simple rule that puts them together results in a very high score.
The second most prevalent named entities are locations. They are also pre-tagged in
the segmentation process, and the majority of locations are single words with NS as
their part of speech. When the trainer learns to independently tag NS, the first rule
it generates converts the NS part of speech tag to an NS phrase tag. This immediately
takes its f-measure to the high 70s. Recall is very high at that point, but precision
is fairly low, because it over-flags all the locations that are used inside
organization names. The following rules mainly try to correct the over-flagging. We
can see that the location tagging learner achieved a 90.50% f-measure in 50 rules.
The only named entity task on which the machine cannot "cheat" is organizations. There
are organizations that are only one word, but those are rare, so the learner cannot use
the part of speech information directly. The result is a punishing score on the
organization tagging task (57.24% with 50 rules). A list of morphemes that denote some
kind of organizational structure was generated and given to the learner as part of its
resources. This resulted in only a slight increase in performance; the learner does
not really understand how to use this list to maximum advantage. Manually generated
rules that truly take advantage of the wordlists and patterns can achieve high results
with minimal rule sequences.
5.3.2  Improvements with Addition of Custom Wordlists and Manually Generated Rules for NE Task
Eighteen rules were written for the NE task of organization tagging; see Appendix
B3 for the complete list. The construction of these rules is discussed in the last
section of the methods chapter. They achieved significantly higher results: by the
Alembic scorer, the f-measure for organizations was 73.97%, and scoring with the
program scorify produced an f-measure of 74.47%. The report summary showed all the
correct, missed, and spurious tags. There were many spurious tags, which kept the
f-measure low. However, linguists who examined the spurious tags determined that at
least 30% of them are actually legitimate. This is a case of disagreement over the NE
guidelines with the People's Daily annotated data, which seemed inconsistent in many
respects. The following table shows the kind of organization that the annotators would
tag in one article but ignore in another.
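For reference, the f-measures quoted throughout can be derived from the scorer's correct, missed, and spurious counts. A minimal sketch, with hypothetical counts rather than the actual scorer output:

```python
def f_measure(correct, missed, spurious):
    """Balanced f-measure from a scorer's correct / missed / spurious counts."""
    precision = correct / (correct + spurious)
    recall = correct / (correct + missed)
    return 2 * precision * recall / (precision + recall)

# Spurious tags depress precision, and hence the f-measure, even when
# recall is good; the counts here are hypothetical.
score = f_measure(correct=100, missed=30, spurious=40)   # 20/27, about 0.7407
```

This is why re-judging a portion of the spurious tags as legitimate would raise the reported f-measure noticeably.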
Organizations (the Chinese originals are not reproducible in this electronic copy); translations:
Waste treatment plant
Banyanpu College
Representative Body for Committee on Peace
Beijing Military Unit
Literature Studies Society
Central Military Task Force
Figure 22  A partial list of tags marked as errors by the scorer because they are not tagged as
organizations in the PKU's People's Daily data.
The result shows clearly that we can achieve high performance on NE tasks in
Chinese using the Alembic rules environment with the right wordlists and rule sequences.
Noticeable improvements come with every addition to the still-limited wordlist. To
avoid spurious tagging, a list of non-organizations containing organization morphemes
was generated to aid the rule-writing process.
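The interplay between the organization-morpheme wordlist and the negative list can be sketched as follows. This is an illustrative Python model, not the Alembic rules; the English suffixes stand in for Chinese morphemes and are purely hypothetical.

```python
# Stand-ins for the Chinese organizational-morpheme wordlist and the
# negative list of non-organizations; both lists are hypothetical examples.
ORG_SUFFIXES = ("plant", "college", "society", "committee")
NON_ORGS = {"electoral college"}

def org_candidate(phrase):
    """Flag a phrase ending in an organization morpheme, unless it appears
    on the negative list of known non-organizations."""
    return phrase.endswith(ORG_SUFFIXES) and phrase not in NON_ORGS

org_candidate("waste treatment plant")   # True
org_candidate("electoral college")       # False: suppressed to avoid a spurious tag
```

The negative list is what keeps the morpheme wordlist from over-flagging, mirroring the role it plays in the manually written rules of Appendix B3.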
6. Discussion
6.2  Separation of phase 1 and phase 2
When named entity detection was attempted in the Alembic LISP-based system, a
correctly tagged corpus from the People's Daily was fed into it. This is a point of
controversy, because the corpus is perfectly tagged: how can we know the real
performance of the NE task? The reason perfect data was passed to the Alembic system
for training is a combination of time constraints and the effort to train a pure named
entity tagger, without error correction for segmentation and part-of-speech tagging.
We wanted a modular system. A post-processing module is envisioned to make
corrections to the segmentation and part-of-speech tagging output. Our NE rules will
run after the correction module, so they remain pure to the NE task; whatever error
correction could be done in the LISP rule-based environment would have been done
already. To build a modular NE phraser, the best data to pass it is correctly tagged
data.
So what is the real performance of the system up to this point? There is no
quantitative answer to that question yet; it will be answered as part of our future
work, but it does not belong in this project report.
6.3  Segmentation
Can segmentation be done using a rule-based system? The initial results are not good.
The instinctive answer would be no, because rules are applied over the whole corpus, and
they do not facilitate scanning strings in a localized area. However, there may be a
curious way to implement a rule-based segmenter that would actually use frequency
information and have a strange form of "look ahead".
The corpus to be segmented would be passed to the segmenter completely chunked
into individual morphemes (the same format we are using to pass to the Alembic
segmenter right now). We can arrange the wordlist according to the frequency of words,
starting from the most frequent and ending with the least frequent. We would then have
one rule that goes down the wordlist and scans the corpus for each word. If it finds
words or morphemes that can be merged to form the word from the wordlist, it merges
them. It runs through each word in the list and attempts to find a match, but it only
declares a match if the new word can be formed without tearing an old bond apart. This
way we will always be tagging the most frequently occurring words whenever there are
conflicts in segmentation. We can also try the same segmentation algorithm but arrange
the wordlists according to length. This will be explored in future work.
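Under the stated assumptions (input fully chunked into single morphemes, wordlist ordered by descending frequency), the proposed segmenter can be sketched as follows. This is a hypothetical Python rendering of the idea, not an implemented Alembic rule; Latin letters stand in for Chinese characters.

```python
def segment(morphemes, wordlist_by_freq):
    """Greedy rule-based segmenter sketch: merge adjacent chunks into the
    most frequent words first, never tearing an existing bond apart."""
    chunks = list(morphemes)                       # start fully chunked
    for word in wordlist_by_freq:                  # most frequent first
        i = 0
        while i < len(chunks):
            # try to realize `word` as a run of whole chunks starting at i
            j, built = i, ""
            while j < len(chunks) and len(built) < len(word):
                built += chunks[j]
                j += 1
            if built == word and j - i > 1:
                chunks[i:j] = [word]               # merge; old bonds stay intact
            i += 1
    return chunks

# Hypothetical example, with the wordlist ordered most-frequent-first:
segment(list("abcd"), ["ab", "bc", "cd"])   # -> ["ab", "cd"]
# "ab" merges first; "bc" then cannot form without tearing the "ab"
# bond apart; "cd" merges.
```

The length-ordered variant mentioned above would simply pass a wordlist sorted by word length instead of frequency.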
6.4  Post-process for OTS module
In testing the Beijing University software we noticed some systematic errors that it
makes. These errors can be corrected using a module in Alembic. One of the common
errors is mis-tagging, resulting from oversights when the lexicon was built. One
example of this is presented in figure 21. A rule that can correct this particular
mistake is:
(def-phraser-rule
  :anchor     :lexeme
  :conditions (:wd :pos :|v|)
              (:lwd-1 :pos :|u|)
  :actions    (:set :wd :pos :|vn|))
This is just one example of many errors that we can attempt to correct in Alembic.
Such corrections should boost the performance of our segmentation and POS tagging.
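The correction this kind of rule performs can be restated in Python for clarity. This is an illustrative sketch, not Alembic code, and the example words are hypothetical romanizations.

```python
def correct_v_after_u(tagged):
    """Re-tag a verb (v) as a verbal noun (vn) when it follows a particle (u),
    one systematic error pattern observed in the OTS tagger's output.

    tagged is a list of (word, pos) pairs; returns a corrected copy."""
    out = list(tagged)
    for i in range(1, len(out)):
        if out[i][1] == "v" and out[i - 1][1] == "u":
            out[i] = (out[i][0], "vn")
    return out

correct_v_after_u([("de", "u"), ("fazhan", "v")])
# -> [("de", "u"), ("fazhan", "vn")]
```

A post-processing module would chain many such corrections, one per systematic error found in the OTS output.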
7.  Conclusion
The mass of electronic information grows exponentially as more of the world
becomes wired. Intelligent content extraction software that can help individuals and
organizations make mission critical decisions based on this growing maze of information
is becoming a necessity. A part of this effort focuses on building a multilingual
information extraction system. This Master's project is a comprehensive study of the
preliminary requirements to outfitting MITRE's Alembic Natural Language Processing
tool with Chinese Automatic Content Extraction capabilities. It included the study of
segmentation, part-of-speech tagging, and named entity phrasing tasks.
This project has successfully explored the essential tasks of segmentation and part-of-speech
tagging for Chinese. We can draw the following conclusions from our results:
1. Existing segmentation algorithms have been implemented for the pre-Alembic process, based on code written and modified through this project. Results are on par with industry standards.
2. Replacement of the wordlists and resources within the existing segmentation/POS system can result in significant improvements.
3. State-of-the-art software from external sources is shown to work well with the Alembic environment. Should we choose not to develop our own segmenter, licenses for OTS software can be readily obtained.
This project has also successfully explored the Named Entity task. We can draw the
following conclusions from our efforts:
1. High performance can be achieved through a combined effort between segmentation, POS tagging, and rule-based phrasing, despite the unique problems facing Chinese text processing.
2. Manually generated rules and wordlists, based on scientific analysis of named entities in Chinese, can perform significantly better than simple machine-learned rules.
8.  Bibliography, References, and Resources
8.2 Web-based resources
http://www.mitre.org/resources/centers/it/063/alembic.html
http://www.mitre.org/resouces/centers/it/gO63/workbench.html
http://www.icl.pku.edu.cn/nlp-tools/segtagtest.htm
http://www.icl.pku.edu.cn/Introduction/corpustagging.htm
http://www.MandarinTools.com
http://www.geocities.com/hao510/mmseg/
http://www.ldc.upenn.edu/ctb/
http://www.chinesecomputing.com/
8.3 Academic papers and publications
Benjamin K. Tsou, Hing-cheung Ho ..., A Knowledge-based Machine-aided System for
Chinese Text Abstraction, In Proceedings of the COLING 1992, Page 1039-1042
Benjamin L. Chen and Von-Wun Soo, An Acquisition Model for both Choosing and
Resolving Anaphora in Conjoined Mandarin Chinese Sentences, In Proceedings of the
COLING 1992, Page 274-357
Chang, Chao-Huang, and Cheng-Der Chen, A study on integrating Chinese word
segmentation and part-of-speech tagging. Communications of COLIPS, 3(2), December
1993
Chao, Yuen Ren. 1968. A Grammar of Spoken Chinese. University of California Press,
1968
Chen, Aitao, Jianzhang Hu, and Liangjie Xu. 1997. Chinese text retrieval without using
a dictionary. Proceedings of the 20th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval. Philadelphia, Pennsylvania, July
1997
Chu-Ren Huang, Keh-Jiann Chen, A Chinese Corpus for Linguistic Research, In
Proceedings of the COLING 1992, Page 1214-1217
Huang, Chu-Ren; Chen, Keh-Jiann; Chang, Lili; and Chen, Feng-yi. 1997. Segmentation
standard for Chinese natural language processing. International Journal of
Computational Linguistics and Chinese Language Processing, 2(2), 47-62
Jerry Norman, Chinese, Cambridge University Press, 1998, Page 152-180
Jerome L. Packard. The Morphology of Chinese: A Linguistic and Cognitive Approach.
Cambridge University Press, 2000
Jian-Yun Nie, Marie-Louise Hannan, Wanying Jin, Unknown Word Detection and
Segmentation of Chinese using Statistical and Heuristic Knowledge, October 1995
Jones, Russell. 1997. Chinese Names: The Traditions Surrounding the Use of Chinese
Surnames and Personal Names. Pelanduk Publications, Selangor Darul Ehsan, Malaysia,
1997
Jyun-Sheng Chang and Andrew Chang, A Statistical Approach to Machine Aided
Translation of Terminology Banks, In Proceedings of the COLING 1992, Page 921-925
Jyun-Shen Chang, Shun-De Chen, Ying Zheng, Xian Zhong Liu, and Shu-Jin Ke, Large
Corpus Based Methods for Chinese Personal Name Recognition, Journal of Chinese
Information Processing, 6(3):7-15
Karen Sparck Jones, Automatic Summarization: Factors and Directions, In Advances in
Automatic Text Summarization, The MIT Press 1999: Page 1-14
Keh-Jiann Chen, Chinese Sentence Parsing, Presentation Slides for COLING 92.
Keh-Jiann Chen and Shing-Huan Liu, Word Identification for Mandarin Chinese
Sentences, In Proceedings of the COLING 1992, Page 101-107
Kwok, K.L. 1997. Comparing representation of Chinese information retrieval.
Proceedings of the 20th Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, Philadelphia, Pennsylvania, July 1997
Liang-Jyh Wang, Wei-Chuan Li, and Chao-Huang Chang, Recognizing Unregistered
Names for Mandarin Word Identification, In Proceedings of the COLING 1992, Page
1239-1243
Liu, Shing-Huan. 1995. An Automatic Translator between Traditional Chinese and
Simplified Chinese in Unicode. Proceedings of the 7th International Unicode Conference,
San Jose, California, September 1995
Richard Sproat, Nancy Chang, A Stochastic Finite-State Word Segmentation Algorithm
for Chinese, 5 May 1994
Packard, Jerome L., editor. 1997. New Approaches to Chinese Word Formation:
Morphology, Phonology and the Lexicon in Modern and Ancient Chinese. Mouton de
Gruyter, Berlin, 1997
Thomas Emerson, Segmenting Chinese in Unicode, Technical report for Basis
Technology Corporation, In Proceedings of the 16th International Unicode Conference.
Wu, Zimin and Gwyneth Tseng. 1995. ACTS: An automatic Chinese text segmentation
system for full text retrieval. Journal of the American Society for Information Science,
46(2):83-96, March 1995
Yuan Chunfa, Huang Changning and Pan Shimei, Knowledge Acquisition and Chinese
Parsing Based on Corpus, In Proceedings of the COLING 1992, Page 1300-1304
8.4 MITRE reports
David D. Palmer, A Trainable Rule-based Algorithm for Word Segmentation, In
Proceedings of ACL-1997, 1997
David D. Palmer, Tokenisation and Sentence Segmentation, A Handbook for Natural
Language Processing, 1997
David D. Palmer and John D. Burger, Chinese Word Segmentation and Information
Retrieval, 1997
John Aberdeen, John Burger, David Day, Lynette Hirschman, Patricia Robinson, and
Marc Vilain, Description of the Alembic System Used for MUC-6, Released for
publication at the 6th Message Understanding Conference (MUC-6), 1997
Lisa Ferro, Marc Vilain, Alexander Yeh, Learning Transformation Rules to Find
Grammatical Relations, Released to Computational Natural Language Learning
workshop 1999
Marc Vilain, Inferential Information Extraction, Internal Report and Guidelines.
8.5 Operating Manuals and ACE/MUC Guidelines
ACE Team Members and Participants, EDT Metonymy Annotation Guidelines Version
2.1, Draft Released at ACE meeting 2001
ACE Team Members and Participants, Entity Detection and Tracking - Phase 1, ACE
Pilot Study Task Definition, 2000
Ada Brunstein and Lance Ramshaw, Preliminary Guidelines for Relation Detection and
Characterization (RDC), Draft Version 2.2, Released at ACE meeting 2001
Beatrice Santorini, Part-of-Speech Tagging Guidelines for Penn Treebank Project,
University of Pennsylvania, June 1990
Fei Xia, The Segmentation Guidelines for the Penn Chinese Treebank (3.0), University of
Pennsylvania, October 2000
Multilingual Entity Task (MET) Definition for Chinese, 01 March 1996
Shi-Wen Yu, Modern Mandarin Part of Speech Lexicon Guidelines, Beijing University
Institute for Computational Linguistics, 1999
Appendix A Sample Segmentation Results
A1  Sample Sentence Tagged output
<DOC>
<METID> 001 </METID>
<ID> raw/950715/tg/gnxw0718.15 </ID>
<DATE> 1995 07 18 </DATE>
<HL> Vf4-] *t </HL>
<AU> Domestic News Dept. </AU>
<TEXT>
<p>
[The SGML-tagged Chinese sample sentence output is not reproducible in this electronic copy.]
</TEXT>
</DOC>
A2  Sample segmentation output from MMS
<DOC><TXT>
<LEX>001 </LEX>
<LEX>raw/950715/tg/gnxw0718.15</LEX>
<LEX>1995</LEX><LEX>07</LEX><LEX>18</LEX>
<LEX>fT-[J-</LEX><LEX>*-t</LEX>
<LEX>Domestic</LEX><LEX>News</LEX><LEX>Dept.</LEX>
[The remainder of the MMS segmentation sample, SGML-tagged Chinese text with <LEX> and <ENAMEX> markup, is not reproducible in this electronic copy.]
</TXT></DOC>

A3
Sample segmentation output from three word model
<LEX>001</LEX>
<LEX>raw/950715/tg/gnxw0718.15</LEX>
<LEX>1995</LEX><LEX>07</LEX><LEX>18</LEX>
<LEX>Domestic</LEX><LEX>News</LEX><LEX>Dept.</LEX>
[The remainder of the three-word-model segmentation sample, SGML-tagged Chinese text, is not reproducible in this electronic copy.]

A4
Sample segmentation output from Unigram model
<DOC><TXT>
<LEX>001</LEX>
<LEX>raw/950715/tg/gnxw0718.15</LEX>
<LEX>1995</LEX><LEX>07</LEX><LEX>18</LEX>
<LEX>Domestic</LEX><LEX>News</LEX><LEX>Dept.</LEX>
[The remainder of the Unigram-model segmentation sample, SGML-tagged Chinese text, is not reproducible in this electronic copy.]
</TXT></DOC>
A5
Sample segmentation output from PKU segmenter
<DOC>
<METID> 001 </METID>
<ID> raw/950715/tg/gnxw0718.15 </ID>
<DATE> 1995 07 18 </DATE>
<HL> fT[-j-t </HL>
<AU> Domestic News Dept. </AU>
<TEXT>
<p>
[The POS-tagged sample output from the PKU segmenter, Chinese text, is not reproducible in this electronic copy.]
</TEXT>
</DOC>
Appendix B Selected Code
B1  Sample spec file
INCLUDE  general-specs.spec

*.collection                      CHINESE-LDC
*.language                        ZH

########################################################################
###  Preprocess                                                      ###
########################################################################

prelembic.process?                NIL
lisp.phases                       "NE"
ne.language=ZH.passes             "ML-NE"
ne.language=ZH.pass-order         "ML-NE"
ne.language=ZH.pass=ML-NE.rules   rules/danhu-learned-2##rules.lisp
ne.language=ZH.pass=ML-NE.rules   /afs/rcf/project/corporal/chinese/danhu/traininglog/rules-exp-44235.lisp

lisp.phase=NE.language=ZH.out-tags            "NR NT NS"
lisp.phase=NE.language=ZH.tag=NR.start-tag    "<ENAMEX\ TYPE=\"PERSON\">"
lisp.phase=NE.language=ZH.tag=NT.start-tag    "<ENAMEX\ TYPE=\"ORGANIZATION\">"
lisp.phase=NE.language=ZH.tag=NS.start-tag    "<ENAMEX\ TYPE=\"LOCATION\">"
lisp.phase=NE.language=ZH.end-tag             "</ENAMEX>"

B2  Sample learned rules file for phrasing
The following rules were learned for Organization Tagging
;; Rule #1
(def-phraser-rule
:LEXEME
:anchor
(:NOT (:WD
:conditions
(:LWD-1
:actions
(:CREATE
:EVAL #I'PART-OF-SOME-PHRASE?))
:IntI)
:LEX
:WD
:WD
:NT)
Rule #2
(def-phraser-rule
:conditions
(:PHRASE
:PHRASE-LABEL
(:LWD-1 :LEX-SUFFIX
:actions
"4"
:NT)
)
(:SET-LABEL :NT)
(:EXPAND :LEFT-1)
)
Rule #3
;;
(def-phraser-rule
(:PHRASE :PHRASE-LABEL :NT)
:conditions
(:LEFT-1 :P-O-S :InI)
:actions
(:SET-LABEL :NT)
(:EXPAND :LEFT-1)
)
;;
Rule #4
(def-phraser-rule
:LEXEME
:anchor
:conditions
(:NOT (:WD :EVAL #'PART-OF-SOME-PHRASE?))
Author: Daniel Hu
MIT-EECS
Page 84
Development of Part of Speech Tagging and Syntactic Analysis Software for Chinese Text
(:RIGHT-1
:actions
(:CREATE
:LEX-SUFFIX
:WD
:WD
5/29/01
"k")
:NT)
Rule #5
(def-phraser-rule
(:PHRASE :PHRASE-LABEL :NT)
:conditions
(:RIGHT-1 :LEX-SUFFIX "Eli")
:actions
(:SET-LABEL :NT)
(:EXPAND :RIGHT-1)
Rule #6
(def-phraser-rule
:conditions
:actions
:PHRASE-LABEL
:LEX :InsI)
(:SET-LABEL :NT)
(:PHRASE
(:LEFT-1
(:EXPAND
:NT)
:LEFT-1)
Rule #7
(def-phraser-rule
:conditions
(:PHRASE
(:RWD-1
:actions
:PHRASE-LABEL
:NT)
:LEX "ODN&')
(:SET-LABEL :NT)
(:EXPAND :RIGHT-1)
Rule #8
(def-phraser-rule
(:PHRASE :PHRASE-LABEL :NT)
:conditions
(:LWD-2 :LEX-SUFFIX "2')
(:SET-LABEL :NT)
:actions
(:EXPAND
:LEFT-1)
)
Rule #9
(def-phraser-rule
(:PHRASE
:conditions
(:LEFT-1
:actions
:PHRASE-LABEL :NT)
:P-O-S :InsI)
(:SET-LABEL
(:EXPAND
:NT)
:LEFT-1)
Rule #10
(def-phraser-rule
:conditions
(:PHRASE :PHRASE-LABEL
(:RWD-1 :LEX :InrI)
:actions
(:SET-LABEL
:NT)
:IGNORE)
)
;;
Rule
#11
(def-phraser-rule
(:PHRASE :PHRASE-LABEL :NT)
:conditions
(:LWD-1 :LEX :011)
(:SET-LABEL :IGNORE)
:actions
Rule #12
(def-phraser-rule
:anchor
:LEXEME
:conditions
(:NOT (:WD :EVAL #'PART-OF-SOME-PHRASE?))
(:RIGHT-1 :LEX "Y~IA")
Author: Daniel Hu
MIT-EECS
Page 85
Development of Part of Speech Tagging and Syntactic Analysis Software for Chinese Text
:actions
(:CREATE
:WD
:WD
5/29/01
:NT)
Rule #13
(def-phraser-rule
:conditions
(:PHRASE :PHRASE-LABEL :NT)
(:RIGHT-1 :LEX :ZHORGSUFFIX)
:actions
(:SET-LABEL :NT)
(:EXPAND
:RIGHT-1)
Rule #14
(def-phraser-rule
:conditions
:actions
(:PHRASE
(:LEFT-1
:PHRASE-LABEL
(:EXPAND
:LEFT-1)
:NT)
:InzI)
(:SET-LABEL :NT)
:LEX
Rule #15
(def-phraser-rule
:conditions
(:PHRASE
(:LEFT-1
:actions
(:SET-LABEL :NT)
(:EXPAND
:PHRASE-LABEL
:LEX :InsI)
:NT)
:LEFT-1)
Rule #16
(def-phraser-rule
:conditions
:actions
(:PHRASE
:PHRASE-LABEL
(:RIGHT-1
:LEX-SUFFIX
:NT)
"3/")
(:SET-LABEL :NT)
(:EXPAND
:RIGHT-1)
Rule #17
(def-phraser-rule
:conditions
(:PHRASE :PHRASE-LABEL
(:LWD-1
:LEX :IvI)
:actions
(:SET-LABEL
(:CONTRACT
:NT)
:NT)
:LWD-2)
Rule #18
(def-phraser-rule
:PHRASE-LABEL
:conditions
(:PHRASE
:actions
:LEX "i-6±>4")
(:RWD-1
(:SET-LABEL :NT)
(:EXPAND :LEFT-1)
:NT)
Rule #19
(def-phraser-rule
:conditions
(:PHRASE :PHRASE-LABEL :NT)
(:RIGHT-1 :P-O-S :|j|)
:actions
(:SET-LABEL :NT)
(:EXPAND :RIGHT-1))
Rule #20
(def-phraser-rule
:anchor :LEXEME
:conditions
(:NOT (:WD :EVAL #'PART-OF-SOME-PHRASE?))
(:RIGHT-1 :LEX "EDi~")
:actions
(:CREATE :WD :WD :NT))

B3
Manually Written Rules for Organization Tagging
(def-phraser-rule
:anchor :lexeme
:conditions
(:wd :lex-suffix :ZHORGSUFFIX)
(:wd :p-o-s (:|n| :|j| :|l|))
(:NOT (:wd :eval #'part-of-some-phrase?))
:actions
(:create :wd :wd :ORG-HEAD)
:traced? NIL)
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
(:wd-any :lex-suffix :ZHORG_NEGSUFFIX)
:actions
(:set-label :IGNORE))
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
(:wd-any :p-o-s :|n|)
(:left-1 :p-o-s :|n|)
(:left-2 :p-o-s :|nr|)
(:left-3 :p-o-s :|ns|)
:actions
(:expand :left-3)
(:set-label :ORGANIZATION))
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
(:wd-any :p-o-s :|n|)
(:left-1 :p-o-s :|n|)
(:left-2 :p-o-s :|ns|)
(:left-3 :p-o-s :|ns|)
:actions
(:expand :left-3)
(:set-label :ORGANIZATION))
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
(:wd-any :p-o-s :|n|)
(:left-1 :p-o-s :|vn|)
(:left-2 :p-o-s :|n|)
(:left-3 :p-o-s :|n|)
:actions
(:expand :left-3)
(:set-label :ORGANIZATION))
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
(:wd-any :p-o-s :|n|)
(:left-1 :p-o-s :|n|)
(:left-2 :p-o-s :|nz|)
(:left-3 :p-o-s :|ns|)
:actions
(:expand :left-3)
(:set-label :ORGANIZATION))
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
(:wd-any :p-o-s :|n|)
(:left-1 :p-o-s (:|ns| :|nz|))
(:left-2 :p-o-s :|ns|)
:actions
(:expand :left-2)
(:set-label :ORGANIZATION))
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
(:wd-any :p-o-s :|n|)
(:left-1 :p-o-s :|n|)
(:left-2 :p-o-s :|ns|)
:actions
(:expand :left-2)
(:set-label :ORGANIZATION))
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
(:wd-any :p-o-s :|n|)
(:left-1 :p-o-s :|nz|)
(:left-2 :p-o-s :|ns|)
:actions
(:expand :left-2)
(:set-label :ORGANIZATION))
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
(:wd-any :p-o-s :|n|)
(:left-1 :p-o-s (:|vn| :|b|))
(:left-2 :p-o-s :|ns|)
:actions
(:expand :left-2)
(:set-label :ORGANIZATION))
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
(:wd-any :p-o-s :|n|)
(:left-1 :p-o-s :|n|)
(:left-2 :p-o-s :|nz|)
:actions
(:expand :left-2)
(:set-label :ORGANIZATION))
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
(:wd-any :p-o-s :|n|)
(:left-1 :p-o-s :|n|)
(:left-2 :p-o-s :|n|)
(:left-3 :p-o-s (:|n| :|vn|))
:actions
(:expand :left-3)
(:set-label :ORGANIZATION))
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
(:wd-any :p-o-s :|n|)
(:left-1 :p-o-s :|vn|)
(:left-2 :p-o-s :|n|)
(:left-3 :p-o-s :|ns|)
:actions
(:expand :left-3)
(:set-label :ORGANIZATION))
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
(:wd-any :p-o-s :|n|)
(:left-1 :p-o-s :|n|)
(:left-2 :p-o-s :|n|)
:actions
(:expand :left-2)
(:set-label :ORGANIZATION))
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
(:wd-any :p-o-s :|n|)
(:left-1 :p-o-s (:|n| :|nt|))
:actions
(:expand :left-1)
(:set-label :ORGANIZATION))
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
(:wd-any :p-o-s :|n|)
(:left-1 :p-o-s (:|nz| :|ns| :|j|))
:actions
(:expand :left-1)
(:set-label :ORGANIZATION))
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
(:wd-any :p-o-s :|j|)
(:left-1 :p-o-s (:|nt| :|ns| :|n|))
:actions
(:expand :left-1)
(:set-label :ORGANIZATION))
(def-phraser-rule
:anchor :lexeme
:conditions
(:NOT (:wd :eval #'part-of-some-phrase?))
(:wd :p-o-s :|nt|)
:actions
(:create :wd :wd :ORGANIZATION)
:traced? NIL)
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORGANIZATION)
(:left-1 :p-o-s :|ns|)
:actions
(:expand :left-1))
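The hand-written rules above all follow one pattern: anchor on a phrase already labeled ORG-HEAD, inspect the part-of-speech tags of the words to its left, then grow the phrase leftward and relabel the result ORGANIZATION. The following is a toy sketch of that pattern, not Alembic's actual rule engine; the token layout, tag names, and function name here are invented for the illustration:

```python
# Toy illustration of an "expand ORG-HEAD leftward" phraser rule.
# A phrase is a (start, end, label) span over a list of (word, pos) tokens.

def expand_org_head(tokens, phrase, left_pos_allowed=("ns", "nz", "n")):
    """If the token just left of an ORG-HEAD phrase carries an allowed
    POS tag, grow the phrase over it and relabel it ORGANIZATION."""
    start, end, label = phrase
    if label != "ORG-HEAD":
        return phrase                      # rule only fires on ORG-HEAD anchors
    if start > 0 and tokens[start - 1][1] in left_pos_allowed:
        return (start - 1, end, "ORGANIZATION")
    return phrase

tokens = [("Beijing", "ns"), ("University", "n")]  # place name + org head word
phrase = (1, 2, "ORG-HEAD")                        # covers "University" only
print(expand_org_head(tokens, phrase))             # -> (0, 2, 'ORGANIZATION')
```

A real rule set, as above, chains many such conditions (left-1 through left-3) so multi-word modifiers are absorbed in one firing.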
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
:actions
(:set-label :IGNORE))

B4
MMS Segmenter
#!/usr/bin/perl
# Read in the lexicon
open(WRDS, "/afs/rcf/project/corporal/chinese/danhu/wordlist.txt") or die "Can't open wordlist\n";
while (<WRDS>) {
    chomp;
    next if /^\#/;
    $cwords{$_} = 1;
    if (length($_) == 6) {
        if (!exists($cwords{substr($_, 0, 4)})) {
            $cwords{substr($_, 0, 4)} = 2;
        }
    }
    if (length($_) == 8) {
        if (!exists($cwords{substr($_, 0, 4)})) {
            $cwords{substr($_, 0, 4)} = 2;
        }
        if (!exists($cwords{substr($_, 0, 6)})) {
            $cwords{substr($_, 0, 6)} = 2;
        }
    }
    if (length($_) == 10) {
        if (!exists($cwords{substr($_, 0, 4)})) {
            $cwords{substr($_, 0, 4)} = 2;
        }
        if (!exists($cwords{substr($_, 0, 6)})) {
            $cwords{substr($_, 0, 6)} = 2;
        }
        if (!exists($cwords{substr($_, 0, 8)})) {
            $cwords{substr($_, 0, 8)} = 2;
        }
    }
}
close(WRDS);
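The lexicon loop stores every entry with value 1 and every shorter even-byte prefix of a longer entry with value 2, so the segmenter can later tell "is a word" apart from "merely starts a word". A compact sketch of the same bookkeeping (Python stand-in for the Perl; the original works on 2-byte GB2312 units, so ordinary characters here play the role of byte pairs):

```python
def add_word(table, word):
    """Mark word as a full lexicon entry (1); mark its proper prefixes
    as continuation hints (2) unless they are already full words."""
    table[word] = 1
    for k in range(2, len(word)):      # proper prefixes of length 2..len-1
        prefix = word[:k]
        if prefix not in table:
            table[prefix] = 2

table = {}
add_word(table, "abcd")
print(table)   # {'abcd': 1, 'ab': 2, 'abc': 2}
```

Because full entries overwrite prefix marks (but never the reverse), adding a short word after a longer one that contains it still leaves the short word with value 1.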
sub addsegword {
    ($_) = shift;
    return if /^\#/;
    $cwords{$_} = 1;
    if (length($_) == 6) {
        if (!exists($cwords{substr($_, 0, 4)})) {
            $cwords{substr($_, 0, 4)} = 2;
        }
    }
    if (length($_) == 8) {
        if (!exists($cwords{substr($_, 0, 4)})) {
            $cwords{substr($_, 0, 4)} = 2;
        }
        if (!exists($cwords{substr($_, 0, 6)})) {
            $cwords{substr($_, 0, 6)} = 2;
        }
    }
    if (length($_) == 10) {
        if (!exists($cwords{substr($_, 0, 4)})) {
            $cwords{substr($_, 0, 4)} = 2;
        }
        if (!exists($cwords{substr($_, 0, 6)})) {
            $cwords{substr($_, 0, 6)} = 2;
        }
        if (!exists($cwords{substr($_, 0, 8)})) {
            $cwords{substr($_, 0, 8)} = 2;
        }
    }
}
# Numbers
$numbers = "O AbAi:d 9-7-O 1 2 3 4 5 6 7 8 9 AM";
for ($n = 0; $n < length($numbers); $n += 2) {
    $cnumbers{substr($numbers, $n, 2)} = 1;
}
# Classifiers (to follow numbers)
$classifier = "- i-, ,LV EfA M WV, 11";
$classifier .= "TT-pq A A-*-tR Mk, R X4Id qmft-IFil Jji -9f ti%";
for ($n = 0; $n < length($classifier); $n += 2) {
    $classifiers{substr($classifier, $n, 2)} = 1;
}
# Wide ASCII words
$wascii = "a b c d e f g h i j k l m n o p q r s t u v w x y z";
$wascii .= "ABCDEFGH I JKLMNOPQRSTUVWXYZ-";
for ($n = 0; $n < length($wascii); $n += 2) {
    $cascii{substr($wascii, $n, 2)} = 1;
}
# Foreign name transliteration characters
$foreign = "iA E R? g gf gr M %I~j4 U Vr 2%3 tg f @ M RDtfH";
$foreign .= "b I;N3-T-31 ga g MWM 1 J+ f gg - g f g eg";
$foreign .= "leng ii iEft EpftL g g";
for ($n = 0; $n < length($foreign); $n += 2) {
    $cforeign{substr($foreign, $n, 2)} = 1;
}
# Chinese surnames
$surname = "?, 1- '";
$surname .= "-Egf, T 'tMM 1 - E fitE9}%UItll5UZ'5 4NME";
$surname .= "V MIJF -R,EM ct fA - *t 0 JTE:";
$surname .= "31-01,r M *P-MQM& UV,AJ5";
$surname .= "-T P H44QAFBW WQR-W";
$surname .= "tM flf *24fENft~Ae VA- 1150";
$uncommonsurname = "4Ji*4J~fl@ 4r";
for ($n = 0; $n < length($surname); $n += 2) {
    $csurname{substr($surname, $n, 2)} = 1;
}
for ($n = 0; $n < length($uncommonsurname); $n += 2) {
    $uncommoncsurname{substr($uncommonsurname, $n, 2)} = 1;
}
# Add in 2 character surnames; also add to lexicon so they'll be segmented as one unit
$csurname{"'tt"} = 1;  $cwords{"tEI1"} = 1;
$csurname{"TJj"} = 1;  $cwords{"TNh"} = 1;
$csurname{"jM"} = 1;   $cwords{"t4Mf"} = 1;
$csurname{"*4"} = 1;   $cwords{"y 4"} = 1;
$csurname{"VXB"} = 1;  $cwords{"RfPB"} = 1;
$csurname{" T"} = 1;   $cwords{"tt±"} = 1;
$csurname{"Ifjt"} = 1; $cwords{"Jr"} = 1;
$csurname{"Tfl"} = 1;  $cwords{"fJ1"} = 1;
$csurname{"=ift"} = 1; $cwords{"=i4M"} = 1;
$csurname{"P e"} = 1;  $cwords{"4 a"} = 1;
$csurname{"itg"} = 1;  $cwords{"iMgg"} = 1;
$punctuation = ", * [ 0 ® [I H -?";
# Not in name
$notname = "";
$notname .= $punctuation;
for ($n = 0; $n < length($notname); $n += 2) {
    $cnotname{substr($notname, $n, 2)} = 1;
}
sub addChineseNames {
    ($tmpline) = @_;
    $tlen = length($tmpline);
    $newline = "";
    for ($m = 0; $m < $tlen; $m++) {
        $tchar = substr($tmpline, $m, 1);
        $currtoken = "";
        if ($tchar =~ /^\s$/) {
            $newline .= $tchar;
        } else {
            while ($tchar !~ /\s/ and $m < $tlen) {
                $currtoken .= $tchar;
                $m++;
                $tchar = substr($tmpline, $m, 1);
            }
            if (defined($csurname{$currtoken}) or
                defined($uncommoncsurname{$currtoken})) { # found a surname, see what follows
                # go past following spaces
                $spaces = "";
                while ($tchar =~ /\s/ and $m < $tlen) {
                    $spaces .= $tchar;
                    $m++;
                    $tchar = substr($tmpline, $m, 1);
                }
                # Get next token
                $currtoken2 = "";
                while ($tchar !~ /\s/ and $m < $tlen) {
                    $currtoken2 .= $tchar;
                    $m++;
                    $tchar = substr($tmpline, $m, 1);
                }
                # go past following spaces
                $spaces2 = "";
                while ($tchar =~ /\s/ and $m < $tlen) {
                    $spaces2 .= $tchar;
                    $m++;
                    $tchar = substr($tmpline, $m, 1);
                }
                # Get next token
                $currtoken3 = "";
                while ($tchar !~ /\s/ and $m < $tlen) {
                    $currtoken3 .= $tchar;
                    $m++;
                    $tchar = substr($tmpline, $m, 1);
                }
                if (isChinese($currtoken2) and (length($currtoken2) == 2) and
                    !defined($cnotname{$currtoken2}) and
                    isChinese($currtoken3) and (length($currtoken3) == 2) and
                    !defined($cnotname{$currtoken3})) {
                    $newline .= $cname[0] . $currtoken . $currtoken2 . $currtoken3 . $cname[1];
                    #$cwords{$currtoken . $currtoken2 . $currtoken3} = 1;
                    #$cwords{$currtoken . $currtoken2} = 2; # short version for checking
                } elsif (isChinese($currtoken2) and (length($currtoken2) == 2)
                         and (!defined($cnotname{$currtoken2}))) {
                    $newline .= $currtoken . $currtoken2 . $spaces2 . $currtoken3;
                    $cwords{$currtoken . $currtoken2} = 1;
                } elsif (defined($csurname{$currtoken}) and
                         isChinese($currtoken2) and (length($currtoken2) == 4) and
                         ($cwords{$currtoken2} != 1) and
                         (!defined($cnotname{$currtoken2}))) {
                    $newline .= $cname[0] . $currtoken . $currtoken2 . $cname[1] . $spaces2 . $currtoken3;
                    $cwords{$currtoken . $currtoken2} = 1;
                    $cwords{$currtoken . substr($currtoken2, 0, 2)} = 2; # short version to check
                } elsif (defined($uncommoncsurname{$currtoken}) and
                         isChinese($currtoken2) and (length($currtoken2) == 4)
                         and (!defined($cnotname{$currtoken2}))
                         and ($cwords{$currtoken2} != 1)) {
                    $newline .= $cname[0] . $currtoken . $currtoken2 . $cname[1] . $spaces2 . $currtoken3;
                    $cwords{$currtoken . $currtoken2} = 1;
                    $cwords{$currtoken . substr($currtoken2, 0, 2)} = 2; # short version to check
                } else {
                    $newline .= $currtoken . $spaces . $currtoken2 . $spaces2 . $currtoken3;
                }
            } else {
                $newline .= $currtoken;
            }
            $m--; # reset so won't skip space
        }
    }
    $newline;
}
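addChineseNames implements a simple heuristic: when a token is a known surname, look at the following token, and if it looks like a given name (Chinese, short, and not on the "not in a name" list) glue it onto the surname as one personal-name unit. A minimal sketch of that decision, with invented stand-in tokens (the real code works on GB2312 byte pairs and also handles the two-token given-name case):

```python
def join_name(tokens, surnames, notname):
    """Merge a known surname with a short following token into one
    name token; leave everything else unchanged."""
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in surnames and i + 1 < len(tokens):
            given = tokens[i + 1]
            # "short" stands in for the 1-2 character given-name test
            if len(given) <= 2 and given not in notname:
                out.append(tok + given)
                i += 2
                continue
        out.append(tok)
        i += 1
    return out

print(join_name(["Wang", "xy", "said"], {"Wang"}, set()))
# -> ['Wangxy', 'said']
```

The notname list is what keeps function words that happen to follow a surname from being absorbed into it.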
#sub cwordstart {
#    my($tword) = @_;
#    if (grep(/^$tword/, @cwordlist) > 0) {
#        return 1;
#    } else {
#        return 0;
#    }
#}
sub isChinese {
    my($cchar) = @_;
    for ($b = 0; $b < length($cchar); $b++) {
        if (unpack("C", substr($cchar, $b, 1)) < 128) {
            return 0;
        }
    }
    return 1;
}
sub allnum {
    ($localnum) = @_;
    # Need this if?
    if ($localnum =~ m/[0-9][0-9]*(\.[0-9]+)?/) {
        return 1;
    }
    for ($k = 0; $k < length($localnum); $k += 2) {
        if (!defined($cnumbers{substr($localnum, $k, 2)})) {
            return 0;
        }
    }
    return 1;
}
sub allnumbers {
    my($localnum) = @_;
    if ($localnum =~ m/[0-9][0-9]*(\.[0-9]+)?/) {
        return 1;
    }
    for ($k = 0; $k < length($localnum); $k += 2) {
        if (!defined($cnumbers{substr($localnum, $k, 2)})) {
            return 0;
        }
    }
    return 1;
}
sub allwascii {
    ($localstr) = @_;
    for ($k = 0; $k < length($localstr); $k += 2) {
        if (!defined($cascii{substr($localstr, $k, 2)})) {
            return 0;
        }
    }
    return 1;
}
sub allforeign {
    ($localstr) = @_;
    for ($k = 0; $k < length($localstr); $k += 2) {
        if (!defined($cforeign{substr($localstr, $k, 2)})) {
            return 0;
        }
    }
    return 1;
}
sub segmentline {
    my($line) = @_;
    $chinaccum = "";
    $outline = "";
    $linelen = length($line);
    for ($i = 0; $i <= $linelen; $i++) {
        $char1 = substr($line, $i, 1);
        if (unpack("C", $char1) > 127) {
            $chinchar = substr($line, $i, 2);
            if ($chinaccum eq "") {
                $outline .= " " unless $i == 0;
                $chinaccum = $chinchar;
            } else {
                if (defined($cwords{$chinaccum . $chinchar}) and
                    $cwords{$chinaccum . $chinchar} == 1) { # is in lexicon
                    $chinaccum .= $chinchar;
                } elsif (allnum($chinaccum) and (defined($cnumbers{$chinchar})
                         or defined($classifiers{$chinchar}))) {
                    $chinaccum .= $chinchar;
                } elsif (allwascii($chinaccum) and defined($cascii{$chinchar})) {
                    $chinaccum .= $chinchar;
                } elsif (allforeign($chinaccum) and defined($cforeign{$chinchar}) and
                         defined($cwords{substr($line, $i, 4)}) and
                         $cwords{substr($line, $i, 4)} != 1 and
                         $cwords{substr($line, $i, 4)} != 2) {
                    $chinaccum .= $chinchar;
                } elsif (defined($cwords{$chinaccum . $chinchar}) and
                         ($cwords{$chinaccum . $chinchar} == 2) and
                         defined($cwords{$chinaccum . $chinchar . substr($line, $i+2, 2)}) and
                         (($cwords{$chinaccum . $chinchar . substr($line, $i+2, 2)} == 1) or
                          ($cwords{$chinaccum . $chinchar . substr($line, $i+2, 2)} == 2))) {
                    # starts a word in the lexicon
                    $chinaccum .= $chinchar;
                } else {
                    $outline .= $chinaccum . " ";
                    $chinaccum = $chinchar; # start anew
                }
            }
            $i++;
        } else { # Plain ascii text, attach any accumulated Chinese and then ascii
            if ($chinaccum ne "") {
                $outline .= $chinaccum . " ";
                $chinaccum = "";
            }
            $outline .= $char1;
        }
    }
    $chinline = addChineseNames($outline);
    $chinline;
}
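segmentline grows a run of Chinese text greedily, extending the current word whenever the next character pair keeps it a lexicon word or a known prefix of one, and emitting a break otherwise. The same maximum-matching idea in a compact forward form (a sketch over ordinary strings with a toy lexicon, not the thesis wordlist or the exact Perl control flow):

```python
def mms_segment(text, lexicon, max_len=8):
    """Greedy forward maximum matching: at each position take the
    longest lexicon word; fall back to a single character."""
    out, i = [], 0
    while i < len(text):
        for k in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + k] in lexicon or k == 1:
                out.append(text[i:i + k])
                i += k
                break
    return out

lex = {"abc", "ab", "cd"}
print(mms_segment("abcd", lex))   # -> ['abc', 'd']
```

Note the characteristic MMS failure mode visible even here: taking the longest match "abc" blocks the alternative segmentation "ab" + "cd", which is exactly why the thesis compares several segmentation algorithms.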
B5
Translating wordlists from Web based dictionary
#!/afs/rcf/lang/bin/perl
srand;
# Build arrays of the words: [english word] [type] [chinese word]
open(INFILE, "< /afs/rcf/project/empty/data/dictionary/chinese/online/ntw.types")
    or die "can't open file";
$i = 0;
while (<INFILE>) {
    chomp;
    if ($_ !~ m/^\;/) {
        $list1{$_} = rand;
        $list2[$i] = $_;
        $i++;
    }
}
@list3 = sort { $list1{$a} <=> $list1{$b} } @list2;
foreach $enword (@list3) {
    print $enword . "\n";
}
# Prompt ciba-online for the Chinese word
sub pipecmd {
    my $cmd = shift(@_);
    if (open(SENDOUT, "|$cmd")) {
        foreach $line (@_) {
            print SENDOUT $line . "\n";
        }
        close SENDOUT;
    } else {
        die;
    }
}
#exit(0);
open(OUT, "> /afs/rcf/project/corporal/chinese/danhu/zh-geo-world-provinces.txt")
    or die "can't open output file";
foreach $enword (@list3) {
    $enword =~ s/^\s+$//;
    $wordlen = length($enword) + 6;
    pipecmd(
        "/afs/rcf/user/jhndrsn/bin/nc webproxyl.mitre.org 80 > tempword.html",
        "POST http://ciba.kingsoft.net/cgi-bin/ecdetail.cgi HTTP/1.0",
        "Referer: http://ciba.kingsoft.net/online/main.html",
        "User-Agent: Mozilla/4.75 [en] (X11; U; SunOS 5.7 sun4u)",
        "Host: ciba.kingsoft.net:80",
        "Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*",
        "Accept-Encoding: gzip",
        "Accept-Language: en-US, en, en-GB, de, zh-CN, zh-TW, cs",
        "Accept-Charset: iso-8859-1,*,utf-8",
        "Content-type: application/x-www-form-urlencoded",
        "Content-length: $wordlen",
        "",
        "terms=$enword");
    $newhtmlfile =
        "/afs/rcf/project/corporal/chinese/danhu/ntwhtmlfiles/" . $enword . ".05082001.html";
    print $newhtmlfile . "\n";
    pipecmd("cp tempword.html $newhtmlfile");
    print $enword . "\n";
    undef($/);
    open(HTML, "< tempword.html") or die "unable to find tempword.html";
    $htmltext = <HTML>;
    close(HTML);
    $chword = extractmeaning($htmltext);
    $wordhash{$enword} = ["$chword", "countries"];
    print $enword . "\n";
    sleep 1;
}
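The script speaks raw HTTP/1.0 through nc rather than using an HTTP library, which is why it computes `$wordlen = length($enword) + 6`: the POST body is the string `terms=<word>`, and `terms=` is exactly 6 characters. A small sketch that builds the equivalent request text (it only constructs the string, it does not open a connection; the URL and header values are copied from the listing):

```python
def build_post_request(word):
    """Build the raw HTTP/1.0 POST the Perl script pipes through nc.
    Content-length is len(word) + 6 because the body is 'terms=<word>'."""
    body = "terms=" + word
    lines = [
        "POST http://ciba.kingsoft.net/cgi-bin/ecdetail.cgi HTTP/1.0",
        "Host: ciba.kingsoft.net:80",
        "Content-type: application/x-www-form-urlencoded",
        "Content-length: %d" % len(body),
        "",                 # blank line separates headers from the body
        body,
    ]
    return "\r\n".join(lines)

print(build_post_request("Beijing").splitlines()[3])
# Content-length: 13
```

Getting Content-length wrong by even one byte makes the server truncate or hang on the body, which is presumably why the script hard-codes the +6 rather than deriving it later.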
# Extract the Chinese word from the junk that is passed back and place it in the array
sub extractmeaning {
    my($mytext) = @_;
    undef($/);
    $mytext =~ s/^.*\[f"dk\]//s;
    if ($mytext =~ m/\(\*5166\|'2ia-2,'e6T\)/) {
        #print "(*5[66|'P6 >2'&6t)\n";
        return 0;
    } else {
        $mytext =~ s/^.*n\.//s;
        $i = 0;
        $mychword = "";
        while ((substr($mytext, $i, 1) ne ",") and
               (substr($mytext, $i, 1) ne ")") and
               (substr($mytext, $i, 1) ne "<")) {
            $mychword .= substr($mytext, $i, 1);
            #print "$mychword\n";
            $i++;
            #print substr($mytext, $i, 1)."\n";
        }
        $mychword =~ s/^\S+//;
        #print "$mychword\n";
        return $mychword;
        #print $mytext;
    }
}
# Print the resulting array however deemed most useful
foreach $enword (keys %wordhash) {
    @array = @{$wordhash{$enword}};
    print $array[0] . " " . $array[1] . "\n";
    printf("%15s%15s%20s\n", $enword, $array[0], $array[1]);
}
Appendix C Sample Score File
C1
Evaluation output of the trained phraser
Parameters presumably in effect when rules were derived:
;;; Host machine on which learner executed (excl::*current-site-name*) = delphi.mitre.org
;;; *RELEVANT-LOCS* = LEFT-2 LEFT-1 WD-ANY WD-SPAN RIGHT-1 RIGHT-2 LWD-1 LWD-2 RWD-1 RWD-2
;;; *EXTERNAL-LOCS* = LEFT-1 LEFT-2 RIGHT-1 RIGHT-2 RIGHT-3 LEFT-3
;;; *COMPLEX-LOC-RESTRICTIONS* = (2 RIGHT-1 RIGHT-2) (2 RIGHT-2 RIGHT-1) (2 LEFT-1 LEFT-2) (2 LEFT-2 LEFT-1) (2 LEFT-1 RIGHT-1) (2 RIGHT-1 LEFT-1) (2 LWD-1 LEFT-1) (2 RWD-1 RIGHT-1)
;;; *COMPLEX-LOC-INCLUSION-PCTS* = (1 1.0) (2 0.5) (3 0.2) (4 0.1) (5 0.1) (6 0.1)
;;; *RELEVANT-KEYS* = NT-CONTRACT-RIGHT-1 NT-CONTRACT-LEFT-1 NT-EXPAND-LEFT-1 NT-EXPAND-RIGHT-1 NT-SEED NT
;;; *IRRELEVANT-KEYS* = NIL
;;; *RELEVANT-ANSWERS* = (:NT-CONTRACT-RIGHT-1 :NT-CONTRACT-LEFT-1 :NT-EXPAND-LEFT-1 :NT-EXPAND-RIGHT-1 :NT-SEED :NT)
;;; *IRRELEVANT-ANSWERS* = NIL
;;; *ACTIVE-RULE-PREDICATES* = VB-CLASS? NN-CLASS? NNP-CLASS? ADV-CLASS? JJ-CLASS? PART-OF-GEO-PHRASE? PART-OF-HUMAN-PHRASE? PART-OF-ORG-PHRASE? PART-OF-CORP-PHRASE? ELSEWHERE-AS-ORG? ELSEWHERE-AS-GEO? ELSEWHERE-AS-HUMAN? ELSEWHERE-AS-CORP? PERSON-WORD? CITIZENSHIP-WORD? DATE-WORD? CARDINAL-WORD? STATE-WORD? GEO-CONTINENT-WORD? GEO-THE-WORD? GEO-MAJOR? GEO-ABBREV? COUNTRY-WORD? ORG-WORD? TITLE-WORD? GAZ-WORD? SPOKESPERSON-WORD? SUBSIDIARY-WORD? CORP-WORD?
;;; *TRAINING-REGIME-NAME* = NIL
;;; *INTERNAL-LOCS* = (:LWD-1 :LWD-2 :RWD-1 :RWD-2)
;;; *RELEVANT-OBJ-TYPES* = (:LEX :POS :PREDICATE :WORD-LIST)
;;; *ALIGNMENT-LOCS* = (:LEFT-1 :LEFT-2 :RIGHT-1 :RIGHT-2)
;;; *SCORING-FUNCTION* = :F-MEASURE
;;; *REFINEMENT-SCORING-FUNCTION* = :YIELD-MINUS-SACRIFICE
;;; *F-MEASURE-BETA* = 0.8
;;; *MAX-RULE-CONDITIONALITY* = 1
;;; *USE-RULE-PREDICATE-INVERSES?* = NIL
;;; *SET-COMPLEX-LOC-RESTRICTIONS-RANDOMLY?* = NIL
;;; *SIMPLE-LOC-RESTRICTIONS* = :*
;;; *PHRASER-LANGUAGE* = NIL
;;; *ALIGNMENT-YIELD-OUTLIERS-FACTOR* = 2
;;; *MISALIGNMENT-RESOLUTION* = 3
;;; *TRAINING-SET-NAME* = "Unspecified"
;;; *UNIVERSE-SIZE* = 300
;;; *INITIAL-PHRASING-TYPE* = :NO-PHRASING
;;; *INITIAL-TEST-SENT* = NIL
;;; *TESTING-SIZE* = 100
;;; *TRAINING-SIZE* = 200
;;; *CACHE-RULE-PREDICATES?* = T
;;; *RESOLVE-MISALIGNMENTS?* = T
;;; *RESOLVE-MISLABELINGS?* = T
;;; *RESOLVE-INITIAL-PHRASING?* = T
;;; *REJECT-REPEATED-RULES?* = 6
;;; *AFFIX-LENGTHS* = (2)
;;; *GENERATE-IGNORE-RULES?* = T
;;; *NORMALIZE-ALL-STRINGS?* = T
;;; *GENERATE-NUMERIC-COMPLEX-OBJECTS?* = NIL
;;; *ALWAYS-PURSUE-REFINEMENT* = T
;;; *ENABLE-GROWING-IN-EXPAND-PHRASE* = NIL
;;; *PHRASES-FROM-PREPROCESSING* = NIL
;;; *EPOCH-DEPENDENT-CODE* = NIL
;;; *RANDOM-TIE-ARBITRATION?* = 50
;;; *RANDOM-LEXEME-SCORING-RATE* = NIL
;;; *RANDOM-SENTENCE-SELECTION-RATE* = 50
;;; *ACTIVE-RULE-PREDICATES-SOURCE* = *ACTIVE-RULE-PREDICATES-FOR-N*
;;; *ORIGINAL-TESTING-SENTS-SOURCE* = "Unspecified"
;;; *ORIGINAL-TRAINING-SENTS-SOURCE* = "Unspecified"
;;; (length *all-sentences*) = 0
;;; (length *training-sentences*) = 16227
;;; (length *original-testing-sents*) = 2641
;;; (length *original-training-sents*) = 16227
Data generated on Thu 3-May-01 at 12:05:23 PM
Applying learned rules to TRAINING SET sentences.

No.   Key   Ans  Corr  Miss  Sprr  R(%)[Lbl]  P(%)[Lbl]  F-measur  Fired
------------------------------------------------------------------------
  0  3665     0     0  3665     0   0.0                             0
  1  3665  3080  1010  2655  2070  27.557981  32.792206    30.53   3080
  2  3665  3033  1065  2600  1968  29.058662  35.11375     32.47    372
  3  3665  3007  1134  2531  1873  30.941338  37.712006    34.74    346
  4  3665  3597  1134  2531  2463  30.941338  31.52627     31.30    590
  5  3665  3591  1225  2440  2366  33.424286  34.11306     33.84    597
  6  3665  3591  1422  2243  2169  38.799458  39.599       39.28    319
  7  3665  3583  1507  2158  2076  41.11869   42.059727    41.69    245
  8  3665  3576  1478  2187  2098  40.32742   41.331097    40.93    502
  9  3665  3575  1556  2109  2019  42.45566   43.524475    43.10    104
 10  3665  3180  1555  2110  1625  42.42838   48.899372    46.15    395
 11  3665  2837  1536  2129  1301  41.90996   54.1417      48.61    343
 12  3665  2977  1536  2129  1441  41.90996   51.59557     47.33    140
 13  3665  2966  1582  2083  1384  43.165077  53.33783     48.85    170
 14  3665  2962  1605  2060  1357  43.792633  54.186363    49.59     55
 15  3665  2962  1659  2006  1303  45.26603   56.009453    51.26     66
 16  3665  2962  1705  1960  1257  46.521145  57.562458    52.68     47
 17  3665  2962  1745  1920  1217  47.612553  58.912895    53.92    178
 18  3665  2952  1742  1923  1210  47.530697  59.01084     53.93    118
 19  3665  2952  1764  1901  1188  48.13097   59.7561      54.61     34
 20  3665  3018  1764  1901  1254  48.13097   58.449303    53.94     66
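The F-measur column combines the R(%) and P(%) columns using the *F-MEASURE-BETA* = 0.8 setting listed in the parameters, i.e. the standard F-beta formula that weights recall slightly below precision. Checking it against iteration 1 and iteration 19 of the training run (values taken from the table):

```python
def f_beta(precision, recall, beta=0.8):
    """F-beta score: weighted harmonic mean of precision and recall.
    beta < 1 gives precision more weight than recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(round(f_beta(32.792206, 27.557981), 2))   # 30.53 (iteration 1)
print(round(f_beta(59.7561, 48.13097), 2))      # 54.61 (iteration 19)
```

Both reproduce the tabulated F-measur values exactly, confirming that the learner's internal score is this beta-weighted F rather than the balanced F1.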
Testing learned rules against independent test set of sentences.

No.   Key   Ans  Corr  Miss  Sprr  R(%)[Lbl]  P(%)[Lbl]  F-measur  Fired
------------------------------------------------------------------------
  0   728     0     0   728     0   0.0                             0
  1   728   523   168   560   355  23.076923  32.12237     27.86    523
  2   728   517   171   557   346  23.48901   33.075436    28.53     53
  3   728   510   180   548   330  24.725275  35.294117    30.25     52
  4   728   608   180   548   428  24.725275  29.605263    27.49     98
  5   728   608   203   525   405  27.884615  33.388157    31.00    109
  6   728   608   231   497   377  31.730768  37.993423    35.28     57
  7   728   608   245   483   363  33.653847  40.296055    37.41     46
  8   728   608   231   497   377  31.730768  37.993423    35.28     99
  9   728   608   246   482   362  33.791206  40.460526    37.57     16
 10   728   554   246   482   308  33.791206  44.40433     39.56     54
 11   728   523   246   482   277  33.791206  47.036327    40.80     31
 12   728   542   246   482   296  33.791206  45.38745     40.03     19
 13   728   540   253   475   287  34.752747  46.851852    41.25     20
 14   728   540   254   474   286  34.89011   47.037037    41.41      2
 15   728   540   257   471   283  35.302197  47.59259     41.90      4
 16   728   540   260   468   280  35.714287  48.148148    42.39      6
 17   728   540   265   463   275  36.4011    49.074074    43.20     40
 18   728   539   263   465   276  36.126373  48.794064    42.92     25
 19   728   539   268   460   271  36.813187  49.721706    43.74      6
 20   728   549   268   460   281  36.813187  48.81603     43.31     20
C2
Evaluation output of the manual phraser

* SUMMARY SCORES *

                 POS   ACT |  COR  PAR  INC |  MIS  SPU  NON |  REC  PRE  UND  OVG  SUB  ERR
enamex
  organization   801   934 |  684    0    0 |  117  250    0 |   85   73   15   27    0   35
  person           0     0 |    0    0    0 |    0    0    0 |    0    0    0    0    0    0
  location         0     0 |    0    0    0 |    0    0    0 |    0    0    0    0    0    0
  other            0     0 |    0    0    0 |    0    0    0 |    0    0    0    0    0    0
timex
  date             0     0 |    0    0    0 |    0    0    0 |    0    0    0    0    0    0
  time             0     0 |    0    0    0 |    0    0    0 |    0    0    0    0    0    0
  other            0     0 |    0    0    0 |    0    0    0 |    0    0    0    0    0    0
numex
  money            0     0 |    0    0    0 |    0    0    0 |    0    0    0    0    0    0
  percent          0     0 |    0    0    0 |    0    0    0 |    0    0    0    0    0    0
  other            0     0 |    0    0    0 |    0    0    0 |    0    0    0    0    0    0

OBJ SCORES
enamex           801   934 |  684    0    0 |  117  250    0 |   85   73   15   27    0   35
numex              0     0 |    0    0    0 |    0    0    0 |    0    0    0    0    0    0
timex              0     0 |    0    0    0 |    0    0    0 |    0    0    0    0    0    0

SLOT SCORES
enamex
  text           801   934 |  608    0   76 |  117  250    0 |   76   65   15   27   11   42
  type           801   934 |  684    0    0 |  117  250    0 |   85   73   15   27    0   35
  status           0     0 |    0    0    0 |    0    0    0 |    0    0    0    0    0    0
  alt              0     0 |    0    0    0 |    0    0    0 |    0    0    0    0    0    0
numex (text, type, status, alt)    all 0
timex (text, type, status, alt)    all 0

SECT SCORES
doc               10

ALL SLOTS       1602  1868 | 1292    0   76 |  234  500    0 |   81   69   15   27    6   39

SUBTASK SCORES / F-MEASURES
P&R 74.47    2P&R 71.19    P&2R 78.06
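The C2 summary columns follow the usual MUC scoring definitions: REC = COR/POS, PRE = COR/ACT, UND = MIS/POS, OVG = SPU/ACT, SUB = INC/(COR+INC), and ERR = (INC+MIS+SPU)/(COR+INC+MIS+SPU). Recomputing the enamex rows from the counts that survive in the report, as a cross-check (a sketch of the metric arithmetic only, not MITRE's scorer):

```python
def muc_scores(pos, act, cor, inc, mis, spu):
    """Per-row MUC metrics, rounded to whole percent as in the report."""
    rec = 100.0 * cor / pos
    pre = 100.0 * cor / act
    und = 100.0 * mis / pos
    ovg = 100.0 * spu / act
    sub = 100.0 * inc / (cor + inc) if cor + inc else 0.0
    err = 100.0 * (inc + mis + spu) / (cor + inc + mis + spu)
    return [round(x) for x in (rec, pre, und, ovg, sub, err)]

# enamex type slot: POS 801, ACT 934, COR 684, INC 0, MIS 117, SPU 250
print(muc_scores(801, 934, 684, 0, 117, 250))   # [85, 73, 15, 27, 0, 35]
```

The same function applied to the text slot (COR 608, INC 76) reproduces the 76/65/15/27/11/42 row, so the two slot rows and the ALL SLOTS totals are mutually consistent.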
Appendix D Supplemental Wordlists for Phraser
D1
Surname list
[surname characters not legible in this copy]
D2
Foreign phonetic translation characters
[transliteration characters not legible in this copy]
D3
Organization and Location Suffixes
Org    Org    Org    Loc    Loc    Loc
[suffix characters not legible in this copy]
Appendix E Sample Training Data and Other Pre-tagged Corpus
E1
University of Pennsylvania Chinese Treebank project
<DOC>
<DOCID> CB302002.BFY 1094 </DOCID>
<HEADER>
<DATE> 1994-09-02 09:31:23 3B:spcOl 1 </DATE>
</HEADER>
<BODY>
<HEADLINE>
((IP-HLN (NP-SBJ (NP-PN (NR rP)) (NP (NN RK)) (NN Wf) (NN *Ak)))
</HEADLINE>
<HEADLINE>
((FRAG (NR
$1j4P±)(NR 1-i) (NT 9 ,q) (NT 2 H) (NN ,) (NN 9) (PU :)
*R) (VV %
</HEADLINE>
<HEADLINE>
h Pa))
(PU , ) (NN MJZ-)
(NR -EfJ,)
))
(NP (NN
RNA)
(PU,
(ADVP (AD E)) (VP (VV
EM4)))
(DEG Vj)) (NP (NN *11)))
1K5r))
(CC
(NN
Wf) (NN
$R))) (VP (VP (VV
M()) (QP (CD T) (CLP (M 4))) (NP (NN
2))) (PU. ))) </S> <S> ( (IP (NP-SBJ (DNP
%(
(NP-PN (NR
jj
(NR rPM) (NN
) (NN
N))) (NP
(VV
3(*2))
RA
) (NN
W)
(NN
)
( (FRAG (NR V%$$1) (NN i2t)
</HEADLINE>
<TEDNPT>
<P>
<S>
((IP (NP-SBJ (NP-PN (NR rf))
(NN rj)
(VP (VV
&)
(NP (NP (NN -M)) (NP-PN (NR
(PU,
) (AS -) (NP-OBJ (NP-APP
Fr&)))) (PU,
) (VP (NP-TMP (NT
(NP-PN (QP (OD MA) (CLP (M
Tii-)) (NP (NN 1[K)))) (NP (NN
) (VP (ADVP (AD )j)) (ADVP (AD i))
RK))
(NN
(VP (PP-BNF (P -) (NP (PN 4t))) (VP
A) (NP-OBJ (CP (WHNP-1 (-NONE- *OP*)) (CP (IP (NP-SBJ (-NONE- *T*-1)) (VP (VA f))) (DEC Ij))) (NP (NN
CIM)))) (PU .
</P>
) </S>
<P>
<S>
((IP (DP-SBJ (DNP (NP (NN
RK))
(NN fV) (NN *2I)
(NN
RM)) (DEG
Rj)) (DP (DT t) (QP (CD -) (CLP (M t)))))
(VP (ADVP (AD 9)) (VP (VV ?-FTf) (NP-OBJ (DNP (NP (NN JZ) (CC f[a) (NN KJf)) (DEG V)) (NP (NN
j V)))) (PU ,
</P>
ML)
(NN
))</S>
<P>
<S>
((IP (LCP-TMP (IP (NP-PN-SBJ (NR P$) (NN AR) (NN AJf[])) (VP (VV
%))
p)~Qj))
(LC ,)7}))
(NP (NN lip))) (VP (ADVP (AD qN)fl) (VP (VV L)) (NP-OBJ (NP (NN ?±h)(NN
?All)
(NN
(PU,
(NN -zlm) (PU, ) (NN *X)) (NP (NN
)
(PU, )) ) </S>
<S>
((IP (IP (NP-SBJ (NN jgfil) (NN TE-)) (VP (VV f)F-) (NP-OBJ (NP-APP (NN y-M) (NN f4J) (ETC
i4))))) (PU,
) (IP (NP-SBJ (NN
S4t)) (VP (VV
9Ti) (NP-OBJ (NN f:lt))))
(PU,
) (NP-SBJ (DP (DT
51)) (CC
[) (NP (NP
%)) (NP
(NN
)) ) </S>
<S>
((IP (PP (P f yt) (NP (DNP (NP (NN
RNA) (NN f4z)
RAy)
(NN -$R))(VP (VV
(NN *2N))
(DEG C,)) (QP (CD -))
#f4M) (VP (VV 4W)))
(PU,
(NP (NN nf4)))) (PU , ) (NP-SBJ (NN
)) ) </S>
</P>
<P>
<S>
((IP (NP-SBJ (NN r*))
(VP (DP-TMP (DT Z) (CLP (M *)))
(ADVP (AD a)) (VP (VV
SBJ (-NONE- -PRO*)) (VP (VV Mf) (PP (P -f)(IP (NP-SBJ (-NONE- *pro*)) (VP (VV Jff)
M)
(NP-OBJ (NN
(NP-OBJ (NN
#;))(IP (NPRWA) (NN
</S>
<S>
((IP (NP-SBJ (DNP (ADJP (JJ k~tI)) (DEG 1J)) (NP (NN
Et4 (It) (NN /fjg))) (VP (VV #Fe)
HMA) (NN
(NP-OBJ (NN *,))))(PU,
ff)
(NN jt)))
(VP (PP-TMP (P -T) (NP (NT
)))
</S>
<S>
((IP (NP-SBJ (NN
</S>
RM)) (VP (VP (VV fi)
) (VP (VV IRh) (NP-OBJ (NN -JW
(NP-OBJ (NN 1%j%))) (PU,
)))) (PU,
)))
<S>
((IP (NP-SBJ (NN R&X) (NN f$)
(NN WI)) (VP (ADVP (AD AJ)) (VP (VV
iRK)))
(PU.
)) ) </S>
</P>
</TEDNPT>
</BODY>
</DOC>
E2
Beijing University Institute of Computational Linguistics People's Daily project
19980101-01-001-001/m 32 t)/v 3fj/v tT/n 1'A/u V/a ft&2/n
-/w -tNtRTE/t
fj*f/t i/t4/n
(/w [J/v R]1)t/n
1 /m I/q ) /w
19980101-01-001-002/m *rpj$%/nt jip,
iF/n , /w M*/n
/n
/r
/r
(/w -)L)uL-L*/t +---^ j /t ---- H/t ) /w
19980101-01-001-004/m 1 2fl/t 3 1 H/t , /w tMr%nt ,4 jE/n , /w M
/n Mj:/n E/nr fMi/nr "Q/v
1 9 9 8*/t f */t f~tij/n (/w )E j/v 3fr/v tM/n
j/u f/a ft2/n ) /w , /w (/w WjRp/nt jE2/n M/nr
fEf/nr fiNg ) /w
19980101-01-001-005/m Fvl 2/n fiD/k , /w NM /n fl/k , /w ±in ffl/k , /w tt/n
flF/k : /w
19980101-01-001-006/m 7±/p 1 9 9 8 -/t *1/v 2VTh/f , /w V/r +±4/m Aj'/a M/u 3Aj4/p [rpq/n Af/n
yf/vn *9/n]nt , /w [tp/ns E[j/n T-0/vn t-/n]nt ff/c [tI/n 4tWP/n]nt, /w
rj/p A [/n R/r AR/n
/w Frq/p [WA/ns J IJ/a fiMI/n]ns rvn/n , /w irt-/ns fI/c
I-/ns r[ll /n
/w if-I/s inh/n , /w r',/p
ftW/n &-f/r ro/u M /n i1/k , /w RJ;A/v i)J/a 0j/u FnoR/vn fa/c af/a nJ/u $5-/vn ! /w
19980101-01-001-007/m 1 9 9 7 F/t , /w i/v 't
/ns RI/vn M t/n I/f 4F,/d _I 7/a ftj/u JR/d T/d TYL/a
1Yj/u -/m */q , /w tF[/ns A
/n ZlC/d J*/v y$/nr /lA/nr MrIC/n OJ/u Mi$/n , /w g/v fp/p EAJ/v ;/v
19980101-01-001-003/m
1P/ns J$1t/n
±-1_X/n *L/n *Fr /v (i3 /v
/w [r'F/ns i[jj-/n]nt Rpftj/ad Fkk/v Xf/p Wt/ns fif*/v it7/n
"/w -MM$1i/j
"/w
, /w "/w iAA/I "/w , /w A. q/d lfJ/v j/u #J-/n Rflf/v W /ns jR/u
/w #/c ,M/p
%50/an Mt/an , /w [t[/ns JUM/n]nt frfj/a it/u Bf+/v -/u
±+H/m R/q -M,-/n
, /w A#/v
E/J\TM/n f1 /a )54P/n , /w ,
f/v -i/m gjt/n , /w
RT/v fj/a 0J/u t22/n , /w $ij]/v T/u 4ifins
,
P-/v ft 2/n Q-)/v /u ffj /vn %q))/n . /w
19980101-01-001-008/m r/p Z/r -/m */q r/f , /w r11g/ns Cj/u Ekg/vn f){/vn fo/c 3JftJL/vn lti/vn 9Vq/v
Fjglij/v
LA/v, /w NRfQV/n fM
/v T/u "/w A/a it-K/vn , /w f/a AM/j"/w WJ/u AR/a &M/vn 3!/n , /w
'K/n tfj,/vn XR/d RJf/v f/a 0j/u 4}(y /n , /w LVn 2kT*/vn ?[AK/v RjL/v , /w )
/n tit/vn 3A-t/d
21*/v . /w x4$l/vn /,Mj/n 1J*/n 6-fF/vn 4/c Zjp{/vn TF/d If'-/v . /w R]5/a f~iJ/n Ej/vn
, /w
M$$ZAn-/n xLt/vn fU/c ft/r
-Ji/r k/n X/d ;/v Wj/a 0/u MA/vn , /w
fill/r +4/m MjI/v fRA/t -t/m
RTfl/n
- /m *%/n fo/c Jtf[/n
t/v In/u
&M/n 3/n , /w
ffl/r tfff/v AA/p A /r M*/n @f/c *Il-/n
j/u rhf/an J;) !/c qM/v aj/u MjT,/n 6ff/vn , /w Mt /n -/v At'/d 4IJ/v WV(/vn , /w M j~iM/c, /w
rP/ns A*/v fn/c &)F/v( ,/u
JP/ E /v Rf4/v Tf/u V, /an . /w
19980101-01-001-009/m t/p A/r -/m */q r/f , /w tg/ns fj/u +/n i -/vn
W& /v T /u *3/a jj%/n , /w
A Lf/p A.jG/n T-hj/v , /w 't'H/ns 4/p M/ns , /w fRf/ns
/w AM/ns , /w H;/ns N/u kI/n Tf /v
T/u yRj/n y ./-/n k*/t &jR/v 0j/u
& 4RPM
/
O/u
i&)/vn , /w
A't/r
fi2/n 8
*-/a 6fF/vn iiT- /d
#jt/n , /w rPM/ns 4/p ThLJi/n M*/n fn/c r'/b
h)nd/v. 1w *R/ns VUR/ad *-/v []E/j /j
M/j /n]nt 0j/u
VT,/n f5/c Th:±/vn
lhP/v T/u -9/ns -/w r/j H/j 0f/j Wf/c 4M/ns -/w ,:/ns -o/n 4IE-/b -u4/vn , /w
/vn , /w
f-v fT/n 4/c Rt/v f/u EtIft/n MN/n , /w 1Ills/v ftW/n j rq/v : ,tRJE/v In/u
+M/n , /w xI4P/p
QjA/v NBjT/n
$±,/n fj/u &cf/a -ftff/vn 1/c )MF/b l/vn ffh/v
-f/u
111R/a 1Y/u
af/n
/w
1t11/u Rffj/v f/a @j/u )tkR/n . /w
19980101-01-001-010/m 1 9 9 8 */t , /w 4i /ns )AN/n V/d jf{/I
ffTlt/n /u
3v
/ /c 1fl/r :'F/p 4/3/n 4k/n c)/v 4i/f a/d Aj;[,/v Tj.//m WE/an , /w fR/c I{1/r
2M/v 9+4k/v Ti/a 2 0/m */q */f VM/v fjj/u {Mt/a p9±/n fo/c 17/v WJ/u 4 /a H/n
11,/vn , /w /v
, /w j/d W/v -+9t1/r j/u ;ft/r fIll/a *{4/n , /w ,jf/r -Z/d f /v -Jl/v ALt/r fI/an , /w lgg/v
/
/w 9111/v tKA1/n , /w ffCAV/1 , /w L/v
+/i,/w
g3I/I , /w Rr/c Rfj-l/r A-9/d E4
4Yn h1-tn jlu JM/n A/c l/v W/d W/v X/d ?tr/a . /w
t'ns
19980101-01-001-011/m 30N/v RX/n 1Y/u lj/a Y-/vn , /w j/v r$in/s ±/ /n 4i/ns A/n
W/v
j/u
Mi4j/b
{
/n /w
AiA/p q/j X/j 9V)/n W/u 9Tfh/vn 1/c 4R/an , /w Vj5/p "/w -FM$1j/j "/w
7ft/n f/c frK]/ns
/d %, /v PrVi/ad 5WJN/v . /w
(/w 4 :A/n ) /w , /w 1 9 9 9 4/t 1 2 fl/t jq /ns Wj/u H )-/vn 19980101-01-001-012/m
gns i/v 4i[/ns fi±/n T+5)-Wj/ O/u -/m %f/n . /w -/v ?fl/n p-/vn ,w
M/v AMTrg/i
G-L/ns
/w "/w
tY/fn
/c
,
A 4/n
/j "/w
$111+/v "/w A/m -'j/q rtR/ns "/w ,w "/w --4f4/n
I4/ns
fi/n
J7)/p
,9R/v
.
/w
tT/v
W5/d
jjt/v
3/v
"/w
0j/u
Ni*/n
,
/w
Jt/v
/w PiL'fr-Ij'/I 1w /w -/r cii f/v
/v , /w */v fh/v 11k/n , /w 4Mf/v
f4*/vn , /w QA/v At/n EAf/ad
I1/v Oj/u
R-/d Ft //v ThIjF/r
4;/a C/u ffj-i/vn , /w *4 1/v AjM/n fiP/n 4Th/n 4Zjif/vn
A/v , /w AM/v , /w AA/v j/u * H/d 4 !/v , /w #/c
/n/m rti/ns fj/u W_)Q]/n -F/f MjW/n Aff/v 4$1j/vn In/u /P1V/a ufSF/vn
E/p --
/w
19980101-01-001-013/m
4fi/v -A4Z/n , /w H /d
jtj9/a 0j/u ffn/n
jj/n I$*/vn
,
Wjl/n @/u tR/vn -N19±/v 9194/n f)IA/n . /w /A/c,
/w IEI/d M/p 6jf/r
/w pJL/f 14'1/f (9,j/u 2;9±/n ?Vpp/n 4914/v #t1/v ; /w 9.1 1/n 9145/vn t9±//
A 4/vn,
WT/a
4R/n
/w HV1r)]JP/i O/u
/w tfW/n AI/d T/d
/w T/d 'iIE/a T/d
rt/p *711/a 0/u
; /w &RFVt1+9/
t/a /u H6/a Wj/u MPTl,/n i&k9/n /M1/n #F1/n a/d &3/Wv Vr/a /vn
Rj/vn r1/f 0/d
-T/v 15y/n 1iib9±/n ; /w Aj,/n Wj/u h9/vn 4/c &H/vna/d 1ilr/v ##f1/q
MFT1/n P±9M/n
9 19/vn 1/c V1a/vn . /w f 111/n 4/c k)H/vn W/u 4r9/n I/v 41//a Wj/u , /w 2 1/m ft9//n /i /d I/v 9±V/v
tT/n O/u
±/H/n . /w f9/c 49±/v 4/u
&M/n T/d W/v t/d T/d 4/v -4*1LF)A/i , /w 491 /n 9±/v 9±/n
FT1/n i6/n
N /v /IE/a Ag/a 4/u
- /r
Ak/n f./v A-1/d R/ /a La*/v , /w J±M19/d 4)j/v * j/d
9±9h/n 9/a #F/n , /w
19980101-01-001-014/m [liM/ns RIf/n]nt M/d /914/v 9±4/v #ff/v jtA1 /i W/u 114/n A34/n 1+M/n , /w
H/p 1149/I
E/m
j /q W+31111/n 4/u
Ma/n
_f/f
9±,/i/ad QM/v 91/p (±W/n
Wl/r 4/u
f/a 44/n . /w
rPM/ns 9±9±/v huN9/v .1/p J(9+9±/nt f/c 44/r 9T/n /l94/n 4/u t),ii1/vn , /w QA/v 9E/p /t)/v 9±9±/j f4/n
3PL/vn , /w R4P/v qfj/n , /w r [;,/v ,±F/an , /w /44/v +11+/n J1W1/vn 4/u X+1/n 4/u +11+/n -6fM/vn . /w
rli9/ns 49±/d 9±/v /91P/v ftW/n f 111/n 4/c V±9±/an 4/u 9±9±/a 92/n , /w r+1/ns Ak/n 9±/v 4/p ftw/n
K+1/r \,/n -9/d , /w Y1/p 9f+aj/v rrX/a fqT/n , /w 4911/d 9M/v 4/u 9±/a 9±/H/n i/i/c Ttv 9±/
! /w
19980101-01-001-015/m 9±/p A/r n±194r// 4/u /ff/a 64/ilJ/n , /w R/r ?R/v t*/r V4:-/t 1t$19/a , /w *bl/n
4/ 9/a
! /w
19980101-01-001-016/m 9i+9/v
E3
! /w (/w T±P/$/nt JL9±/ns 1 2f-/t 3 1 H/t */n ) /w
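The sample above follows the PKU convention of suffixing every token with a slash and a part-of-speech tag, and the frequency tables E4 and E5 later in this appendix tally such tokens by tag and word length. A minimal sketch of both steps; the sample line is invented for illustration, since the Chinese text of the excerpt did not survive scanning:

```python
from collections import Counter, defaultdict

def parse_pku_line(line):
    """Split a PKU-style tagged line into (word, tag) pairs.

    Each token has the form word/tag; rsplit on the last '/' so a
    word that itself contains '/' keeps its text intact.
    """
    pairs = []
    for token in line.split():
        if "/" not in token:
            continue  # skip stray fragments
        word, tag = token.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs

def length_by_pos(pairs):
    """Tally word length (in characters) per POS tag, as in the
    frequency tables E4 and E5 below."""
    table = defaultdict(Counter)
    for word, tag in pairs:
        table[tag][len(word)] += 1
    return table

# Hypothetical line in the same word/tag format as the appendix data.
pairs = parse_pku_line("19980101-01-001-009/m one/m country/n ,/w two/m systems/n ./w")
print(length_by_pos(pairs)["n"])
```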
MET 2 and BBN data
<DOC>
<METID> 001 </METID>
<ID> raw/950715/tg/gnxw0718.15 </ID>
<DATE> 1995 07 18 </DATE>
<HL> [headline illegible in scan] </HL>
<AU> Domestic News Dept. </AU>
<TEXT>
<p>
[Body text: a Chinese news story annotated with ENAMEX (TYPE "PERSON", "LOCATION", "ORGANIZATION"), TIMEX (TYPE "DATE"), and NUMEX (TYPE "MONEY", "PERCENT") elements. The Chinese characters of this example did not survive scanning and are omitted here.]
</TEXT>
</DOC>
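The MET-2 markup above encloses each name in an ENAMEX, TIMEX, or NUMEX element whose TYPE attribute carries the category (PERSON, LOCATION, ORGANIZATION, DATE, MONEY, PERCENT). A short sketch of recovering those spans with a regular expression; the sample string is invented, since the real data is Chinese:

```python
import re

# Matches <ENAMEX TYPE="...">...</ENAMEX> and the TIMEX/NUMEX variants;
# the backreference \1 keeps open and close tags paired.
TAG_RE = re.compile(
    r'<(ENAMEX|TIMEX|NUMEX)\s+TYPE="([^"]+)">(.*?)</\1>', re.S
)

def extract_names(sgml):
    """Return (element, TYPE, text) triples for every tagged span."""
    return [(m.group(1), m.group(2), m.group(3)) for m in TAG_RE.finditer(sgml)]

sample = ('<ENAMEX TYPE="LOCATION">Beijing</ENAMEX> hosted '
          '<ENAMEX TYPE="ORGANIZATION">Xinhua</ENAMEX> on '
          '<TIMEX TYPE="DATE">July 18</TIMEX>.')
print(extract_names(sample))
```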
E4
Frequency data from PKU People's Daily project
There are 337724 words of the length 1.
There are 492879 words of the length 2.
There are 51116 words of the length 3.
There are 19375 words of the length 4.
There are 5059 words of the length 5.

[One row label in the table below was illegible in the scan and appears as "?".]

POS     Total   1 Char  2 Char  3 Char  4 Char  Other
Ag        321     321       0       0       0      0
na          1       0       0       1       0      0
Mg          7       7       0       0       0      0
?           1       0       1       0       0      0
vd        528      17     501      10       0      0
nr      35262   18537   13927    1415     623    760
Yg          1       1       0       0       0      0
a       34473   12834   21577      54       8      0
ns      27890       0   18584    7297    1354    655
nt       3573       0       7    3118     416     32
b        8715    1913    6120     646      26     10
c       25479   18923    6228     176     152      0
d       47811   29845   16846    1117       3      0
vn      42734     127   41747     857       3      0
e          27      23       4       0       0      0
ad       5933     496    5435       0       2      0
nx        452     112      49     133      50    108
f       17201   12627    4499      75       0      0
h          47      47       0       0       0      0
nz       3700       0    2090    1034     437    139
i        5033       0       0       8    4966     59
Bg          8       8       0       0       0      0
j       10323    3723    3896    2169     518     17
k         928     928       0       0       0      0
Dg        135     135       0       0       0      0
l        6015       0      26     686    5104    199
n      236810   16846  186882   26800    5258   1024
an       2837      68    2769       0       0      0
o          70       9      51       1       9      0
p       39925   36339    3585       1       0      0
q       24244   22048    1909     235      40     12
r       32327   13800   18395     132       0      0
s        3836       0    3655     179       2      0
t       20680       0   14019    3579     177   2905
Ng       4562    4562       0       0       0      0
u       74829   74363     466       0       0      0
v      184764   64953  118616    1176      17      2
y        1889    1857      32       0       0      0
Rg         10      10       0       0       0      0
z        1391       0     963     217     210      1
Tg        496     496       0       0       0      0
Vg       1749    1749       0       0       0      0

E5
Frequency data from U. of Penn. Treebank project
There are 27187 words of the length 1.
There are 47216 words of the length 2.
There are 6330 words of the length 3.
There are 857 words of the length 4.
There are 468 words of the length 5.

POS     Total   1 Char  2 Char  3 Char  4 Char  Other
NR       7664     558    4555    2077     314    160
NT       2651      14    1783     378     121    355
SP         17      15       2       0       0      0
BA        112     112       0       0       0      0
AD       4980    2668    2120     182      10      0
CC       1974    1852     122       0       0      0
CD        239       4     177      27       1     30
DEC      2252    2252       0       0       0      0
VV      12170    2424    9516      90     137      3
M        3226    2692     494      25      15      0
LB         23      23       0       0       0      0
LC       1365    1095     270       0       0      0
DEG      2227    2227       0       0       0      0
MSP       150     142       8       0       0      0
P        3879    3493     382       4       0      0
OD         29      29       0       0       0      0
JJ       3271     788    2289     189       5      0
AS        961     961       0       0       0      0
SB         43      43       0       0       0      0
CS         65      14      51       0       0      0
VA       1155     345     754       6      50      0
DT       1372    1219     153       0       0      0
VC        731     731       0       0       0      0
VE        391     379      12       0       0      0
DER         4       4       0       0       0      0
ETC       343     341       2       0       0      0
NN      29890    2099   24226    3351     204     10
FW          8       4       3       1       0      0
PN        864     567     297       0       0      0
DEV        92      92       0       0       0      0

E6
Organization part-of-speech results from People's Daily

Count   POS sequence
337     nz n
423     n
158     ns n n n
130     ns ns n
216     ns j
117     n n n
81      nt j
92      ns n vn n
1364    ns n
57      n n n n
160     nt n
89      ns nt
60      ns ns n n
130     j n
64      ns vn n
70      ns nz n n
2487    nt
135     ns nz n
576     ns n n
342     n j
69      nz n n

E7
Phrase Tagged People's Daily Data (using manually generated rules)
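The phrase-tagged data below was produced by manually generated rules, and Table E6 above lists the part-of-speech sequences that most often make up organization names (a place name ns followed by nouns, a lone nt, and so on). A toy version of one such rule as a scan over (word, tag) pairs; the pattern set and sample tokens are illustrative assumptions, not the actual rules used:

```python
def chunk_orgs(tokens):
    """Group a leading ns/nz/j token followed by n/vn tokens, or a
    lone nt token, into candidate ORGANIZATION chunks (a toy version
    of the manually generated rules)."""
    chunks, i = [], 0
    while i < len(tokens):
        word, tag = tokens[i]
        if tag == "nt":                      # full org name as one token
            chunks.append([word])
            i += 1
        elif tag in ("ns", "nz", "j"):       # place name / proper noun / abbreviation head
            j = i + 1
            while j < len(tokens) and tokens[j][1] in ("n", "vn"):
                j += 1
            if j > i + 1:                    # require at least one trailing noun
                chunks.append([w for w, _ in tokens[i:j]])
            i = j if j > i + 1 else i + 1
        else:
            i += 1
    return chunks

# Hypothetical English stand-ins for Chinese tokens.
tokens = [("Beijing", "ns"), ("University", "n"), ("said", "v"),
          ("StateCouncil", "nt")]
print(chunk_orgs(tokens))  # -> [['Beijing', 'University'], ['StateCouncil']]
```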
<DOC>
<HEADER>
<DATE>19980101-01-001</DATE>
Text originally from People's Daily 1998
Segmentation and POS tagging done by PKU ICL
Modification and LEX tagging done by Daniel Hu at MITRE NLP
April 2001
TAGSET USED: PKU
</HEADER>
<BODY>
<TXT>
[Body text: the 1 January 1998 People's Daily editorial, divided into paragraph and sentence units, with ENAMEX TYPE="ORGANIZATION" phrase tags inserted by the manually generated rules. The Chinese characters of this example did not survive scanning and are omitted here.]
</TXT>
</BODY>
</DOC>
E8
Correctly tagged People's Daily data
<DOC>
<HEADER>
<DATE>19980101-01-001</DATE>
Text originally from People's Daily 1998
Segmentation and POS tagging done by PKU ICL
Modification and LEX tagging done by Daniel Hu at MITRE NLP
April 2001
TAGSET USED: PKU
</HEADER>
<BODY>
<TXT>
[Body text: the same 1 January 1998 People's Daily editorial with hand-corrected ENAMEX TYPE="ORGANIZATION" tags. The Chinese characters of this example did not survive scanning and are omitted here.]
</TXT>
</BODY>
</DOC>
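Hand-corrected data such as E8 serves as a gold standard against which rule-tagged output such as E7 can be scored. A minimal precision/recall/F1 sketch over sets of (TYPE, text) name spans; the spans below are invented for illustration:

```python
def prf(gold, predicted):
    """Precision, recall, and F1 over sets of (TYPE, text) name spans,
    comparing rule output against hand-corrected reference data."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Invented spans: one of two predictions is correct, one of two gold
# spans is found.
gold = [("ORGANIZATION", "StateCouncil"), ("ORGANIZATION", "PeoplesDaily")]
pred = [("ORGANIZATION", "StateCouncil"), ("LOCATION", "Beijing")]
print(prf(gold, pred))  # -> (0.5, 0.5, 0.5)
```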