Development of Part of Speech Tagging and Syntactic
Analysis Software for Chinese Text
by
Daniel Tianhang Hu
Submitted to the Department of Electrical Engineering and Computer Science in
Partial Fulfillment of the Requirements of the Degrees of
Bachelor of Science in Electrical Engineering and Computer Science
and
Master of Engineering in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
May 2001
©Daniel T. Hu. All rights reserved.
The author hereby grants to MIT the permission to distribute publicly paper and
electronic copies of this thesis and to grant others the right to do so.
Author
Department of Electrical Engineering and Computer Science
May 29th, 2001
Certified by
Dr. Amar Gupta
Co-Director, PROFIT Initiative
Thesis Supervisor
May 29th, 2001
Accepted by
Arthur C. Smith
Chairman, Departmental Committee on Graduate Theses
Development of Part of Speech Tagging and Syntactic Analysis Software for Chinese Text
5/29/01
Development of Part of Speech Tagging and Syntactic Analysis Software for
Chinese Text
By
Daniel Tianhang Hu
Submitted to the Department of Electrical Engineering and Computer Science in Partial
Fulfillment of the Requirements of the Degrees of
Bachelor of Science in Electrical Engineering and Computer Science
and
Master of Engineering in Electrical Engineering and Computer Science
Abstract
The goal of this project is to complete initial development of Chinese Automatic
Content Extraction software. This Chinese processing system is integrated with the
Alembic system (MITRE's in-house NLP software for text processing), and has complete
segmentation, part of speech tagging, and named entity detection capabilities.
First phase development involves establishing segmentation and part of speech
tagging capabilities. Three different segmentation algorithms are implemented and
integrated with the Alembic system, and one off-the-shelf package is tested and
described. Different segmentation and part of speech guidelines are explored. Efforts
include compiling different part of speech wordlists, segmentation wordlists, and
supplementary wordlists to aid in segmentation, part of speech tagging, and machine
training. Results are on par with industry standards. We found that replacement of the
wordlists and resources within the existing segmentation/POS system can result in
significant improvements. State-of-the-art software from external sources is shown to
work well with the Alembic environment. Should we choose not to develop our own
segmenter, a license for OTS software can be readily obtained.
Second phase development involves named entity detection. Training is done with
pre-tagged, cleaned, and re-formatted Chinese newswire data. While training shows
promise, higher performance is quickly achieved through manual rule generation based
on analysis of entity POS formations and additional specialized wordlists.
Throughout the development effort, guidelines are tailored to better meet the needs of
Entity Detection and Tracking, and Relations Detection and Characterization.
Thesis Supervisor:
Dr. Amar Gupta
Co-Director PROFIT Initiative
Massachusetts Institute of Technology
Author: Daniel Hu
MIT-EECS
Page 1
Acknowledgement
I would like to thank Dr. Amar Gupta for providing this opportunity to work with
the MITRE Corporation. His support has been unwavering for the past year. Dr. Doug
Norman extended me a helping hand at the most critical of all moments, and I am
forever grateful to him for his belief in me.
I would also like to thank Dr. David Day and Dr. Marc Vilain for guidance from day
one of this project. Their juggling act between numerous projects, meetings, and
hours with me over the computer is unbelievable. I also want to thank Dr. Qian Hu for
helping me in every detail of the project and the thesis, and for being such a wonderful
friend. I have to thank Ben Wellner for his wonderful tutorials, which have made
impossible tasks simple. I want to extend my gratitude to all of the NLP Group; I have
found a helping hand whenever I needed it.
My dear friends: I couldn't have survived these five years if it were not for
your understanding and support.
Mom and Dad: everything I have successfully accomplished in life can be credited
to your absolute love and positive influence.
The woods are lovely, dark and deep,
But I have promises to keep,
And miles to go before I sleep,
And miles to go before I sleep.
-Robert Frost (1874-1963)
Contents
1 Organization of the Thesis ............................................................ P6
2 Introduction ............................................................................. P7
2.2 Building Chinese text processing system based on Alembic system ... P8
2.3 Three phases ......................................................................... P9
2.4 Previous studies ..................................................................... P10
2.5 Resource challenges ................................................................ P13
3 Background .............................................................................. P15
3.2 Alembic system ..................................................................... P15
3.3 Chinese processing system ........................................................ P18
3.4 Segmentation problem in Chinese ............................................... P21
3.5 Guideline for tagging documents ................................................ P25
3.6 Part of speech tagging problem in Chinese .................................... P26
3.7 Phrasing problem ................................................................... P29
3.8 EDT and RDC ....................................................................... P33
3.9 Resources: Treebank and PKU People's Daily data ......................... P34
4 Methods .................................................................................. P37
4.2 Sentence tagging .................................................................... P37
4.3 Segmentation ........................................................................ P38
4.3.1 MMS ........................................................................ P38
4.3.2 Personal Names ........................................................... P42
4.3.3 Heuristic Rule Based system ........................................... P44
4.3.4 Unigram frequency model ............................................... P46
4.3.5 PKU OTS software ....................................................... P47
4.3.6 Rule based Alembic segmentation module ........................... P49
4.3.7 LISP Part of Speech Tagging ........................................... P50
4.4 Phrasing system in Alembic and OTS software ............................... P52
4.4.1 Training .................................................................... P53
4.4.2 Testing and scoring and error tracking ................................ P54
4.4.3 Compiling the frequency data .......................................... P57
4.4.4 Generating wordlists to aid tagging sequences ....................... P59
4.4.5 Manually Generated rules ............................................... P60
5 Results ................................................................................... P62
5.2 Results from phase 1 ............................................................... P62
5.2.1 Sentence tagging .......................................................... P62
5.2.2 Segmentation .............................................................. P62
5.2.3 Part of speech tagging ................................................... P63
5.3 Results from phase 2 ............................................................... P66
5.3.1 Machine Learned Rules for NE Task .................................. P66
5.3.2 Improvements with Addition of Custom Wordlists
and Manually Generated Rules for NE Task ......................... P68
6 Discussion ............................................................................... P70
6.2 Separation of Phase 1 and Phase 2 .............................................. P70
6.3 Segmentation ........................................................................ P70
6.4 Post-process for OTS module .................................................... P71
7 Conclusion ............................................................................... P72
8 Bibliography, references, and resources ............................................ P73
8.2 Web based resources ............................................................... P73
8.3 Academic papers .................................................................... P73
8.4 MITRE reports ...................................................................... P75
8.5 Operating Manuals and others .................................................... P75
Appendix A ................................................................................. P77
A1 Sample sentence tagged output .................................................... P77
A2 Sample segmentation output from MMS ........................................ P78
A3 Sample segmentation output from three word model ......................... P79
A4 Sample segmentation output from Unigram model ............................ P81
A5 Sample segmentation output from PKU segmenter ............................ P82
Appendix B ................................................................................. P84
B1 Sample spec file ..................................................................... P84
B2 Sample learned rules file for phrasing ........................................... P84
B3 Manually Written Rules for Organization Tagging ........................... P87
B4 MMS Segmenter ..................................................................... P89
B5 Translating wordlists from Web based dictionary ............................. P95
Appendix C ................................................................................. P97
C1 Evaluation output of the trained phraser ........................................ P97
C2 Evaluation output of the manual phraser ........................................ P99
Appendix D ................................................................................. P100
D1 Surname list ......................................................................... P100
D2 Foreign phonetic translation characters ......................................... P101
D3 Organization and location suffixes ............................................... P102
Appendix E ................................................................................. P103
E1 University of Pennsylvania Chinese Treebank project ........................ P103
E2 Beijing University Institute of Computational Linguistics
People's Daily project .............................................................. P104
E3 MET 2 and BBN data ............................................................... P105
E4 Frequency data from PKU People's Daily project ............................. P107
E5 Frequency data from U. of Penn. Treebank project ........................... P108
E6 Sample organization part-of-speech frequency results from
People's Daily ....................................................................... P109
E7 Phrase Tagged People's Daily Data (using Manually generated rules) ... P109
E8 Correctly tagged People's Daily data ............................................ P110
1. Organization of Thesis
This thesis covers a broad range of topics. The introduction gives an overview of the
system that we plan to develop. The background section sets up the problems that we
face when building the system. This includes a more detailed description of the software
system, the guidelines for the task that we are trying to perform, and the unique problems
when dealing with Chinese. After the problems are presented, the methods section will
provide the solutions we have envisioned. It will include information on how we utilize
different resources, and different ways we can solve the same problem. The results
section will present a summary of outputs from the various processing schemes described
in the methods section. The discussion section that follows will take up some of the
unresolved issues in our system. Finally, we will conclude with a brief summary of our
findings.
2. Introduction
The accumulation of information in this electronic age is astonishing. Yet we have
few intelligent tools to help individuals manage this goliath and make critical
decisions. Natural language processing research communities are looking closely at this
problem and are attempting to build intelligent software that can "read" and
"understand" as a person does. This research is not limited to English, however. As
information in other languages multiplies at an even higher rate on the web, more
attention has been placed on building multi-lingual information-processing systems.
One such language of interest is Chinese. Used by a fifth of the world's population,
the Chinese language is playing an ever-expanding role on the World Wide Web for
information exchange. Therefore, Chinese language processing is of great interest to
government sponsors and to the research community. Past efforts at processing
Chinese have been undertaken by numerous research institutions and companies. The
primary foci of those efforts have been, separately, segmentation, part of speech
tagging, transliteration, and parallel corpus computing. That research delved deeply
into the finer points of the Chinese language and addressed many important linguistic
and computational problems. This project has a different goal. We aim to build a
complete end-to-end system that is capable of doing Entity Detection and Tracking, and
Relation Detection and Characterization, while utilizing the most advanced methods and
results from other independent Chinese NLP efforts. The project is named ACEC, which
stands for Automatic Content Extraction for Chinese.
In this introductory chapter, we address four general topics. First, we introduce
the Alembic processing environment. Second, we describe the three phases of the
project. Third, we briefly survey the existing studies and language resources for the
Chinese language. Finally, we discuss our resource limitations.
2.2 Building Chinese Text Processing System Based on Alembic System
This end-to-end Chinese processing system will be based on an English processing
system called Alembic.
"Alembic software project for English arose out of the congruence of a need and of a
technology. As information systems become increasingly ubiquitous in the worlds of
work, government, and even the home, so grows the need for improving the acuity of
information systems.
The principal value of Alembic system is that it provides fast and accurate models of
language to identify aspects of a text that are of general interest to information system
users. For instance, Alembic can identify proper names and classify them as to whether
they designate people, places, organizations, and others. In addition, Alembic can be
tailored to search for particular classes of events and to report them in a database format."1
The technical approach underlying Alembic is a new class of trainable language
processing systems, rule sequence processors. These systems can either be configured to
a new task by hand, or they can be configured automatically by providing a learning
procedure with a collection of examples. In the latter case, the learning procedure
automatically generalizes the examples into the same kind of linguistic model a human
engineer would create by hand. This allows for unprecedented ease of development, since
Alembic can be configured to an entirely new task by non-expert users who only need to
create the collection of examples. This collection can be readily and rapidly authored
with the Alembic Workbench2.
1 Description of the Alembic System Used for MUC-6, MITRE 97
2 Summarized from http://www.mitre.org/resources/centers/it/g063/alembic.html
In addition to trainability, the rule sequence processors underlying Alembic are
extremely fast, and confer on the system one of the highest processing throughput rates in
the literature.3
The technical benefits of Alembic stated above can be adapted to a multilingual
system. In fact, the rule-based design and training capability make it easy to adapt the
system to other languages quickly and at low cost. With minimal adjustment to the
existing modules and the addition of new ones, languages such as Spanish and Chinese
can be added to the system's language capability.
The end goal is to have a Chinese text processing system that can achieve similar
results as the English processing system. More details about Alembic processing system
and how it is modified will be presented in the background and the methods section.
2.3 Three Phases of the Project
There are roughly three phases to the overall ACEC project. The first phase is
exploration of text parsing and part-of-speech tagging on Chinese. The second phase is
exploration of named entity tagging, and nominal entity phrasing. The third phase is
Entity Detection and Tracking, and Relations Detection and Characterization. This thesis
primarily focuses on the first and second phases of this process.
The Alembic system is a highly adaptable, rule-based learning machine. But to
achieve high quality results, a preprocessing module needs to be added first to address
the fundamental differences between Chinese and English. First, Chinese does not have
overt word boundaries as English does. Secondly, Chinese does not use inflection to mark
verb tense or part of speech. Thirdly, some Chinese expressions have no direct
counterparts in English, making direct mapping of English rules to Chinese impossible.
These differences can be addressed through an initial module that performs three
functions: (1) segmentation, (2) parsing of sentences into meaningful word chunks, and (3)
3 Quoted from http://www.mitre.org/resources/centers/it/g063/alembic.html
importation of a new set of tags that can address differences in semantics. This initial
module can then adapt Chinese text to a familiar English-like format, and the rest of the
processing may be done with more familiar methods. More details about the different
segmentation and part-of-speech systems will be presented in the background and
methods sections.
The named entity tagging and nominal entity tagging performed in the second phase
of the project will use much of the pre-constructed machinery in the Alembic language
processing environment, namely the training capabilities for NE phrasing and the
flexible rules environment for constructing manual syntactic rules. We will also be able
to add wordlists and lexicons to this array of processing tools.
This thesis will not deal closely with EDT and RDC, which constitute the third phase
of the ACEC project. However, the design considerations for the earlier systems are
closely tied to the requirements of EDT and RDC. Such design considerations are
explicitly noted. The discussion section of this paper will also talk briefly about initial
explorations of EDT and RDC, but we do not have results to back up our theories or
assertions.
Phase I: Sentence parsing; Segmentation; Part of Speech tagging
Phase II: Named Entity Tagging; Lexicon resource generation; Template building
Phase III*: EDT and RDC; Inference
Table 1: Phase I and Phase II are the focus of this thesis project. Phase III is explored but will not be
part of the results.
2.4 Previous Studies
Chinese is such a commonly used language that prior study of it is extensive. We
have done extensive research into the areas relevant to this project. The sponsor's
guidelines explicitly encouraged the use of off-the-shelf (OTS) software whenever it is
available and meets the performance specifications. Thus, a big part of this project is
identifying such resources, evaluating their performance, integrating acceptable outside
resources into our system, and putting together the separate
pieces that are the state of the art in their respective fields. The most important criteria
for evaluating and accepting outside resources are the accuracy of segmentation and part
of speech tagging, and lexicon coverage. Named entity tagging has not been a major
focus of prior research for Chinese; thus the technology for the NE task stems from the
NE tagging capabilities of the Alembic English processing system.
A wealth of information exists on tokenizing Chinese: chunking an otherwise
continuous string of characters, with no white space in between, into meaningful units of
words. We have found good segmentation software from the University of New
Mexico, the University of Illinois, and some online sites4. These packages include Erik
Peterson's implementation of the Maximum Matching String algorithm5 and Chih-Hao
Tsai's implementation of a three-word look-ahead model6. There are also articles on
statistical methods7, heuristic rule-based methods8, Chinese name identification
methods9, foreign name identification methods10, and segmentation methods that use
part of speech information and the Viterbi algorithm11.
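To make the maximum matching idea concrete, the following is a minimal greedy longest-match sketch (the toy dictionary and the four-character window are illustrative only; the implementations cited above add Chinese and foreign name identification and other refinements):

```python
# Greedy maximum matching segmentation sketch: at each position take the
# longest dictionary word that matches, falling back to a single character.
# The dictionary below is a toy example, not a real segmentation wordlist.

def max_match(text, dictionary, max_len=4):
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            # accept a dictionary word, or a lone character as a last resort
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# The longest match wins: "北京大学" (Beijing University) is preferred over
# segmenting "北京" (Beijing) and "大学" (university) separately.
dictionary = {"北京", "大学", "北京大学", "生活"}
print(max_match("北京大学生活", dictionary))   # ['北京大学', '生活']
```

Greedy matching is fast but can be fooled by overlap ambiguities, which is one reason the heuristic and statistical methods above were also explored.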
The first phase of the project involves putting together the wealth of resources
already produced by others to complete the segmentation and part of speech tasks.
We attempt to emulate some of the methods described in the research articles and
adapt downloadable code into our system. The code that is reused directly is
segmentation code. Some of the important methods we explored are maximum
matching, three-word look-ahead, and the unigram frequency model. Adapted into this
code are important Chinese and foreign name identification models that included
4 See references for web based materials
5 http://www.MandarinTools.com
6 http://www.geocities.com/hao510/mmseg/ is the location of the code. Algorithm from the 1992 paper by Chen and Liu.
7 Richard Sproat and Nancy Chang, 1994
8 Chen and Liu, 1992 COLING
9 Large Corpus Based Methods for Chinese Personal Name Recognition, Journal of Chinese Information Processing, 6(3):7-15
10 Large Corpus Based Methods for Chinese Personal Name Recognition, Journal of Chinese Information Processing, 6(3):7-15
11 David D. Palmer and John D. Burger, Chinese Word Segmentation and Information Retrieval, 1997
valuable comprehensive wordlists of Chinese surnames and the like. These wordlists
were built from past research efforts and improved during the current pass.
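As an illustration of how a unigram frequency model chooses among competing segmentations, here is a small dynamic-programming sketch that picks the segmentation maximizing the product of word frequencies (the frequency counts below are invented for illustration; the real model derives them from corpus data):

```python
# Unigram frequency segmentation sketch: among all segmentations allowed by
# the wordlist, choose the one maximizing the product of word frequencies
# (computed as a sum of log probabilities). Counts below are invented.
import math

def unigram_segment(text, freq, max_len=4):
    total = sum(freq.values())
    best = {0: (0.0, [])}              # best[i] = (log score, words) for text[:i]
    for i in range(1, len(text) + 1):
        candidates = []
        for j in range(max(0, i - max_len), i):
            word = text[j:i]
            if j in best and (word in freq or i - j == 1):
                count = freq.get(word, 0.5)    # small floor for unseen characters
                score = best[j][0] + math.log(count / total)
                candidates.append((score, best[j][1] + [word]))
        if candidates:
            best[i] = max(candidates)
    return best[len(text)][1]

# "研究生命" can split as 研究生/命 (graduate student / fate) or
# 研究/生命 (research / life); the frequencies decide.
freq = {"研究": 50, "生命": 20, "研究生": 10, "命": 5}
print(unigram_segment("研究生命", freq))   # ['研究', '生命']
```

The same dynamic program extends naturally to higher-order n-gram models and to the Viterbi-based methods cited above.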
Aside from the segmentation resources we collected or developed, there is also hand-annotated
data critical to this whole project. As stressed earlier, one of the chief benefits
of the Alembic system is its learning capability. Machine training requires a large amount
of manually corrected data. Due to limited time and human resources, manually tagging
our own data is impossible. The next best thing is to use data from outside sources. Some
of the data used in this project come from the University of Pennsylvania Treebank
project12 and Beijing University ICL's People's Daily data13. Their segmentation
guidelines are not quite the same, nor are their part of speech tag sets14. More details
about these two corpora and some other segmented data sets will be provided in the
background section. We have also attached sample files from these important corpora in
the Appendix, along with tables that define the different tag sets.
MITRE has also previously studied processing the Chinese language15. This was done
without the support of staff with a Chinese language background; thus, all
performance levels were achieved by machine learning. Resources developed in
conjunction with that previous effort include derived wordlists of national names,
geographical locations, foreign names, and world capitals16. Those wordlists are of
limited use, because editing was not possible due to the lack of personnel with Chinese
knowledge. During this second initiative at processing Chinese, those resources were
cleaned up and reorganized. In addition, English wordlists were used to build a parallel
corpus in Chinese with web mining software developed for this project.
A large number of unnamed resources went into this project. They are mentioned
either in the thesis or listed in the bibliography. But even with careful
accounting, there are going to be missing pieces. We will attempt to do a thorough job of
12 Appendix E1
13 Appendix E2
14 Table 4
15 David D. Palmer and John D. Burger, Chinese Word Segmentation and Information Retrieval, 1997
16 Figure 11
explaining the most important parts of our system and how each of those parts plays an
irreplaceable role. However, we apologize in advance for anything missing or
misrepresented.
2.5 Resource Challenges
Finally, we would like to highlight some of the limitations of this project before we
finish the introduction and go into further detail. The first limitation is time. This project
was initiated in February 2001; this thesis is being written in May 2001. In
three months, we completed initial exploration of phase I and phase II of the end-to-end
system, and are delving into phase III. The results are good, but are still far from the
numbers of the parallel English system. Better results are almost guaranteed with better
wordlists, more training, and more detailed rules. But the hours needed for those
activities simply do not exist.
Secondly, funding for this project has not been at 100%. Some obviously
superior resources have not materialized due to stalls in licensing. We
can only complete analysis on parts where we can generate output, and simulate the input
into the later modules that require the missing link. This is unfortunate because the
numbers are bound to change when real modules come to replace the imaginary ones.
We will take that into account when writing the results and provide further projections
in the discussion section.
Lastly, we are working with limited linguistic resources. In a perfect world, we
would hand-generate training data with the best part-of-speech tagging and segmentation.
This would optimize the capabilities of the named entity detection rules and provide the
best foundation for future EDT and RDC. But this is a time-consuming and frustrating
process. Thus, a compromise is reached by using less-than-perfect resources from other
institutions and piecing them together into a patchwork of training data and derived
wordlists.
Even with the limitations mentioned above, our exploration of the first and second
phases of building a Chinese text processing system has been thorough and detailed.
This report will be useful to others carrying out the rest of the ACE Chinese project,
both as a reference to all the available resources and as a general guideline.
3. Background
The background section defines the problems we face in building a Chinese
text processing system. The first section of the background describes in more detail
the Alembic system mentioned in the introduction. The second section follows with the
proposed design for the Chinese processing system. The third section defines the
Chinese segmentation problem in more detail and highlights the different guidelines
for this problem. The fourth section talks about the tagging structure used. The fifth
section describes the part-of-speech tagging problem and its ambiguities; we will also
provide some examples of how Chinese differs from English, along with an important
table of different part of speech tag sets. Closely related to part-of-speech tagging is
the phrasing problem, described in the sixth section; there we will also provide
guidelines for named entity tagging. The seventh section provides guidelines for EDT
and RDC as defined by the ACE groups, and details some of the processing steps needed
to do EDT and RDC in the English system. In the final section, more will be said about
two of the important annotated corpora: the Penn Treebank and PKU People's Daily.
3.2 Alembic Processing System17
The Alembic system divides roughly into three general sections: a C and UNIX based
pre-processor, also called pre-alembic; a LISP-based phraser and syntactic analyzer;
and a LISP-based inference system. But as we shall see, the boundaries are not as clear
once we enter the domain of the LISP system, as data flows and accumulates across
all levels and cannot be modeled as a straight series of one-input, one-output
modules.
For the English language, the pre-processor, also called pre-alembic, is a
patchwork of code that dates back to the earliest text processing efforts at MITRE. The
preprocessor divides into two units. The first stop in pre-alembic is called zoning. Often the
17 Description of the Alembic System Used for MUC-6, 1997
input into language processing software is not clean: it comes from various
sources, with various tags and errors. Zoning is the process that strips the
document of symbols that might cause errors. The document is then passed to the
sentence/lexicon detection module, which identifies sentence boundaries and
places <LEX></LEX> tags around words and end-of-sentence punctuation. The LEX
tags include a code for part of speech information about each word, such as "noun",
"verb", or "adjective". The output of the pre-alembic system is a clean, marked-up
document, ready to be further processed by the LISP system.
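The shape of that marked-up output can be sketched as follows (the pos attribute name and tag values here are illustrative; the actual pre-alembic markup conventions may differ):

```python
# Sketch of the pre-processor's final step: wrap each detected token in
# <LEX> tags carrying part of speech information. Attribute and tag names
# are illustrative, not the exact pre-alembic conventions.

def mark_up(tagged_tokens):
    """tagged_tokens: list of (word, pos) pairs for one detected sentence."""
    return " ".join(f'<LEX pos="{pos}">{word}</LEX>' for word, pos in tagged_tokens)

sentence = [("The", "determiner"), ("report", "noun"), ("arrived", "verb"), (".", "punct")]
print(mark_up(sentence))
```

The LISP phraser then operates on this markup rather than on raw text, which is what makes the downstream rule modules language-independent to a first approximation.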
The building blocks of the LISP processing environment are lists of rules. Each rule
implicitly encodes a simple linguistic property. For example, we might have a rule that
says: "if a noun follows an adjective, that adjective modifies the noun." This rule
might apply 90% of the time; for the 10% of exceptions, other simple, straightforward
rules will attempt to patch up the errors. The list of rules for each module is applied
serially to perform the tagging required of that list, and every rule is applied only
once. A complete phrasing module will typically contain one to two hundred rules. The
tagging specified by the most recently applied rule overwrites whatever tagging has been
done before, so the general trend in writing rules is to go from broad to narrow. The
benefits of this rule-based system include speed, flexibility, and machine learning
capabilities: because each rule is almost like a simple observation, a computer with a
large training corpus can generate and re-arrange the rules until it achieves very high
scores on that corpus.
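The serial, overwrite-as-you-go rule application described above can be sketched as follows (the rule format here is invented for illustration and is not Alembic's actual syntax):

```python
# Minimal sketch of serial rule application: each rule fires once over the
# whole token list, and later (narrower) rules overwrite the tags assigned
# by earlier (broader) ones.
def apply_rules(tokens, rules):
    """tokens: list of [word, tag] pairs; rules: (condition, new_tag) pairs."""
    for condition, new_tag in rules:          # rules run serially, once each
        for i, (word, tag) in enumerate(tokens):
            if condition(tokens, i):
                tokens[i][1] = new_tag        # most recently applied rule wins
    return tokens

rules = [
    (lambda toks, i: True, "n"),                         # broad: default everything to noun
    (lambda toks, i: toks[i][0].endswith("ly"), "adv"),  # narrower: override for -ly words
]
print(apply_rules([["run", "?"], ["quickly", "?"]], rules))  # [['run', 'n'], ['quickly', 'adv']]
```

Writing the broad rule first and the narrow rule second is what makes the overwrite ordering work.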
The modules for the LISP phraser start with the named entity tagger. This system is
responsible for tagging all personal names, locations, organizations, geo-social-political
groups (GSP), and governments. The tagged data is then passed to the phraser module,
which is responsible for expanding the nominal entities into nominal phrases. For
example, the phrase "Senator Kerry of Massachusetts" would first be tagged as
"<post>Senator</post> <name>Kerry</name> of <location>Massachusetts</location>" by the
named entity tagger; this sequence would then be expanded into
"<person><post>Senator</post> <name>Kerry</name> of <location>Massachusetts
</location></person>" by the phraser module. More details about phrasing will be
provided in a later section of this chapter.

Figure 1 Schematic of the Alembic pipeline: input (webpages, newswires, etc.) passes
through Pre-Alembic (zoning/cleanup, then sentence/lexicon detection, backed by the
LEXICON) into the LISP-based inference system (part-of-speech tagging, named entity
tagging with wordlists, chunking/phrase finding, grammatical relations, and coreference),
which feeds EDT, RDC, and DSE. The proposed Chinese processing module combines
segmentation, named entity tagging, part-of-speech tagging, and phrasing.
The tagged data is then passed on to the grammatical relations module. As the name
implies, this module is responsible for establishing the relationships between the
nominal and verbal phrases.

The final part of the LISP processing system is the coreference module. After the
grammatical relationships have been established in each sentence, this module attempts
to connect coreferring nominal phrases.

All of the main LISP modules feed into a temporary data bank similar to a template. As
more information is gathered, more tokens and ties are added to the data lists.
Higher-level processing, such as Entity Detection and Tracking and summarization, will
use this database to complete the job.
3.3 Chinese Processing System
A different schematic than the one for the English system is needed for processing
Chinese. The problems unique to Chinese that cannot be resolved using the existing
framework are segmentation and part-of-speech tagging. More details about these problems
are provided in sections 3.4 and 3.5. In this section, we discuss some possible system
solutions and introduce the resources that we are exploring. To summarize our needs: a
component must be created that takes the input from the zoning module and outputs
segmented, part-of-speech tagged, and sentence-tagged text. This data is then passed on
to the LISP phraser and inference modules built into Alembic, and the rest is similar to
the English system. This data flow is illustrated in figure 2 and in the example below.
Input: [Chinese sentence] (Our nation is currently moving from subsistence farming to
commercial farming)
Output: the same sentence segmented into words, each wrapped in a part-of-speech tag,
e.g. <S><t>...</t><n>...</n><d>...</d><ppos>...</ppos><vn>...</vn><n>...</n><v>...</v><n>...</n><w>...</w></S>
Figure 2 Data flow for the Chinese system: output of the zoning module passes through a
black box that performs (I) sentence tagging, (II) segmentation, and (III) part-of-speech
tagging, before reaching the phrasing, inference, and other modules.
In this input/output box, we can treat segmentation and part-of-speech tagging
separately, or we can treat them together. The argument for treating them separately is
increased processing speed and ease of development. On the other hand, part-of-speech
information may help resolve segmentation ambiguities. The ideal solution would be to
build and test both systems, but there are resource limitations: to build a combined
system from scratch, we would need a very good part-of-speech dictionary, which is
unavailable. So by default, the systems developed and tested in-house all treat
part-of-speech tagging and segmentation separately. The first stage is segmentation;
the output from the segmenter is then fed into a LISP-based part-of-speech tagger using
a very limited 50,000-word part-of-speech lexicon.
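The second stage described above, assigning tags to already-segmented words by lexicon lookup, might be sketched as follows (the three-entry lexicon is a tiny stand-in for the 50,000-word lexicon, and the default-to-noun fallback is an illustrative assumption):

```python
# Toy sketch of lexicon-based part-of-speech tagging over segmented words.
# POS codes follow the PKU-style abbreviations used later in this chapter.
POS_LEXICON = {"中国": "ns", "访问": "v", "经济": "n"}

def tag_segments(segments, default="n"):
    """Look each segment up in the lexicon; fall back to a default tag."""
    return [(seg, POS_LEXICON.get(seg, default)) for seg in segments]

print(tag_segments(["中国", "经济", "发展"]))  # [('中国', 'ns'), ('经济', 'n'), ('发展', 'n')]
```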
Segmentation of Chinese sentences is generally treated in one of two ways: there is the
purely statistical approach, and there is the lexicon approach18. Past research has
generally shown that the lexical approach is more effective and requires less processing
power. Thus, all segmentation paths that we are exploring fall under the lexical
approach category. Chinese is read from left to right, and the majority of the segments
can be determined by a technique called maximum matching string (MMS)19. Existing code
for this method, developed by Peter Erikson, can be downloaded from
www.MandarinTools.com. This is our baseline. Other techniques that go beyond this simple
algorithm are maximum matching string with 3-word look-ahead (MMS3), developed by Tsai20
based on Chen and Liu's 1992 paper21, and a 3-word look-ahead model modified from the
MMS model that takes advantage of frequency data. The following figure illustrates the
data flow.
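A minimal sketch of the baseline greedy MMS algorithm: scan left to right, always taking the longest match in the lexicon (the four-entry lexicon here is purely illustrative, not the dictionary used in this project):

```python
# Greedy maximum matching string (MMS) segmentation sketch.
def mms_segment(text, lexicon, max_word_len=4):
    """Scan left to right, always taking the longest lexicon match;
    fall back to a single character when nothing matches."""
    segments = []
    i = 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                segments.append(candidate)
                i += length
                break
    return segments

lexicon = {"中国", "北京", "大学", "北京大学"}
print(mms_segment("中国北京大学", lexicon))  # ['中国', '北京大学']
```

Note how the greedy longest-match choice picks 北京大学 over 北京 + 大学; MMS3's 3-word look-ahead exists precisely to catch the cases where this greedy choice is wrong.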
18 Richard Sproat and Nancy Chang, 1994
19 http://www.MandarinTools.com
20 http://www.geocities.com/hao510/mmseg/
21 Chen and Liu, 1992
Figure 3 The three different modules that can perform segmentation (MMS in Perl, MMS3
in C, or a unigram frequency model in Perl, with supporting code in C and Perl),
combined with a LISP-based sentence segmenter and a LISP-based part-of-speech tagging
module driven by POS wordlists.
The alternative to treating segmentation and part-of-speech tagging separately is to
build them into a single module. This method is effectively exploited by off-the-shelf
(OTS) software from the Beijing University Institute of Computational Linguistics (PKU
ICL)22. The output of this system is not in a format that we can use in the Alembic
processing environment, so post-processing tools are added to the data path to expand
the tags and create the desired data format.
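As an illustration of such a post-processing step, the sketch below assumes the OTS software emits slash-delimited word/tag tokens (an assumption about its output format, not a documented fact) and rewrites them into the <LEX POS="..."> markup the Alembic pipeline expects:

```python
# Hypothetical post-processing sketch: convert slash-delimited "word/tag"
# output (assumed format) into <LEX POS="..."> markup wrapped in <S></S>.
def pku_to_lex(tagged_sentence):
    lex_tokens = []
    for token in tagged_sentence.split():
        word, _, pos = token.rpartition("/")   # split on the LAST slash
        lex_tokens.append(f'<LEX POS="{pos}">{word}</LEX>')
    return "<S>" + "".join(lex_tokens) + "</S>"

print(pku_to_lex("迈向/v 希望/n"))
# <S><LEX POS="v">迈向</LEX><LEX POS="n">希望</LEX></S>
```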
Figure 4 Installing the Beijing University segmentation and POS software into Alembic,
followed by sentence tagging, conversion, and correction stages.
With the correct segmentation and part-of-speech information, the data is then fed into
the Alembic rule-based modules. The first pass is the named entity tagger, just like in
the English system. The data is then passed on to the phraser, which expands the named
entity tags into nominal phrases. All the while, information is fed into the template.
The exploration in this project is limited to finding the optimal solution to
segmentation, part-of-speech tagging, and the named entity task.
22 http://www.icl.pku.edu.cn/nlp-tools/segtagtest.htm
3.4 Segmentation Problem in Chinese
Before discussing processing techniques for Chinese, we first need to establish the
basics of Chinese grammar and the notion of a word. While Chinese has a rich tradition
of phonology, grammar studies are fairly recent on the historical scale. The original
studies of Chinese grammar were done as a result of Western language influences. The
concept of a word is intuitive to the native speaker, but an effort to establish a basic
guideline for singling out a word and giving it a grammatical role was not undertaken
until the late 19th century, after the introduction of Western linguistic concepts. So
the grammatical concepts of Chinese are not completely alien to those of the Romance
languages. Yet there are a great many instances of ambiguity when we attempt to dissect
the Chinese language using the basic linguistic rules provided for English. Thus arose
the need to import a different set of guidelines for handling Chinese.
The first difficulty in analyzing Chinese is segmentation and the definition of a word.
The most basic unit of the Chinese language is the morpheme23. A morpheme is one Chinese
character. On a computer, a Chinese morpheme can be expressed in the GB and BIG5
encodings, where each character is made up of two bytes; in Unicode, a character may be
encoded in 16 or 32 bits, depending on the version of Unicode. There are bound morphemes
and free morphemes. Free morphemes are characters that retain an independent meaning and
can appear on their own. Bound morphemes are those that do not appear independently. An
example of a free morpheme is 狗 (gou - dog). While this character often appears on its
own, it can also enter into polysyllabic combinations with other morphemes to form
words, such as 母狗 (mugou - bitch) or 狗熊 (gouxiong - bear). Absolutely bound
morphemes are much rarer in Chinese. An example of a bound morpheme is 蒺藜 (jili -
caltrop, a type of plant), where the first character 蒺 (ji) has no meaning independent
of the second character. A sentence in Chinese is formed of a string of morphemes
without spaces to separate them. The reader supplies
23 Jerry Norman, Chinese, 1998
intuitive boundaries for the words and constructs them into logical sequences. Classical
Chinese literature mainly used the single morpheme as the basic word unit, as in the
writings of the traditional courtiers of imperial China. But around the beginning of the
Republican era, Chinese scholars and writers began to adopt baihua, a written form of
the spoken language. Baihua contained many more polysyllabic words. A slight variation
on this historical language development survives today as the official language of
China, called Mandarin. Mandarin possesses many polysyllabic words, yet the written form
of Mandarin follows the tradition of no space boundaries between words24. This poses a
special problem for linguistic computation, which bases much of its algorithms on the
fundamental unit of a word.
The most encompassing law for segmentation is that words are "strings of characters
which are the smallest unit of meaningful sequences that plays the same role in or out
of context". We will attempt to apply this rule wherever we can to the problem of
segmenting Chinese, but there are many cases where the rule is applied with a lot of
ambiguity. For instance, the sentence 中国联想公司的副总经理访问浦东特区 (zhong guo
lian xiang gong si de fu zong jing li fang wen pu dong te qu - the deputy general
manager of China's Lianxiang company visits the Pudong special economic region) contains
副总经理 (fu zong jing li - deputy general manager) and 特区 (te qu - special economic
region). If we apply the strictest sense of the smallest-unit-of-meaning rule, we will
get 副|总经理 (deputy | general manager) and 特|区 (special | region). Yet most
linguists would agree with the first case of segmentation, but would disagree with the
second case of segmentation. This is because of a second law, which states: "bind
morphemes together when, if treated independently, their meanings are highly adaptable".
In this example, 特区 (te qu) is treated as one word even though both characters are
free morphemes, because 区 (qu) is almost always preceded by some kind of modifier and
特 (te) easily forms polysyllabic words. In other words, both morphemes are very
versatile25. But this introduces a lot of confusion, because the versatility of a
morpheme is a highly subjective topic, and different linguists will have different ideas
of how to segment the same sentence. Often, two annotators will have 10% disagreement
over the same corpus.

24 Jerry Norman, Chinese, 1998
For the purposes of our project, we have not adopted any specific guideline. The rule of
thumb remains "smallest unit of meaning". In any case, the specifics of segmentation
have more to do with the ease of information extraction in the downstream processing
modules. The word unit has to be small enough that (1) accurate part-of-speech
information can be assigned and (2) we will not have to take words apart in later
modules to do EDT and RDC. An example would be comparing 副校长 (fuxiaozhang - vice
chancellor) with 校长 (xiaozhang - chancellor). We would most definitely want 副校长
(vice chancellor) segmented as 副|校长 (vice | chancellor), because it would then be
easier to establish the relationship between vice chancellor and chancellor in
downstream processing, instead of treating them as completely independent positions.
The next immediate problem after determining what constitutes a word segment is how to
actually extract this information from a sentence. A native reader of Chinese determines
the correct segmentation from meaning and context.
The corpora that we will primarily be working with are official Chinese newswires and
certain legal documents. These texts are written in putonghua, or "common language".
Aside from some special articles and summary reports, the language used in these
articles follows a journalistic tone, giving us ample opportunity to come up with common
rules. In addition, the polysyllabic words are easily found in the dictionary.
Two sources of journalistic text are explored. One is the Treebank project annotation of
Xinhua News Agency articles, and the other is the PKU ICL annotation of People's Daily
articles26. The frequency of word lengths is presented in the following table.

25 Jerry Norman, Chinese, 1998
              Xinhua News Agency     People's Daily
Total words   82,148                 907,017
1 morpheme    27,187   (33.1%)       337,724  (37.2%)
2 morphemes   47,216   (57.5%)       492,879  (54.3%)
3 morphemes    6,330    (7.7%)        51,116   (5.6%)
4 morphemes      857    (1.0%)        19,375   (2.1%)
5 morphemes      468    (0.6%)         5,059   (0.6%)

Table 2 Word length distribution of Xinhua News Agency and People's Daily text:
comparison of segmentation results from the U. of Penn. Treebank project and the PKU
ICL annotation.
26 See Appendix E1, E2
3.5 Guideline for Tagging Documents
Before introducing the part-of-speech tagging and phrasing problems, we will introduce
the SGML tagging guidelines. To start, every article needs to have a correct set of HTML
tags: every processing module looks for <DOC> and <TXT> tags to begin processing.
Instead of putting everything into words, the table below illustrates all the tagging
needs.
Module output: Newswire data
Tags: (dirty) HTML, missing or mis-labeled tags
Example: <DOC> <TEXT> p<> ... </DOB>

Module output: Zoning
Tags: (clean) HTML
Example: <DOC> <TEXT> <P>[sentence]</P> </TEXT> </DOC>

Module output: Sentence segmentation
Tags: <S></S>
Example: <DOC> <TEXT> <P><S>[sentence]</S></P> </TEXT> </DOC>

Module output: Segmentation and P.O.S. tagging
Tags: <LEX POS="v"></LEX>
Example: <DOC> <TEXT> <P><S><LEX POS="nr">[name]</LEX><LEX POS="v">[verb]</LEX><LEX POS="nt">[organization]</LEX><LEX POS="w">.</LEX></S></P> </TEXT> </DOC>

Module output: Named Entity Task
Tags: <ENAMEX TYPE="ORG"></ENAMEX>
Example: <DOC> <TEXT> <P><S><ENAMEX TYPE="PERSON"><LEX POS="nr">[name]</LEX></ENAMEX><LEX POS="v">[verb]</LEX><ENAMEX TYPE="LOCATION"><LEX POS="nt">[location]</LEX></ENAMEX><LEX POS="w">.</LEX></S></P> </TEXT> </DOC>

Table 3
3.6 Part of Speech Tagging Problem for Chinese
Grammar is a direct result of the identification of the word. Sequences of words placed
together in a coherent and consistent way must be linked through the concept of part of
speech, because each word either plays a role in gluing the other words together or
takes an active part in the meaning of the sentence. This concept is as Western to the
Chinese language as the concept of a word, but it too has been adapted to fit the
framework of the Chinese language. In today's linguistic research community, the major
word classes (parts of speech) enjoy a fairly high degree of agreement. The scheme
outlined below is close to that found in several widely used grammars.
There are six full word categories: nouns, verbs, adjectives, numerals, measures, and
pronouns27. A noun is something that can be directly counted with a numeral: 三辆车 (san
liang che - three cars). A verb is a word that can be negated by 不 (bu - not): 不去 (bu
qu - not going). Adjectives are generally considered to be relatives of the verb; they
can also be negated by 不 (bu - not) and can function independently as predicates.
Numerals include the classes of Chinese numbers and estimations of quantities, in
addition to Roman numerals. Measures are words that indicate the unit for each noun. In
the previous noun example, 三辆车 (san liang che - three cars) is actually formed of 三
(san - three), 辆 (liang - measure word for vehicles), and 车 (che - car). Measures have
a similar function to "sheets" in "three sheets of paper" in English, but they are much
more prevalent. Pronouns are substitutes for persons or things. Syntactically they
generally behave like nouns, but they do not admit modification. The third person
pronoun is gender-specific in certain writing and unspecific in others: 他, 她, 它 (ta -
he, she, it). Other pronouns, such as demonstrative pronouns, include 这 (zhe - this)
and 那 (na - that).
27 Jerry Norman, Chinese, 1998
Other words that are commonly included with the pronouns are 什么 (shenme - what), 自己
(ziji - self), and 每 (mei - every).
The functional word classes are adverbs, prepositions, conjunctions, particles,
interjections, and onomatopoeia28. Adverbs modify verbs and adjectives, and are usually
bound. Chinese prepositions originate from verbs and are used to show various
grammatical relationships. Conjunctions form different kinds of connections: 和 (he -
and), 而且 (erqie - moreover), 因为 (yinwei - because), 所以 (suoyi - therefore).
Particles are generally monosyllabic words used to show different grammatical
relationships or subjective and modal overtones: 的 (de), 把 (ba). Interjections
function as asyntactic forms used to express warnings; phonologically they contain
irregularities, such as elements that are not part of the regular phonemic inventory.
Onomatopoeia are words which mimic the sounds of things; syntactically they behave much
like adverbs or nouns.
As in English, there is considerable overlap between the different classes. One word or
morpheme may belong to several categories. An example is 批评 (piping - criticize),
which in this case is a verb, but we can use the same 批评 as a noun meaning criticism.
Below are two sentences that use the same word with different parts of speech:

[Chinese sentence] (We should criticize this kind of behavior)
[Chinese sentence] (Your criticism is justified)
This kind of overlap is much more prominent in classical Chinese literature, but in
today's written language it is no more prevalent than in English. Thinking back to the
segmentation and part-of-speech tagging diagrams, we will have to build post-processing
modules for any imported software to make the necessary corrections. In the example
above, we can write a rule stating that any segment tagged as a verb has its tag changed
to noun if it comes after the character 的 (de).

28 Jerry Norman, Chinese, 1998
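That correction rule could be sketched as follows (an illustrative stand-in for the system's actual rule syntax, operating on (word, tag) pairs):

```python
# Illustrative post-processing rule: retag a verb as a noun when it
# immediately follows the particle 的 (de).
def retag_verb_after_de(tokens):
    """tokens is a list of (word, pos) pairs."""
    fixed = []
    for i, (word, pos) in enumerate(tokens):
        if pos == "v" and i > 0 and tokens[i - 1][0] == "的":
            pos = "n"  # a verb following 的 is acting as a noun
        fixed.append((word, pos))
    return fixed

# 批评 (criticism) mis-tagged as a verb gets corrected; 是 keeps its verb tag.
tokens = [("你", "r"), ("的", "u"), ("批评", "v"), ("是", "v"), ("对", "a")]
print(retag_verb_after_de(tokens))
```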
This is just a brief summary of the possible part-of-speech designations in Chinese.
There are almost always subdivisions, and new categories are created by different
linguistic camps. The two main groups we have looked at are the University of
Pennsylvania's Chinese studies department and Beijing University's ICL. The University
of Pennsylvania developed their part-of-speech tag set based on the requirements of the
Chinese Treebank project, an effort to completely parse Chinese sentences. PKU ICL
developed their part-of-speech tag set mainly to aid computational linguistics. The tags
from Beijing University are presented below. These are more important to our system
because the majority of the training was done on corpora with these POS tags.
PKU POS   Definition
na        Noun used as adjective
Mg        Numeral used as a noun
vd        Verb used as adverb
nr        Personal name
a         Adjective
ns        Nominal location
nt        Nominal organization
b         Negative
c         Conjunction
d         Adverb
vn        Verb that can also be a noun
e         Another form of interjection
ad        Adverbial adjective
f         Location
h         Head
i         Idiom
l         Transient word
n         Noun
an        Adjective that can also be a noun
o         Onomatopoeia
p         Prepositional
q         Quantity measure
r         Pronoun
s         Space
t         Time
u         Auxiliary
v         Verb
y         Tone

Figure 5 A partial list of PKU POS tags.
3.7 Phrasing Problem29
Traditionally, the phrasing problem for Chinese is viewed as two separate tasks. The
first pass is the Named Entity Detection task as defined under the Multilingual Entity
Task guidelines. The second pass is Entity Detection as defined under the EDT and RDC
guidelines. The difference between the two tasks is coverage. The NE task has a narrower
focus, aimed at identifying unambiguous entities in any corpus while ignoring unspecific
mentions. The EDT and RDC task guidelines have a broader focus: the end goal of the EDT
and RDC entity task is to find specific mentions (as specified in the NE task) and
expand their association to unspecific mentions. So the EDT and RDC phrasing guidelines
cover specific mentions as well as some reference mentions. There are two different
scoring processes for the NE task and EDT. Both take machine-generated phrase tags,
compare them to hand-annotated phrase tags, and calculate a percentage called the
f-measure (more about the f-measure is presented in the methods section, 4.5). The
hand-annotated phrases need to abide by the different task guidelines depending on which
task we are trying to accomplish.

29 EDT Metonymy Annotation Guidelines Version 2.1 (2001), Multilingual Entity Task (MET)
Definition for Chinese (1996)
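For concreteness, the balanced f-measure is the harmonic mean of precision and recall; the counts below are from a hypothetical tagging run, not a result from this project:

```python
# Standard balanced f-measure over phrase-tag counts (details in section 4.5).
def f_measure(correct, guessed, actual):
    precision = correct / guessed   # fraction of produced tags that are right
    recall = correct / actual       # fraction of true tags that were found
    return 2 * precision * recall / (precision + recall)

# Hypothetical run: 100 tags produced, 80 correct, 90 tags in the answer key.
print(f_measure(correct=80, guessed=100, actual=90))  # ≈ 0.842
```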
Under the sponsor's contract, our end goal is EDT and RDC; the traditional NE task is of
less importance to the whole process. So our first phrasing pass is a hybrid of the
traditional NE task and the complete EDT named entity detection task. In this design, a
more inclusive second pass will make corrections on the first pass and complete the EDT
named entity phrasing task. This hybrid NE task is similar to the traditional NE task in
that it only allows a single NE tag around any phrase. The markup is also based on the
SGML tag set used for NE. The traditional NE task can be broken down into three
subtasks, responsible for capturing entity names, temporal expressions, and numeral
expressions, respectively. Entity names are mentions of organizations, persons, and
locations. We will present some of the entity tagging in more detail in the following
chapters, but first we will take a closer look at the simpler tasks of marking up
temporal and numeral expressions.
Temporal expressions cover "absolute" and "relative" time expressions30. In the 6th
Message Understanding Conference, the NE tagging guidelines for English only specified
the tagging of absolute time expressions. This standard is relaxed and expanded for
Chinese to include relative time expressions31. Absolute time expressions include any
specific mentions of time, date, and duration that are independent of the document.
Relative time expressions include all other specific mentions of time, such as
"yesterday", "last month", or "in a few hours". Just as in English, time expressions in
Chinese have exclusive units that can be anchored to achieve high performance relatively
easily32. The tagging process is presented briefly in the methods section, but it will
not be part of the results, because detailed analysis of this relatively simple task is
not needed. Numeral expressions are mentions of any monetary amount or percentage.
Similar to the temporal expression task, there are clear anchors to be used when dealing
with these phrases. Rules used to tag these expressions are dealt with briefly in the
methods section.

30 Multilingual Entity Task (MET) Definition for Chinese, 1996
31 Multilingual Entity Task (MET) Definition for Chinese, 1996
32 http://www.MandarinTools.com
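A toy sketch of anchoring on exclusive time units such as 年 (year), 月 (month), and 日 (day); the patterns and the TIMEX markup are illustrative assumptions, not the rules actually used in the system:

```python
# Illustrative anchoring sketch: a number run followed by an exclusive time
# unit (年 / 月 / 日) is wrapped in a time-expression tag.
import re

DIGITS = "0-9一二三四五六七八九十〇"
TIME_PATTERN = re.compile(rf"[{DIGITS}]+年|[{DIGITS}]+月|[{DIGITS}]+日")

def tag_times(sentence):
    return TIME_PATTERN.sub(lambda m: f"<TIMEX>{m.group(0)}</TIMEX>", sentence)

print(tag_times("一九九七年七月一日香港回归"))
# <TIMEX>一九九七年</TIMEX><TIMEX>七月</TIMEX><TIMEX>一日</TIMEX>香港回归
```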
The more difficult task is the entity task. The uniqueness of Chinese means that many of
the concepts used to handle the English entity detection task cannot be applied; Chinese
entity types have to be examined independently of their English counterparts. The three
subcategories for entities in Chinese are persons, locations, and organizations. The
hybrid technique that we have adopted for the first phrasing pass is also responsible
for tagging some of a fourth category, namely Geo-Political Entity (referred to as GPE
from now on). We will examine the problems with each of these categories separately.

As the reader knows, Chinese does not have "capitalized" characters, so personal names
cannot be picked out like their English counterparts. On top of that, this is also a
problem for segmentation, because Chinese sentences contain no spatial separation.
Political entities and locations might be resolved at the segmentation level with a good
inclusive list, but personal names are a completely different story. There is an
unlimited number of possible personal names, including all Chinese names as well as
translations of English names. So the recognition of Chinese and English personal names
has to be solved at the segmentation level. There are rules for forming a Chinese name,
though, and a detailed description of how these rules are exploited is provided in the
methods section. To give a light overview: the method takes advantage of the fact that
Chinese surnames come from a finite list, and a given name is usually no more than two
characters long.
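The surname-anchored heuristic just described might look like this in outline (the surname list is a tiny illustrative subset of the finite list, and the input is assumed to be already character-split):

```python
# Sketch of the surname heuristic: a known surname followed by a one- or
# two-character given name is guessed to be a personal name.
SURNAMES = {"王", "李", "张", "刘", "陈"}  # tiny illustrative subset

def find_chinese_names(tokens):
    names = []
    for i, tok in enumerate(tokens):
        if tok in SURNAMES and i + 1 < len(tokens) and 1 <= len(tokens[i + 1]) <= 2:
            names.append(tok + tokens[i + 1])
    return names

print(find_chinese_names(["张", "伟", "访问", "上海"]))  # ['张伟']
```

A real system needs more context than this, since surname characters also occur as ordinary words, which is why the thesis resolves names at the segmentation level rather than in isolation.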
This method does not work for foreign names. Fortunately, foreign names are usually
sound translations that derive their phonetics from a finite set of Chinese characters,
so we have methods that work with this characteristic to tag foreign names. In summary, a
majority of personal names will be resolved at the segmentation level, before the LISP
system. The tags assigned to names at the segmentation level will be LEX tags whose
"part-of-speech" attribute designates a personal name. Based on this "part-of-speech"
information, phrase tags are then added around the personal names. At this stage of
tagging, titles will not be included as part of the person, nor will other entity
modifiers.

Phrasing of locations and organizations is done in the LISP-based rule modules. Unlike
personal names, the word components of locations and organizations are already
segmented by the time they reach the phraser. The phraser needs to identify all the
components that make up the organization or location and put them together into a
phrase.
The general rule of thumb is to put together things that are not separated by possessive
particles. 美国国防部 (meiguo guofangbu - American defense department) should be treated
as one organization, whereas 美国的国防部 (meiguo de guofangbu - America's defense
department) would be tagged separately as a location and an organization33. Or, if a
superior organization or location precedes the immediate mention of a subordinate or
representative organization, and they are not separated by a possessive, they should be
tagged as one34. An example of that would be 中国驻美领事馆 (Chinese consulate stationed
in the US): China (中国) is separated from the consulate (领事馆) by the verb 驻 (zhu -
stationed), but since this is not a possessive, the primary organization is combined
with the consulate to form a single organization. Similar rules apply for locations:
中国北京 (zhongguo beijing - Beijing, China) is tagged as one location because the two
are not separated by a possessive.
33 Multilingual Entity Task (MET) Definition for Chinese, 1996
34 Multilingual Entity Task (MET) Definition for Chinese, 1996
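The possessive rule could be sketched as follows (tag names are invented for illustration; a location token merges into a following organization token unless the possessive 的 intervenes):

```python
# Toy sketch of the possessive rule: adjacent location + organization tokens
# merge into one organization phrase unless separated by 的.
def merge_entities(tokens):
    """tokens: list of (word, tag) with tag in {'loc', 'org', 'de', 'other'}."""
    merged = []
    for word, tag in tokens:
        if merged and tag == "org" and merged[-1][1] == "loc":
            merged[-1] = (merged[-1][0] + word, "org")  # 美国 + 国防部 -> one org
        else:
            merged.append((word, tag))
    return merged

print(merge_entities([("美国", "loc"), ("国防部", "org")]))
print(merge_entities([("美国", "loc"), ("的", "de"), ("国防部", "org")]))
```

In the second call, the intervening 的 breaks the adjacency, so the location and organization stay separate, matching the 美国的国防部 example above.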
Generic mentions are in most cases not tagged. Generic mentions constitute mentions of
organizations and locations that are not identifiable out of context. Examples of such
mentions are 反对党 (fanduidang - opposition party) and 办公室 (bangongshi - office).
The MET2 guidelines for NE tagging in Chinese specify that "US government" is treated as
a generic mention and not tagged. This rule is relaxed in our hybrid NE system: any
location linked with a government office is tagged as an organization. The MET2
guidelines further specify that each word may only have one NE tag. This rule is adhered
to. Our tag set includes only ORGANIZATION, LOCATION, and PERSON; any word will be
wrapped with at most one of these tags. Under EDT guidelines, locations that make
reference to an organization, or broad organization mentions, are tagged as a GPE. The
US Government is one example of a GPE. Another example of a GPE is 白宫 (baigong - the
White House). In our NE task solution, we handled part of the GPE tagging by tagging
白宫 as a location and the US Government as an organization. This is far from complete,
however, and much more needs to be done before the EDT entity detection task is
satisfied.
3.8 EDT and RDC
We will now talk briefly about the EDT and RDC task for the benefit of discussion.
There are no results to show for EDT and RDC, but it is important to understand the final
goal of the project in order to appreciate the intermediate steps. EDT and RDC stand for
Entity Detection and Tracking, and Relations Detection and Characterization. The goal
of EDT is to identify a specific entity and track its activities or characteristic through
documents. The EDT task is actually a complex of different tasks. 3 5
1.
The detection of entities. This is the bases on which all other tasks are built. This
is for the detection of specific entities. The named entity phrasing module
35 EDT Metonymy Annotation Guidelines Version 2.1, 2001
Author: Daniel Hu
MIT-EECS
Page 33
Development of Part of Speech Tagging and Syntactic Analysis Software for Chinese Text
5/29/01
completes part of the task, although another processing layer need to be added to
clarify between location, organization, and GPE.
2.
The recognition of entity attributes. This is partly integrated with the first part.
Distinctions need to be made as to the type of named entity. The tag set expand
slightly as more specific labels are assigned.
3. The detection of entity mentions. This measures the system's ability to associate a
set of mentions with a specific underlying entity. This includes pronominal
mentions.
4. The recognition of mention extent.36
The RDC uses information from EDT and advances it one step further by finding
relationships between the different entities that are supported by explicit or implicit textual
references. The target relations are Role, AT, and Part, which respectively link a person
with an organization, an organization or person with a location, and a person or organization
with a person or organization. All three relations contain fields for temporal
attributes. In addition, the Role relation also carries a Role attribute to classify the
connection between two entities.37
To satisfy the EDT and RDC requirements, phrasing capability would have to be
expanded to include titles, pronominal mentions, some generic mentions, and all GPEs. To
this point, we have completed only part of the GPE tagging beyond the traditional NE
task. But we believe that the techniques developed for the NE task may be easily adapted to
build more sophisticated rules to tag the other mentions.
3.9 Treebank and PKU Data
University of Pennsylvania and Beijing University (PKU) have independently hand-annotated
data available as a resource to computational linguists. This is a valuable tool
for machine training over a wide range of issues. University of Pennsylvania's effort was
36 EDT Metonymy Annotation Guidelines Version 2.1, 2001
37 Preliminary Guidelines for Relation Detection and Characterization (RDC), Draft Version 2.2, 2001
aimed at creating a 100,000-word corpus of Mandarin Chinese text with syntactic
bracketing. Their effort consisted of two stages: the first stage was word segmentation and
part-of-speech tagging, and the second stage was syntactic bracketing. Beijing University's
annotation effort focused only on segmentation and part-of-speech tagging, although
later efforts added basic phrase identification. The size of their corpus is much larger, at
around 1 million words. Both research facilities used Chinese news data. University of
Penn annotated Xinhua News Agency articles; PKU annotated one month of People's
Daily. The tone of both sources is very similar. For our purposes, segmentation and
part-of-speech tagging were enough, so the University of Penn data was stripped of the
syntactic bracketing before being used for POS training and evaluation. The
segmentation guidelines for both annotation efforts are similar, but there are small
differences, and this is reflected in the percentage of words of different lengths (see table
2). Part-of-speech tagging is also different. Since University of Pennsylvania extended
their effort to include syntactic analysis, their part-of-speech tags carry less detail and rely
more on syntactic labels. PKU relied exclusively on part of speech and some limited
nominal differentiation to carry all of the information, and therefore developed a larger
tag set. Table 4 presents their POS tags in detail. The following is some sample data.
Larger samples can be viewed in Appendix E.
University of Pennsylvania Treebank Data:
<S>
((IP (NP-SBJ (NP-PN (NR …))
             (NP (NN …) (NN …) (NN …)))
     (VP (VP (VV …) (AS 了))
         (NP-OBJ (NP-APP (NN …) (PU ，) (NN …) (PU ，) (NN …))
                 (QP (CD …) (CLP (M …)))
                 (NP (NN …)))
         (PU ，)
         (VP (NP-TMP (NT …)) (ADVP (AD …)) (VP (VV …))))
     (PU 。)))
</S>
Input into the Alembic system would appear as:

<S><LEX POS="NR">…</LEX><LEX POS="NN">…</LEX><LEX POS="NN">…</LEX><LEX
POS="NN">…</LEX><LEX POS="VV">…</LEX><LEX POS="AS">了</LEX><LEX
POS="NN">…</LEX><LEX POS="PU">，</LEX><LEX POS="NN">…</LEX><LEX
POS="PU">，</LEX><LEX POS="NN">…</LEX><LEX POS="CD">…</LEX><LEX
POS="M">…</LEX><LEX POS="NN">…</LEX><LEX POS="PU">，</LEX><LEX
POS="NT">…</LEX><LEX POS="AD">…</LEX><LEX POS="VV">…</LEX><LEX
POS="PU">。</LEX></S>
Beijing University People's Daily Data:
19980110-01-001-023/m …/v …/v …/v …/n …/vn ，/w …/vn ，/w …/n 。/w
Input into Alembic system would appear as:
<S><LEX POS="v">…</LEX><LEX POS="v">…</LEX><LEX POS="v">…</LEX><LEX
POS="n">…</LEX><LEX POS="vn">…</LEX><LEX POS="w">，</LEX><LEX
POS="vn">…</LEX><LEX POS="w">，</LEX><LEX POS="n">…</LEX><LEX
POS="w">。</LEX></S>
4. Method
The methods section provides the reader with details on how we developed
solutions to the problems and questions framed in the background section. We
will be using tools developed by the author of this paper or adaptations of resources available
from other research labs. The organization of the methods section roughly corresponds to
the flow of data through the system. The first section deals with sentence tagging and
paragraph tagging, a prerequisite for anything passing through to the Alembic processing
environment. Then we describe our segmentation efforts, going into detail about how we
integrated three different segmenters into the system, along with a description of the OTS
software from Beijing University. Embedded in the segmentation problem is also the
problem of name identification. The segmentation section ends with a conclusion and a
summary of the pros and cons of the different methods we have examined. The next section
provides a solution to part-of-speech tagging. Throughout the segmentation and POS
sections, we mention using various derived wordlists. The next section gives details
about how those wordlists are derived and merged. There is also an inventory of extra
wordlists that will help the NE and EDT tasks. The next large step in this process, after
the Chinese text has been segmented and POS tagged, is NE tagging. We divided the
phrasing task into machine training, scoring, generating frequency data, manual wordlist
development, and finally manual rule writing. Our efforts up to this point in time are
satisfactorily described by the above summary.
4.2 Sentence tagging
Sentence tagging is a relatively simple task for Chinese. In English, we run into
problems with ambiguous punctuation such as "Mr." or "P.O.W." In Chinese, such
mid-sentence abbreviations never use terminal punctuation such as the Chinese
period (。), the exclamation mark, or the question mark. The baseline for sentence
tagging is a simple algorithm that reads down the string and, whenever a period, an
exclamation mark, or a question mark is found, assumes that the previous sentence has
ended and a new sentence has begun. This works well until we run into spoken text and
casual writing such as email. There are two approaches we have adopted in our code to
overcome these problems. For the problem of quoted speech endings, we have created a long list
of possible punctuation strings. For the second problem, we listed a small group of
terminal punctuations; if only those appear in a string of punctuations, we call it the
end of a sentence. We also run into the problem of headlines. Some texts have headlines
within the <TEXT></TEXT> region, and those rarely have end-of-sentence
punctuation. In order to minimize confusion to the segmenter, if a sentence appears on
a line of its own and the $insentence token is false, we tag the whole line as a
sentence. There may be exceptions to the rule, but this works very well for all newswire
text. To see sample output from the sentence tagging module, please refer to table 3 and
the sample text in the appendix.
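A minimal sketch of this baseline in Python (the production module is written in PERL, and the punctuation set here is a small illustrative sample, not the full lists we use):

```python
# Illustrative terminal punctuation; the real module uses much longer lists.
TERMINAL = set("。！？")  # Chinese period, exclamation mark, question mark

def split_sentences(text):
    """Greedy left-to-right scan: close a sentence at each terminal mark.
    A trailing fragment with no terminal punctuation (e.g. a headline)
    is tagged as a sentence of its own."""
    sentences, current = [], []
    for ch in text:
        current.append(ch)
        if ch in TERMINAL:
            sentences.append("".join(current))
            current = []
    if current:
        sentences.append("".join(current))
    return sentences
```

Note that this baseline still splits inside quoted speech; the production version consults the longer list of punctuation strings before closing a sentence.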
4.3 Segmentation

4.3.2 MMS
The baseline for Chinese segmentation is the maximum matching string (MMS)
model, which is implemented in Erik Peterson's code downloaded from the web. Chinese
is read from left to right, so all of the multi-character words can be matched by scanning
the text from left to right and applying a greedy algorithm. All possible word matches
are compared with the words in a lexicon. On each loop of the algorithm, a morpheme is
appended to the right of the existing string; if the new string exists in the
dictionary, the character is added to the string. If the newly formed string makes no
sense, then the old string is segmented out as one word, and the new character is
treated as the leftmost character of a potential new word.
Generic sentence:
Our test sentence: Chinese agriculture is developing quickly.
中 (zhong - central)
中国 (zhongguo - China)
中国农 (zhongguonong - makes no sense, so nong is separated)
中国_农业 (*_nongye - *_agriculture)
中国_农业发 (*_nongyefa - makes no sense, so fa is separated)
中国_农业_发展 (*_fazhan - *_development)
中国_农业_发展迅 (*_fazhanxun - makes no sense, so xun is separated)
Figure 6 Basic application of the Maximum Matching String algorithm
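The greedy matching just illustrated can be sketched as follows (Python for brevity; the code we actually modified is Erik Peterson's PERL script, and the toy lexicon below is an assumption for illustration):

```python
def mms_segment(text, lexicon):
    """Greedy maximum matching: at each position, emit the longest
    string found in the lexicon, falling back to a single character."""
    words, i = [], 0
    while i < len(text):
        best = i + 1                      # single-character fallback
        for end in range(i + 1, len(text) + 1):
            if text[i:end] in lexicon:
                best = end                # remember the longest match so far
        words.append(text[i:best])
        i = best
    return words
```

With the toy lexicon {中, 中国, 农业, 发展, 迅速}, the sentence from Figure 6 segments as 中国_农业_发展_迅速.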
This method works fairly well for short strings of one or two bound morphemes.
But the most obvious problem arises with words that do not progress in a
one-morpheme-at-a-time fashion. Idioms are common in Chinese, and the majority of idioms are
composed of four or five characters. Many times the first one or two characters might
make sense as a word, but how do we take the leap from a one- or two-character string to a
four- or five-character string? The key is building differentiation into the wordlist. When
a lexical dictionary is initialized for computation purposes, we create two states for
each string. One is "this string exists as an independent word". The second is "this
string may exist as an independent word, but only if you add more to it". We also
add a temporary variable that stores the characters which are uncertain as to whether
they will be added to the previous segment. So when a string is looked up in the lexicon,
we get one of three possible states: "doesn't exist", "add more", or "this is a word". The
figure below illustrates the application of this algorithm.
Sentence with an idiom:
这是一个一箭双雕的计划。 (This is a plan to kill two birds with one stone)
The idiom: 一箭双雕 (yijianshuangdiao - killing two vultures with one arrow)

Segment                                              State        Temp Variable
这_是_一个_一 (zhe-shi-yige-yi - this is a ...)        add more     一 (yi)
这_是_一个_一箭 (zhe-shi-yige-yijian - this is a ...)   add more     一箭 (yijian)
这_是_一个_一箭双 (zhe-shi-yige-yijianshuang)           add more     一箭双 (yijianshuang)
这_是_一个_一箭双雕 (zhe-shi-yige-yijianshuangdiao -    word         *clear temp var
this is a killing two birds with one arrow ...)
这_是_一个_一箭双雕_的 (...yijianshuangdiao-de)          don't exist  的 (de) *clear temp var
这_是_一个_一箭双雕_的_计 (...yijianshuangdiao-de-ji)    don't exist  计 (ji) *clear temp var
Figure 7 Working with idioms in the MMS algorithm
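The three-state lookup of Figure 7 amounts to a prefix table over the wordlist; a sketch under our own naming (the tiny word set is illustrative):

```python
def build_states(words):
    """Map every proper prefix of a word to 'add more' and every full
    word to 'word'; anything absent means 'doesn't exist'."""
    states = {}
    for w in words:
        for k in range(1, len(w)):
            states.setdefault(w[:k], "add more")
    for w in words:
        states[w] = "word"
    return states

def segment(text, states):
    """Extend the current string while the state allows it, remember the
    last position where the state was 'word', and emit that match once
    extension hits 'doesn't exist'."""
    words, start = [], 0
    while start < len(text):
        best = start + 1                  # single-character fallback
        for end in range(start + 1, len(text) + 1):
            state = states.get(text[start:end])
            if state is None:             # "doesn't exist": stop extending
                break
            if state == "word":
                best = end
        words.append(text[start:best])
        start = best
    return words
```

On the idiom sentence of Figure 7, the lookup buffers 一, 一箭, and 一箭双 under "add more" before committing to the full idiom 一箭双雕.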
This segmentation algorithm forms our baseline performance. But it runs into
problems when the greedy algorithm cuts characters from other words, or binds free
morphemes. An example can be found in the figure below.
Original sentence
三十人参加了会议 (Thirty people attended the meeting)
Greedy MMS output
三十_人参_加_了_会议 (Thirty ginseng... the sentence does not make sense)
Correct segmentation
三十_人_参加_了_会议
Figure 8 An example of a common mistake that the greedy segmenter will always make; the region
of ambiguity is 人参 (ginseng) versus 人_参加 (people attended).
Proposed solutions to this common problem include incorporating a frequency model
(in this case, "attended" will have a higher frequency of usage than "ginseng"),
utilizing part-of-speech information with a Hidden Markov Model, and applying heuristic
linguistic rules. We will examine some of these proposed solutions and their
implementations in later sections.
Numerous MMS implementations for segmenting Chinese are freely available. The
version of the code that we downloaded and modified is from MandarinTools.com. The
source code is written in PERL. The original wordlist consists of 23,000 words. We
downloaded alternative wordlists that ranged from 50,000 to 130,000 words. Wordlists
are a critical part of the lexicon-based segmentation process. The coverage of the
wordlist, the definition of a word underlying the wordlist, and the character encoding all
play a determining role in the performance and the granularity of the segmentation. But
when we use a particular wordlist, we also have to keep in mind the part-of-speech
tagging that we do in a later module, so the wordlist used for segmentation has to
agree with the part-of-speech wordlist. The ideal is to have a POS dictionary with very
good coverage, so that the POS dictionary can be used directly during our segmentation process.
MMS has built-in numeral, time, and name identifiers. The numeral and time
segmentation simply utilizes lists of time expressions, measure units, and Chinese
numbers. In the greedy algorithm, if a string is completely formed from elements of
one of the above lists, it is tagged as a time or numeral expression, respectively. During
implementation and modification, we discovered numerous missing characters
on these lists, and so we augmented the lists with our own. Personal name tagging is
different: there is no easy solution to tagging personal names, and the whole topic
deserves a section of its own.
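The list-membership test for numerals and times can be sketched as follows (both character sets here are partial samples we assume for illustration, not the augmented lists we actually use):

```python
NUMERAL_CHARS = set("〇一二三四五六七八九十百千万亿")  # partial sample
TIME_UNITS = set("年月日时分秒")                      # partial sample

def is_numeral(segment):
    """Tag a segment as a numeral if every character is on the numeral list."""
    return bool(segment) and all(ch in NUMERAL_CHARS for ch in segment)

def is_time(segment):
    """Tag a segment as a time expression if every character is a numeral
    or a time unit, and at least one time unit is present."""
    return (bool(segment)
            and any(ch in TIME_UNITS for ch in segment)
            and all(ch in NUMERAL_CHARS or ch in TIME_UNITS for ch in segment))
```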
4.3.3 Personal Names
There are two categories of personal names: Chinese names and phonetic
translations of foreign names. The techniques used in our segmentation tools to detect them
are different, so they are presented in two separate paragraphs.

The identification of Chinese names comes after the recursive loop that does the
initial segmentation. The wordlists do not contain any Chinese names, so presumably
all the Chinese names are presented as unbound morphemes. The process that follows
attempts to identify which of the unbound morphemes look most like a name
sequence, and then merges them. Chinese names have a finite list of surnames to choose
from, and this list of surnames has changed little over recent history. The characters that
can make up given names are not limited, except for some particles that absolutely cannot
be part of a name. But the length of the given name is limited to two morphemes, and the
most frequent length of a whole name is three. Any lexicon-based segmentation program
would first look for a single surname character that is not bound to larger strings. Then,
if it is followed by a title (looked up on a list of possible titles), or followed by one or two
unbound morphemes, or preceded by a name predicate, the string is most likely a
person. This simple algorithm is coded into Erik Peterson's MMS script. As an extra
layer of precaution, the name identification subroutine also looks at a list of characters
that frequently appear unbound but rarely appear in names. This list includes a
lot of particles, interjections, and prepositions.
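A sketch of the surname-driven merge described above (both lists are tiny samples standing in for the real wordlists, and the title-list and name-predicate checks are omitted for brevity):

```python
SURNAMES = {"王", "李", "张", "刘", "陈"}      # sample of the surname list
NEVER_IN_NAMES = {"的", "了", "和", "在"}      # particles that cannot be in names

def merge_names(tokens):
    """After initial segmentation, merge an unbound surname with the
    one or two unbound single characters that follow it into one
    personal-name token."""
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in SURNAMES:
            given, j = [], i + 1
            while (j < len(tokens) and len(given) < 2
                   and len(tokens[j]) == 1
                   and tokens[j] not in NEVER_IN_NAMES):
                given.append(tokens[j])
                j += 1
            if given:                      # surname + 1-2 unbound characters
                out.append(tok + "".join(given))
                i = j
                continue
        out.append(tok)
        i += 1
    return out
```

A production version would also consult the title list and name predicates mentioned above before committing to the merge.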
The segment-then-merge approach to tagging names runs into problems when any of
the name characters can be bound to neighboring morphemes. An example is embedded
in the following sentence:

二中的王立学校长发言 (erzhong de wanglixue xiaozhang fayan - chancellor Wang
Lixue of No. 2 Middle School speaks)

王立学 (wanglixue) is the name of the chancellor. But when the name is put together
with the title 校长 (xiaozhang - chancellor), we can form the new word 学校 (xuexiao -
school). A literature search revealed extensive study of this problem. One proposed
technique uses frequency models.38 When an unbound surname morpheme is found (the
study did not mention the case where the surname is bound), the two morphemes following
the surname are examined. A list of character frequencies built from a very large bank of
Chinese names is used to determine whether these two morphemes are likely to be part of the
given name. The likelihood of the whole name string is balanced against the probability of
the morphemes appearing alone (this includes the surname) and the probability of the
leftover morpheme appearing unbound. The general conclusion of the study is that
over 90% of the time, given-name characters are unbound. In the ambiguous cases,
the second morpheme is likely to be separated from the polysyllabic word and put
together with the surname, while the third morpheme is likely to remain in the polysyllabic
word. Another problem with name identification is abbreviation. In some literature, the
given name frequently appears on its own. This is not the case for newswire,
however, as most Chinese names are written out in full. In some editorials, the full name
may be used once, and then only the given name is used later in the article as a sign of
familiarity. Segmentation tools have a very hard time catching this. The post-segmenter
processing module for Alembic is capable of remembering earlier mentions of the
full name, deciding whether later strings contain characters that
suspiciously match those of earlier mentions, and tagging them as personal names.
Handling foreign names is different from handling Chinese names. Foreign names have
no finite surname list and no length restriction. Fortunately, foreign names are usually
translations that derive their phonetics from a finite set of Chinese characters, ranging
from widely used characters to characters that specifically mimic a sound. The
baseline approach to tagging a foreign name is to find a string of these foreign-phonetic
characters that cannot be placed in polysyllabic words with neighboring characters, and to
put them together to form a foreign name.39 This is done in a way similar to the algorithm
used to tag numerals and times. There is still the problem of telling a foreign person name
from a foreign location name; this is currently an unresolved problem. However, the majority of
38 Large Corpus Based Methods for Chinese Personal Name Recognition, Journal of Chinese Information
Processing, 6(3):7-15
39 Erik Peterson, 1996
the locations are followed by a location identifier, so the LISP post-processor is able to
capture false tags and convert them into the correct ones.
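The run-collection step for foreign names, analogous to the numeral and time tagger, might look like this (the phonetic character set is a tiny illustrative sample of the real list):

```python
FOREIGN_PHONETIC = set("克林顿斯巴尔姆伦敦")   # tiny sample of the phonetic list

def merge_foreign_names(tokens):
    """Merge maximal runs of two or more unbound single characters that
    all come from the foreign-phonetic list into one candidate name."""
    out, i = [], 0
    while i < len(tokens):
        j = i
        while (j < len(tokens) and len(tokens[j]) == 1
               and tokens[j] in FOREIGN_PHONETIC):
            j += 1
        if j - i >= 2:                     # at least two phonetic characters
            out.append("".join(tokens[i:j]))
            i = j
        else:
            out.append(tokens[i])
            i += 1
    return out
```

Whether the merged run is a person or a location is left to the LISP post-processor, as described above.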
Foreign phonetics    Surname    Non-name    Numerals    Classifiers
(sample characters for each supporting list)
Table 4 Sample list of the support wordlists used for segmentation. These are universal to the MMS,
Unigram, and Rule based systems.
4.3.4 Heuristic Rule Based System
The previous sections on the maximum matching algorithm described the simplest
implementation of a Chinese segmenter. But towards the end of 4.3.2, we gave
an example of an ambiguous case where MMS will fail every time. Without a look-ahead
model, those ambiguities cannot be resolved. Prior research with look-ahead models has
ranged from a simple two-word look-ahead to hidden Markov models over the whole
sentence. In the end, the cost associated with building a very complex system, namely
speed, becomes prohibitive with only a very small gain in performance. Once
performance has reached a certain level, the solution provided by a complex system for
one type of ambiguity will usually create another. So what is the balance point between
performance and speed? A number of research facilities have settled on three-word
look-ahead models. A paper published by Chen and Liu in 1992 detailed one such
segmentation algorithm. It is simple enough that processing time is not dramatically
more than that of simple MMS. It incorporates six rules to determine what is the best
segmentation. Five of those rules were applied in our processing module. The rules are
listed below; their rank of importance goes from A to E.

A. The most plausible segmentation is the three-word sequence with the maximal length.
With a simple look-ahead, find the three word segments that, together, achieve
the longest length.
ABC_D_EFG (correct)
A_BC_DEF (incorrect)

B. When there is more than one way to choose three word segments of equal
length, pick the chunk with the smallest standard deviation of word length.
AB_CD_EF (correct)
AB_C_DEF (incorrect)

C. Normal words get higher priority than determinative-measure compounds. So if
AB_CD_E and AB_C_DE are the choices, and C is a measure compound, then
pick AB_CD_E.

D. Pick the chunk with the high-frequency monosyllabic words.
ABC_DE_F (F is low frequency)
A_BCD_EF (A is high frequency, so pick this one)

E. Pick the chunk with the highest probability based on its part-of-speech
information.40
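Rules A and B reduce to a two-part sort key over the candidate three-word chunks; a sketch (candidate generation is omitted, and the function name is ours):

```python
import statistics

def pick_chunk(candidates):
    """Rule A: maximize total chunk length; Rule B: break ties with the
    smallest population standard deviation of the three word lengths."""
    def key(chunk):
        lengths = [len(w) for w in chunk]
        return (-sum(lengths), statistics.pstdev(lengths))
    return min(candidates, key=key)
```

pick_chunk([("ABC", "D", "EFG"), ("AB", "CD", "EF")]) chooses the first chunk by rule A; pick_chunk([("AB", "CD", "EF"), ("AB", "C", "DEF")]) chooses the first by rule B.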
There is an implementation of this method available off the web. The code was
developed by Tsai for his Master's degree at the University of Illinois. It was written in C
and used BIG5 input and output, whereas our system uses GB character encoding. So a PERL
script was written to take GB input, convert it to BIG5, pass it to the segmenter,
and convert the BIG5 output back into GB. Tsai's code was tested against simple
maximum string matching and, surprisingly, did not perform much better. While it
resolved some ambiguities, it also created new ones. It also lacked good time,
numeral, and name algorithms. This problem was overcome by building time, numeral,
and name subroutines, similar to the ones found in MMS, into the PERL shell script.
This time, the performance was better than that of MMS, but the output still contained a
fair amount of errors. Observing the two preliminary results led to a proposal to add a
statistical model on top of rule A. We proposed that rule A combined with a simple
statistical model might be sufficient to overcome some of the problems, so that we need
only implement rules B and C on top of it to get a significant performance improvement.
40 Chen and Liu, 1992 COLING
4.3.5 Unigram Frequency Model
The third segmenter fields a statistical rule in place of the maximum-length rule of the
second segmenter. The rest of the rules are similar. This modification is aimed at
resolving some errors made by the first rule, specifically with uncommon words
that might have a longer length than common words.

A. We still find all the possible maximum string matchings for three-word
segments. For example, for the string ABCDEFG, we might find the following
ways to segment it into three-word character sets:
1. AB_CD_EF
2. A_BCD_EF
3. AB_CDE_FG
Under the original rule, the third segmentation string would be picked because it is the
longest. In the new method, we observe the number corresponding to the
frequency of occurrence of each word; call that number α. We also
substitute a number r for the length of the word; r is a constant for each word
length that increases non-linearly with the word length. Each Chinese word is then
assigned a weight combining r and α, where a set factor φ determines how much we
wish to weigh the frequency factor. For each possible three-word string, the
weights of the words are added together and compared. If there are significant
differences, whether due to frequency of occurrence or due to length of strings,
we pick the string with the highest weight. If the difference is not significant, we
resolve the conflict by picking the maximum string. If the difference in
weight is not significant and there are equal-length strings, we resort to the lower-level
rules.
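The exact weighting formula did not survive in our copy, so the sketch below assumes an additive form r(length) + φ·log(1 + α); the r table, φ, and the significance margin are all assumptions made for illustration:

```python
import math

R = {1: 1.0, 2: 3.0, 3: 5.5, 4: 8.5}   # assumed non-linear length constants
PHI = 0.5                               # assumed frequency weighting factor

def weight(word, freq):
    """Weight grows non-linearly with length and with log frequency."""
    return R.get(len(word), 10.0) + PHI * math.log(1 + freq)

def pick_chunk(candidates, freqs, margin=1.0):
    """Pick the three-word chunk whose summed weight clearly wins;
    otherwise fall back to the maximum-length rule."""
    scored = sorted(
        ((sum(weight(w, freqs.get(w, 0)) for w in c), c) for c in candidates),
        reverse=True)
    if len(scored) == 1 or scored[0][0] - scored[1][0] >= margin:
        return scored[0][1]
    return max(candidates, key=lambda c: sum(len(w) for w in c))
```

On the ginseng example of Figure 8, the much higher frequency of 参加 (attend) outweighs the longer match 人参 (ginseng).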
The frequency data for words comes from a list downloaded from the web.41 There are
huge flaws in this design. The first flaw is the way the frequency data was obtained: the
original heuristic-rules segmenter was used to scan 17 million characters, and the
resulting segments were stored and counted to obtain the frequencies. So the frequency
counts are biased towards the mistakes that the rules segmenter systematically
makes, the same mistakes that we are trying to correct with the segmenter that incorporates
frequency data. Secondly, the wordlist used in the original rules segmenter does not
agree with our segmentation guidelines. As a consequence, the wordlist with frequency
data does not match the smaller wordlists that we would like to use for our
segmentation. We feel that the wordlist with frequency data does not follow the "smallest
unit of meaning". Thirdly, the original rules segmenter does not have a personal name
41 http://www.chinesecomputing.com/
identification algorithm, so we suspect that the frequency data contains frequencies
of characters from personal names as well, shifting the balance of segmentation towards
monosyllabic "words". In the end, this trial at building a segmentation script is flawed,
and it will ultimately be discarded for other software with higher performance.
4.3.6 PKU OTS Software
Up till now we have explored different segmentation codes, made modifications, and
implemented our own with combined resources off the web. But these efforts focus only
on segmentation, and in the end we have to do part-of-speech tagging as well. Both
capabilities are found in the OTS software from Beijing University. We do not have the
source code for the PKU segmenter, so we can only speculate on how their segmenter is
implemented and give some examples of where their software performs well, and some
examples of where it fails.

The PKU segmenter takes in an article or sentence without any extraneous tags and
outputs segmented and part-of-speech tagged data. In some cases it performs
rudimentary phrase tagging. We suspect that their segmentation and part of speech are
implemented on two different levels, just like our system: segmentation on the first level,
and part of speech on the second level. With the most basic testing and deduction, we
know their segmentation algorithm incorporates a look-ahead model. In cases where we
fed it sentences with ambiguities, it almost always came back with the more common word
as the segmentation result, so we suspect that they use a frequency model to decide
between conflicting words. They also have a fairly powerful personal name tagger.
Behind this name tagger is the same surname-given name algorithm we discussed in
section 4.3.3, but it is more complex because there may also be a frequency
model built into it. The following table illustrates some of the sample data that led us
to these conclusions.
Figure 9 Input and output of the PKU segmentation & POS processor. The input/output pairs
illustrate four deductions: there is look-ahead capability; frequency of occurrence is used when
disambiguating conflicts; there are extensive name tagging capabilities; and a frequency model is
built into the name tagging system.
The inner workings of the part-of-speech tagging in the PKU software are harder to
deduce from the output. However, we believe that it is also lexicon based and has a
frequency model. The part-of-speech assignments make systematic mistakes, such as
turning nouns into verbs or verbs into nouns, which we plan to correct using a LISP
rule-based module.
The PKU software also attempts basic-level named entity tagging, although the
named entity tags in the output are more of a distraction than anything we can use to
satisfy the named entity task. The way they do it is very rudimentary: a large list of
organizations and locations taken from the manually tagged People's Daily data is checked
against all the input. For example, the software would be able to catch 美国国防部
(meiguoguofangbu - US Department of Defense) and tag it as one organization, but
would tag 澳大利亚国防部 (Australian Department of Defense) separately as a location
and a noun. We build a LISP-level NE task system based only on part-of-speech
information and segmentation, and will not rely on their phraser.
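A toy illustration of a POS-driven NE rule of this kind (the keyword list and the merge condition are our own simplifications, not the actual LISP rules):

```python
ORG_KEYWORDS = {"国防部", "政府", "银行"}   # sample organization keywords

def tag_organizations(tagged):
    """tagged: list of (word, pos) pairs. A place name (ns) directly
    followed by an organization keyword merges into one ORGANIZATION
    phrase, regardless of whether the pair appears on any list."""
    out, i = [], 0
    while i < len(tagged):
        word, pos = tagged[i]
        if (pos == "ns" and i + 1 < len(tagged)
                and tagged[i + 1][0] in ORG_KEYWORDS):
            out.append((word + tagged[i + 1][0], "ORGANIZATION"))
            i += 2
        else:
            out.append((word, pos))
            i += 1
    return out
```

This is exactly the pattern that the PKU list lookup misses for 澳大利亚国防部 above.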
As a final note in describing the PKU software, it has different input/output
requirements than the Alembic system. In order to integrate this software into our system, a
shell program written in PERL was built. This script takes sentence-tagged articles,
passes each individual sentence to the PKU segmenter, then takes the output from the PKU
segmenter and formats it into the LEX form that Alembic can read and use. Currently
such a system is implemented only for their web-based demo, where we pass individual
sentences to their demo site and retrieve the information we want from the resulting HTML
page. We have not yet obtained their software license, so we don't have their source
code or the actual program running on our computer; thus our testing has flaws. One of
the biggest problems is the word limitation built into their web demo: we cannot input more
than 50 characters at a time. This breaks up some longer sentences at the
50-character mark, which can potentially throw off their segmentation process. The main
reason we have tested their software extensively is that it uses the same part-of-speech
tags as their freely available tagged and segmented People's Daily corpus.
This large corpus provides valuable training for the phrasing task, the LISP segmentation
module, and the POS module. Since the NE phrasing task is done with the assumption that
the part-of-speech information will follow the same format as the output of the PKU
segmenter, we are very keen on obtaining their software and making it a critical or
optional part of our system.
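The formatting half of that shell script can be sketched as follows (sketched in Python for brevity; the real script is PERL and also handles the web round trip and the 50-character limit, and the sample tokens are illustrative):

```python
def pku_to_lex(pku_line):
    """Convert a PKU-style line of word/pos tokens into the <LEX> form
    that Alembic reads, wrapped in a sentence tag."""
    lexes = []
    for token in pku_line.split():
        word, _, pos = token.rpartition("/")
        lexes.append('<LEX POS="%s">%s</LEX>' % (pos, word))
    return "<S>%s</S>" % "".join(lexes)
```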
4.3.7 Rule Based Alembic Segmentation Software
One more avenue that we have explored briefly is a rule-based segmentation module
based on LISP. It is peculiar because in this system we are trying to merge characters
rather than separate them. All of the sentences are parsed into character sequences,
where each morpheme is treated like a word, before being passed on to the segmentation
module. The segmentation module then uses a large wordlist to merge characters
together to form words. This process was seriously explored only as a machine-trained rule
sequence. Training took place with 200,000 words from People's Daily, and after
learning a sequence of 160 rules, the performance level was 83% on the training corpus and
80% on a smaller testing corpus. Because the performance level was so low, we did not
explore it further. But with the licensing of a larger and more comprehensive wordlist
with POS information, we may be able to go back to this original concept and attempt it
again. More will be presented in the discussion section.
4.3.8 LISP Part of Speech Tagging
We presented in sections 4.3.2 through 4.3.5 how we approached the segmentation
problem independently of the part of speech problem. In this section, the LISP-based
POS tagging solution is presented in finer detail. We decided to use the Alembic rules
environment to complete the POS tagging because: (1) it is a flexible environment, in
which we can easily change, add, and delete wordlists without changing much of the hard
code; and (2) it is trainable, so we can automatically generate rules by giving the
machine all the resources that a human rule-writer would have and letting it run over
the large training corpus.
For the benefit of training, we adopted the same set of tags as those provided by
Beijing University. The first step is to develop a POS wordlist to aid training and
tagging. This is done by collecting all the words from the People's Daily data and
separating them into files categorized by their part of speech. One word may appear in
several files, since it may be used in several POS positions. The part of speech
information derived from the People's Daily data is presented in the following figure:
[Table data: for each PKU POS tag — Ag, Mg, vd, nr, a, ns, nt, b, c, d, vn, e, ad, nx,
f, h, nz, i, Bg, j, k, Dg, l, n, an, o, p, q, r, s, t, Ng, u, v, y, Rg, z, Tg, Vg —
the total word count and the breakdown by word length (1, 2, 3, and 4 characters, and
other).]
Figure 10 PKU POS tags and the number of words associated with each POS. The character
columns represent the number of characters in each word.
There are a total of 50,000 words. This covers most of the common words used in
Chinese. However, it is not the complete list; the complete list would have close to
double the word coverage. Licensing for the complete list is still being negotiated.
In the meantime, the derived partial list is used to train the POS tagger.
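The wordlist derivation described above can be sketched as follows, assuming the PKU word/POS token format (e.g. 学生/n); the sample corpus lines are invented, and the real script writes each POS set out to its own file.

```python
# Sketch of deriving per-POS wordlists from a segmented, tagged corpus in
# the PKU "word/pos" format. A word may land in several lists, since it
# may be used in several POS positions.
from collections import defaultdict

def build_pos_wordlists(lines):
    wordlists = defaultdict(set)          # POS tag -> set of words
    for line in lines:
        for token in line.split():
            word, _, pos = token.rpartition("/")
            if word:                      # skip malformed tokens
                wordlists[pos].add(word)
    return wordlists

corpus = ["我们/r 是/v 学生/n", "学生/n 学习/v"]  # invented sample lines
lists = build_pos_wordlists(corpus)
print(sorted(lists["n"]))  # words seen with the noun tag
```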
In training, we used the People's Daily data. Purists would argue that we should not
use the same data from which the wordlists are derived. From another point of view, we
are trying to tailor the rules for the situation where we have a complete list instead
of just a partial one. Theoretically, the complete lists should have very good
coverage, so using the derived list to aid in tagging the corpus from which it was
derived is a time-saving measure. When we obtain the license for a good Chinese part
of speech dictionary, we will not have to learn the rules over again, because we will
have much better coverage over a broad corpus selection. More about the part of speech
results will be presented in the results section.
4.4
Phrasing system in Alembic and OTS software
Our efforts for segmentation have been comprehensive. We have researched and tested
several systems that we might consider as part of the final product delivery for
ACE Chinese. However, we are still evaluating our options, and have built a phrasing
system fairly independent of the segmentation process. Our segmentation and part of
speech tagging efforts largely utilized Beijing University's hand-annotated People's
Daily data [42]. Thus, the output from the segmentation and part-of-speech effort uses
the same set of part of speech tags as Beijing University [43], and follows the same
segmentation guidelines. Our phrase training will be done with the same hand-annotated
data, and be based on the PKU POS tag set.
Ideally, we would have completed the segmentation and part-of-speech exploration before
going on to phrasing. That way, the results from the phrasing module would present the
complete system results up to that step of the process. But time and resource
constraints meant that we had to pursue phrasing and segmentation/POS tagging in
parallel, so the phrase training and results are all based on a perfectly segmented and
tagged corpus. To get a sense of the final results up to the NE task, we would have to
multiply the final score of
[42] See Appendix E1
[43] See Figure 5
the segmenter and POS tagger by the final score from the phrase task. There are also
benefits to building a system this way. As we explored NE phrasing, we developed a
clearer picture of the patterns underlying name, organization, and location phrases,
unclouded by systematic mistakes generated by the segmenter and POS tagger.
4.4.2
Training
We took the People's Daily data (one large file), converted it into a format the
Alembic phrase rule learner can read, and saved it into a database. The total database
is around 3,000 files, or 3,000 articles. Each file was named by the date of the
article, the section number to which it belonged, and the article number within the
section. Then we concatenated the 3,000 files into 10 large files and 10 small files,
used for training and testing, respectively. To run the trainer, the Alembic system is
first loaded into a LISP interface. Then we load all the resources that the Alembic
system is asked to use when learning, such as wordlists and lists of tags. Finally, we
invoke the learning process at the LISP interface.
The first training pass was cold training: all the system was given was the
part-of-speech tagged corpus. The machine had nothing else to work with. In general,
the rules it developed looked at the part-of-speech tags and the suffixes of Chinese
words as indicators of names, organizations, and locations. In the second pass of
training, we gave the machine wordlists to work with, including surnames, POS
wordlists, foreign characters, location names, and other lists. This time the system
performed better. However, it clearly did not know how to use the full potential of
the wordlists provided to it. In the third pass, we separated personal names,
organizations, and locations, and trained the machine on each one individually, to see
which area needs the most work to achieve better overall scores. Results from each of
these passes and the section conclusions can be found in the results section.
Supplementary wordlists | Description | Origin
Capitals.txt | List of national capitals | Translation of English wordlist
Classifiers.txt | List of Chinese classifiers | Manually compiled by linguist
foreign1.txt & foreign2.txt | List of Chinese characters used for phonetic translation | Compiled from MMS, MET2, and BBN data
Locations.txt | List of Chinese and foreign locations | Compiled from MET2 and BBN data
Nations.txt | List of nations | Translations of English wordlist
Organizations.txt | List of organizations | Compiled from MET2 and BBN data
Surname.txt | List of surnames | Compiled from MMS and People's Daily
Title.txt and Title2.txt | List of titles in Chinese | Manually compiled by linguist
usstates.txt | List of US States | Translations of English wordlist
zh-provinces.txt | List of Chinese provinces | Manually compiled by linguist
Figure 11 List of the supplementary wordlists used for machine training. See Appendix
B5 for translation code.
4.4.3
Testing, scoring, and error tracking
The Alembic system works with rules. Each rule is applied over the whole corpus, and
no rule is ever repeated. The general trend is to go from broad rules to narrower
rules. The rule learner functions much the same way: it has a list of patterns that it
looks for, and for each rule it generates, it attempts to achieve the maximum gain in
performance out of the list of rules it knows how to generate. It has no
ability to anticipate future rules. For example, sometimes it is more beneficial to tag the
specific things first, so that less general rules can be written later to avoid
over-tagging. The training system almost always produces a smooth curve going from
broad rules to specific rules. Eventually it reaches a quasi-plateau, where the rules
it generates are so specific to the training corpus that they cannot advance
performance on the testing corpus any more. We have graphed one training process below
to show this curve. An example of the learned rules can be found in Appendix B2.
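The greedy, no-lookahead behavior of the rule learner can be sketched as follows. The rule representation and the scoring callables are simplified assumptions; the real learner works over Alembic pattern rules.

```python
# A minimal sketch of the greedy rule-learning loop: at each step the
# learner scores every candidate rule on the current corpus state, keeps
# the single rule with the best immediate gain, and repeats. It has no
# lookahead, which is why training eventually plateaus on overly specific
# rules. (Rules, apply_rule, and score are simplified assumptions.)

def learn_rule_sequence(candidates, apply_rule, score, corpus, min_gain=1):
    learned = []
    current = score(corpus)
    while True:
        best_rule, best_corpus, best_score = None, None, current
        for rule in candidates:
            trial = apply_rule(corpus, rule)   # apply_rule must be pure
            s = score(trial)
            if s > best_score:
                best_rule, best_corpus, best_score = rule, trial, s
        if best_rule is None or best_score - current < min_gain:
            break                              # quasi-plateau: stop learning
        learned.append(best_rule)
        corpus, current = best_corpus, best_score
    return learned

# Toy example: "rules" are numbers added to a set, scored by their sum.
print(learn_rule_sequence([1, 2, 3], lambda c, r: c | {r}, sum, frozenset()))
# → [3, 2, 1]: the biggest immediate gain is always taken first
```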
[Graph: sample training results for the NE-organization task; f-measure (0 to 70)
plotted against rule number.]
Figure 12 This shows an example of the training for organization tagging. Upper line represents
performance on training data. Lower line represents performance on testing data.
The measurement we used for scoring is a unit called the f-measure. Here, the
f-measure is the average of precision and recall. Recall is the percentage of correct
phrases that we tagged out of all the correct phrases. Precision is the percentage of
correctly tagged phrases out of all the tags we used. We can have a very high recall
because we tagged almost all the phrases, but still get a very low f-measure because
the phraser did a lot of extra flagging and the precision is down. The goal is to
achieve both high precision and high recall: produce a lot of tags and make every tag
count. The f-measure is what the machine looks at when learning.
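The scoring arithmetic can be written out directly from the definitions above (recall = Corr/Key, precision = Corr/Ans, f-measure = (R+P)/2), using one row of counts from the sample training log:

```python
# Recall, precision, and the f-measure as defined in this thesis: the
# f-measure here is the simple average (R + P) / 2, not the harmonic mean.

def score(key, ans, corr):
    recall = 100.0 * corr / key if key else 0.0
    precision = 100.0 * corr / ans if ans else 0.0
    return recall, precision, (recall + precision) / 2

# Counts from one row of the sample training log: Key=3665, Ans=2429, Corr=1440
r, p, f = score(3665, 2429, 1440)
print(round(r, 2), round(p, 2), round(f, 2))  # → 39.29 59.28 49.29
```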
There are several scoring programs that we are using. The one built into Alembic is
the same one that the machine learner uses. The second is an external script called
scorify. The difference between the two scorers is the way they treat under-aligned or
over-aligned tags. In general, scorify is more lenient with mis-aligned tags: it will
count some of them as correct phrases if the tags caught most of the words in the
phrase. The Alembic scorer is strict with mis-aligned tags and will take a few
percentage points off the f-measure when compared with scorify. Scorify also has the
advantage of keeping track of all the phrases, including the missed and spurious tags,
which it prints into a valuable report. The following are three samples from our
training and scoring task. Figure 13 is a log file generated by the training system;
it uses the Alembic scorer to generate the data. Figure 14 shows the score results
generated by scorify; this particular sample is only concerned with organization tags.
Figure 15 is a sample report from the scorify program.
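The difference between the two scorers can be sketched as follows; spans are (start, end) word offsets, and the 50% overlap threshold for the lenient scorer is an assumption for illustration.

```python
# Strict vs. lenient phrase scoring: the strict (Alembic-style) count
# requires an exact span match, while the lenient (scorify-style) count
# also credits a tag that covers most of a key phrase. (The >50% overlap
# threshold is an illustrative assumption.)

def overlap(a, b):
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def count_correct(key_spans, answer_spans, lenient=False):
    correct = 0
    for k in key_spans:
        if k in answer_spans:
            correct += 1
        elif lenient and any(
            overlap(k, a) > (k[1] - k[0]) / 2 for a in answer_spans
        ):
            correct += 1        # mis-aligned, but covers most of the phrase
    return correct

key = [(0, 4), (10, 13)]
ans = [(0, 4), (10, 12)]        # second tag is under-aligned
print(count_correct(key, ans))                # strict: 1
print(count_correct(key, ans, lenient=True))  # lenient: 2
```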
Applying learned rules to TRAINING SET sentences.

[Log table: for each rule number 0 through 10, the counts Key, Ans, Corr, Miss, and
Sprr, with recall, precision, f-measure, and the number of times each rule fired; the
f-measure climbs from 0.0 at rule 0 to roughly 49 by rule 10.]

No.: Rule number
Key: Total number of phrases in the corpus
Ans: Total number of tags generated by the rules
Corr: Number of correctly tagged phrases
Miss: Number of phrases missed out of all the phrases
Sprr: Number of spurious tags
R(%): Recall - Corr/Key
P(%): Precision - Corr/Ans
F-measure: (R+P)/2
Fired: Number of times the rule is applied
Figure 13 Sample log file from training. It shows the increase in performance, the
number of tags applied per rule, the number of correctly tagged phrases, recall, and
precision. This information is very valuable when used with the rule sequence to trace
what each rule did.
SUMMARY SCORES

[Summary-score table: for each subtask (enamex: organization, person, location, other),
the counts POS, ACT, COR, PAR, INC, MIS, SPU, and NON, and the percentages REC, PRE,
UND, OVG, SUB, and ERR, followed by an ALL SLOTS total row.]

ALL SLOTS F-MEASURES: P&R 74.47, 2P&R 71.19, P&2R 78.06
Under the subtask scores for enamex, organization:
POS: Number of real phrases in the corpus
ACT: Number of tags that the rules generated
COR: Number of correctly tagged phrases
MIS: Number of missed phrases
SPU: Number of spurious phrases
Figure 14 Sample output from the scorify program: *.scores file output.
Document 1998010103009

[Report excerpt: each line lists an ENAMEX ORGANIZATION tag with its TAG and RSPTYPE
status (cor = correct, mis = missed, spu = spurious), together with the KEYTYPE,
KEYTEXT, and RSPTEXT fields; the Chinese key and response text is omitted.]
Figure 15 Sample output from the scorify program: *.report-summary file output.
The information in figure 15 is very valuable in helping us generate manual rules.
More about the process of manually generating rules and testing them is presented below.
4.4.4
Compiling the frequency data
A detailed list of the results for the machine-learned phrase tagging system will be
presented in a later chapter. To summarize, the machine was unable to achieve very
high performance when tagging organizations, so we had to experiment with manually
writing rules for them. A sample rule sequence is presented in Appendix B3.
The systematic approach to writing good rules for tagging organizations is to
find the most frequent patterns and write rules based on those frequently occurring
patterns. A script was written to collect all organizations and their part-of-speech
information. Then the part-of-speech patterns that had more than 50 hits were recorded.
The sample results are presented below.
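The frequency-collection script can be sketched as follows; organizations are represented as lists of (word, POS) pairs, and the sample data is invented while the 50-hit threshold follows the description above.

```python
# Collect the POS-tag pattern of every annotated organization and keep the
# patterns with more than 50 occurrences, in the "ns_n_"-style notation
# used in Figure 17.
from collections import Counter

def frequent_patterns(organizations, min_hits=50):
    counts = Counter(
        "_".join(pos for _, pos in org) + "_" for org in organizations
    )
    return {p: n for p, n in counts.items() if n > min_hits}

# Invented sample: 60 city+noun organizations, 10 single-word ones
orgs = [[("北京", "ns"), ("大学", "n")]] * 60 + [[("国务院", "nt")]] * 10
print(frequent_patterns(orgs))  # → {'ns_n_': 60}
```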
[Sample entries: each organization is recorded as a sequence of <LEX POS="...">...</LEX>
tokens, with POS tags such as n, nt, ns, nz, j, vn, and spos; the Chinese text inside
the tags is omitted.]
Figure 16 Sample of the organizations collected by scanning through the People's Daily data.
[Frequency table: the part-of-speech patterns for organizations with more than 50 hits
each (e.g. nz_n_, n_n_, ns_n_n_n_, ns_ns_n_, ns_j_, n_n_n_, nt_j_, ns_n_vn_n_, ns_n_,
nt_n_, ns_nt_, ns_vn_n_, ns_nz_n_n_, nt_, ns_nz_n_, ns_n_n_, n_j_, nz_n_n_), with
counts ranging from 57 to 2487.]
Figure 17 The most frequently occurring patterns for organizations.
4.4.5
Generating wordlists to aid tagging sequences
In the first and second passes of machine training, we did not achieve the desired
results for the NE task; the machine learning sequence was still geared towards
English. However, even though the machine could not achieve the results we were aiming
for, it was already noticing some particular traits of Chinese, such as the suffixes
that denote organizations. This "knowledge" is not applied in a generalized way, so it
is unable to go above a certain level of performance. We sought to take advantage of
these "islands" and expand upon them to tag organizations, and to achieve higher-level
results with locations. This effort resulted in a couple of key lists that are used
for manually generated rules. These are the base suffix lists for organizations and
locations, collected from numerous manually tagged corpora such as MET2 [44] and PKU's
People's Daily, and a list of frequent words that contain the organization suffix but
are actually not organizations. The addition of these wordlists resulted in marked
improvements for the manual rules. In fact, each additional entry into the wordlists
results in a slight but noticeable improvement at this early stage. Below are sample
entries from the lists.
Locations (translation, example): Harbor (Hong Kong); Island (Hainan Island); River
(Yangtze River); Lake (Kunming Lake); River (Panama Canal); Stream; Village; Gorge
(Three Gorges)

Organizations (translation): Company; Department; Central; Agency; Organization;
Group; Department; Government

(Chinese forms omitted.)
Figure 18 Sample list of Organization and Location names.
[44] See Appendix E3
Non-location (Chinese forms omitted) | Translation: Officer; Society; Meeting

Figure 19 A list of frequently occurring words that include the organization suffix
but are actually not organizations.

4.4.6
Manually Generated rules
Our manual rule generation focused on the poorest performing part of the NE task. It
seeks to utilize all the information and resources that we have collected in the most
effective way. Writing rules is an iterative process. First we write the rules. Then
the rules are run over our test corpus. Next, we score the test corpus and check both
our results and the report-summary, with its missed and mis-aligned tag report. The
report is carefully read through, and new items are added to the wordlists and new
rules are written.
There are two ways to approach tagging. One way is to identify islands within the
phrases, and then expand left and right accordingly. The other way is to find the POS
tags or words around the organization or location phrase that provide good hints. We
achieved our current results by mainly operating under the island principle. Future work
would probably require using both approaches.
The first rule for tagging organizations captures all the words that end with an
organization suffix. This provides the "islands" off of which phrases can be expanded.
Then we write a rule to exclude those words that definitely cannot be organizations but
were tagged because they contain the organization morpheme. Next we use the frequency
data with part of speech information, going from the most specific POS tag patterns to
the most general. Finally we remove all the "islands" that did not fit any of the
organization POS patterns. The rules written for tagging organizations are attached in
Appendix B3.
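The island-based rule sequence can be sketched as follows. The suffix list, exception list, and allowed left-context POS tags are invented samples; the real rules are Alembic phrase rules (Appendix B3).

```python
# Island-based organization tagging: (1) mark every word ending in an
# organization suffix as an island, (2) drop islands on the exception
# list, (3) expand each surviving island leftward while the preceding POS
# tags fit a pattern seen in organization names. All lists here are
# invented illustrative samples.

ORG_SUFFIXES = ("公司", "大学", "部")
NOT_ORGS = {"司令部"}                 # has the suffix but is not an org
LEFT_POS = {"ns", "nz", "n", "j"}     # POS tags allowed inside the phrase

def tag_organizations(tokens):
    """tokens: list of (word, pos); returns (start, end) org spans."""
    spans = []
    for i, (word, _) in enumerate(tokens):
        if word.endswith(ORG_SUFFIXES) and word not in NOT_ORGS:
            start = i
            while start > 0 and tokens[start - 1][1] in LEFT_POS:
                start -= 1            # expand the island leftward
            spans.append((start, i + 1))
    return spans

sent = [("昨天", "t"), ("北京", "ns"), ("石油", "n"), ("公司", "n")]
print(tag_organizations(sent))  # → [(1, 4)]
```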
Running the rules required developing a specification file to call on the rules and to
load the necessary wordlists. A sample spec file is attached in Appendix B1.
[Diagram: the specification file connects wordlists, rules, and parameters to the
Alembic interface.]
5. Results
There are two sections of results. The first section is a qualitative analysis of the
segmentation and POS tagging phase. The second section is a more quantitative analysis
of the NE task. The results from phase 1 and 2 are separate from each other.
5.2
Results from phase 1
5.2.1
Sentence tagging
We achieved close to 100% tagging capability for sentences from newswire. A sample of
this output can be found in Appendix A1. One of the reasons for the high performance
is that the People's Daily and Xinhua News Agency data are highly predictable and
structured. The sentence tagging program has not been tested against data with
spurious carriage returns or bad punctuation. Another reason we had such good results
is that we did not make the distinction between headlines and regular sentences. In a
more formal evaluation, headlines would have to be picked out from the rest of the
sentences. This task presents less of a challenge in English, because headlines are
simply capitalized sentences without a terminating punctuation mark. For Chinese,
which has no capitalization, this task is a project by itself. Since we were seeking
to build an end-to-end NE system, not much time could be spent on distinguishing
headlines from regular sentences; thus they are all treated like regular sentences.
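A sentence tagger of the kind described can be sketched as follows, assuming sentences are delimited by the Chinese terminal punctuation marks, so that a headline (which carries no such mark) simply falls through as one more "sentence".

```python
# Split newswire text on Chinese terminal punctuation and wrap each piece
# in a sentence tag. A headline has no terminal punctuation, so it is
# emitted as one more "sentence" at the end. (The tag format and the
# punctuation set are illustrative assumptions.)
import re

TERMINATORS = set("。！？")  # full stop, exclamation, question mark

def tag_sentences(text):
    pieces = re.split("([。！？])", text)   # keep the delimiters
    sentences, buf = [], ""
    for piece in pieces:
        buf += piece
        if piece in TERMINATORS:
            sentences.append("<S>%s</S>" % buf)
            buf = ""
    if buf.strip():                         # trailing headline / fragment
        sentences.append("<S>%s</S>" % buf.strip())
    return sentences

print(tag_sentences("今天天气好。明天呢？"))
# → ['<S>今天天气好。</S>', '<S>明天呢？</S>']
```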
5.2.2
Segmentation
The output from the MMS, heuristic rules, and unigram segmenters can be found in
Appendices A2, A3, and A4, respectively. The simple MMS segmenter downloaded from
MandarinTools.com was converted to output text forms that the Alembic system can read.
It was also outfitted with better handling of numerals, surnames, foreign name
characters, and measure words. The adaptation of Peterson's MMS segmenter performed
surprisingly well,
especially when compared with the preliminary results from the heuristic rules
segmenter with the look-ahead model from Tsai. The biggest drawback of the rules
segmenter is that it did not have a good numeral or personal name identification
algorithm. Many of the mistakes assigned to the heuristic rules segmenter are solved
by running a script on the output of the segmenter to merge names and numerals, with an
algorithm similar to the one found in the modified MMS segmenter. However, even with
the script, the rules segmenter makes some surprising mistakes. Overall, aside from
personal names and numbers, it was able to solve more problems than it created.
Sample segmentation result for: (loose cargo container ships)
Correct segmentation output: (Chinese text omitted)
MMS segmentation output: incorrect, because the match was greedy
Heuristic rules segmentation output: correct, because of the minimum variance rule
Figure 20 Comparison result for the MMS and heuristic rule segmenters. The heuristic
rule segmenter made the correct segmentation judgment.
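The greedy maximum-matching strategy can be sketched as follows, using the standard 研究生命起源 ("study the origin of life") ambiguity rather than the figure's example; the tiny dictionary is an invented sample.

```python
# Greedy maximum matching (MMS): at each position, take the longest
# dictionary word starting there. Greediness is exactly what produces
# wrong cuts like the one in Figure 20.

DICT = {"研究", "研究生", "生命", "命", "起源"}  # invented sample dictionary
MAX_LEN = 3

def mms_segment(text):
    out, i = [], 0
    while i < len(text):
        for n in range(min(MAX_LEN, len(text) - i), 0, -1):
            word = text[i:i + n]
            if n == 1 or word in DICT:   # single chars always fall through
                out.append(word)
                i += n
                break
    return out

print(mms_segment("研究生命起源"))
# greedy pick of "研究生" first → ['研究生', '命', '起源']
# (the intended reading is 研究 / 生命 / 起源)
```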
The implementation of the unigram model proved to be very difficult. The reason is the
frequency list that we are using. Some of the potential problems have already been
discussed in the methods section, mainly the poor match between the wordlist that
contains the frequency data and the wordlist that we are using to build the segmenter.
The result is that the unigram model for segmentation functions just like the heuristic
rules segmenter: the frequencies of the word segments did not vary enough to make a
real difference in most of the segmentation ambiguities. However, we will soon acquire
a much better frequency wordlist from the University of Pennsylvania. This wordlist is
machine compiled and manually edited by linguistic researchers, and mostly agrees with
our segmentation guidelines. Integrating it into the unigram frequency model segmenter
should produce results more in line with the expected improvement.
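The unigram frequency model can be sketched as follows: score each candidate segmentation by the total log-frequency of its words and keep the best. The frequency counts are invented samples; with a poorly matched frequency list the scores barely differ, which is the degeneration described above.

```python
# Unigram scoring of candidate segmentations: prefer the segmentation
# whose words have the highest total log-probability. Counts are invented
# samples; unseen words get a floor count of 1.
import math

FREQ = {"研究": 500, "研究生": 120, "生命": 400, "命": 50, "起源": 80}

def unigram_score(words, total=sum(FREQ.values())):
    # product of unigram probabilities, summed in log space
    return sum(math.log(FREQ.get(w, 1) / total) for w in words)

a = ["研究", "生命", "起源"]
b = ["研究生", "命", "起源"]
best = max([a, b], key=unigram_score)
print(best)  # → ['研究', '生命', '起源']
```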
Even though the theories behind the segmenters we built or modified represent the best
ideas in the field right now, our segmentation results still lag behind those of Beijing
University's segmentation software [45]. The OTS software seems to have a better
lexicon and a more sophisticated way to disambiguate conflicts. It also seems to have
better frequency wordlists, which help most of the time but throw the segmenter off in
some rare instances (see Figure 9 for a specific example). But unless we make
significant improvements to our lexicon and add a good LISP-level correction module
(see discussion), the software of choice is still PKU's segmentation/POS program. This
view is reinforced when we take a closer look at the results from training in part of
speech.
5.2.3
Part of speech tagging
Segmentation cannot be viewed independently of part of speech. Without a solid part of
speech tagging module, our segmentation effort is fruitless once we reach the phrasing
level. We trained a LISP part of speech tagging module over a segmented corpus of the
People's Daily data. The training took around 3 days, and the learner generated close
to 100 rules. The final result was an 83.7% f-measure on the training data and a 79.2%
f-measure on the testing data. This is significant because, at this level of
performance from machine-generated rules, we believe manual correction or manually
written rules would significantly boost performance.
We do not have a quantitative measurement of the performance of the OTS software. From
observation, it made some systematic mistakes. Most of the mistakes are not the result
of mis-segmentation; they result from conflicts between part of speech assignments. An
example of such a mistake is presented in the figure below.
POS tagging result for the OTS software (sentence: "Your criticism is correct")
Correct POS tags: .../r .../u .../vn .../d .../a .../w
Mis-tagged as:    .../r .../u .../v  .../d .../a .../w
(Chinese text omitted; the error is tagging the verbal noun, vn, as a verb, v.)
Figure 21 POS mistakes made by OTS software.
[45] See Appendix A5
Such examples of mistakes are few, though, and are usually due to a missing word in the
part of speech lexicon. The general performance is higher than that of the
machine-learned rules. This will continue to be the case until we have better
resources on which to build better rules.
In conclusion, the results showed that we can build a rudimentary segmentation and part
of speech tagging system based on the resources that we have derived or obtained from
free web sites. The performance of this system is on par with capable software in the
industry right now. Using the same framework, we can boost this performance
substantially if we can obtain better wordlists and manually generate rules for the
part-of-speech tagging module. Despite this confidence, we also acknowledge that the
current system underperforms compared to Beijing University's OTS segmentation and POS
software. Thus, the Beijing University software should be licensed. There will be two
main benefits. First, we will already have state-of-the-art software to complete phase
1 of this project. Second, we will have a very complete part of speech dictionary and
a finely segmented wordlist, which will allow a more complete exploration of our own
approaches to phase 1.
5.3 Results from phase 2
5.3.1
Machine Learned Rules for NE Task
Tags | No. Rules | F-measure | Resources
NR, NT, NS | 50 | 65.27% | None; no wordlists were provided (cold train session)
NR, NT, NS | 80 | 84.91% | Part of speech wordlists derived from People's Daily
NR, NT, NS | 150 | 87.58% | POS wordlists
NR, NT, NS | 70 | 86.98% | POS wordlists, supplement lists, suffix lists
NR | 10 | 99.91% | POS wordlists, surname lists, foreign phonetic characters
NT | 10 | 41.50% | POS wordlists
NT | 50 | 53.80% | POS wordlists
NT | 50 | 57.24% | POS wordlists, org morpheme list, loc morpheme list
NS | 40 | 88.90% | POS wordlists
NS | 50 | 90.50% | POS wordlists
Table 5 The training results for NE task.
The above table shows the main results from the training logs for the NE task. The
first row shows an f-measure of 65.27%; this is training without any resources. All of
the later training runs take advantage of the wordlists derived from the PKU data. The
f-measures shown are in all cases results for the testing data, not the training data.
Except for the first learning session, the training data always has a 3 to 4 percent
advantage over the test data. This is attributed to the fact that the wordlists are
derived from the training data, and that the rules are tailored specifically to boost
performance on the training data.
There are some clear trends in the numbers in the table. We will first focus on the
runs that learned rules for tagging all the named entities at once. In general,
performance goes up as the number of rules increases, even when no additional resources
are added. We can see this in the improvement in f-measure from row 2 to row 3: there
is a 2.67% increase in performance as the number of rules is almost doubled, from 80 to
150. As the machine reaches the limits of its intelligence, each additional rule
results in only a very small increase in performance, and sometimes none at all on the
testing data. Thus we see the clear plateau.

With the addition of new resources such as wordlists, performance goes up even if the
number of rules does not. This is illustrated in the 2.07% increase in performance
from row 2 to row 4: the number of rules decreases from 80 to 70, but this is more than
offset by the addition of useful new resources.
After reviewing the tagged results from the first sequences of learning, we started
training the machine on one tag at a time. Immediately, it became clear that the
encouraging results we had were mostly due to the correct part of speech information
from segmentation. Personal names account for the majority of the NE tags (70%). An
independent training run on personal names achieved a 99.91% f-measure within 10 rules.
But we know personal names are already pre-tagged as NR; the surname and the given name
are separate, but a simple rule that puts them together results in a very high score.
The second most prevalent named entities are locations. They are also pre-tagged in
the segmentation process, and the majority of locations are single words with NS as
their part of speech. When the trainer learns to independently tag NS, the first rule
it generates converts the NS part of speech tag to an NS phrase tag. This immediately
takes its f-measure to the high 70s. Recall is very high at that point, but precision
is fairly low, because it over-flags all the locations that are used inside
organization names. The following rules mainly try to correct the over-flagging. We
can see that the location tagging learner achieved a 90.50% f-measure in 50 rules.
The only named entity task on which the machine cannot "cheat" is organizations. There
are organizations that are only one word, but those are rare, so the learner cannot use
the part of speech information directly. The result is a punishing score on the
organization tagging task (57.24% with 50 rules). A list of morphemes that denote some
kind of organizational structure was generated and given to the learner as part of its
resources. This resulted in only a slight increase in performance; the learner does
not really understand how to use this list to maximum advantage. Manually generated
rules that truly take advantage of the wordlists and patterns can achieve high results
with minimal rule sequences.
5.3.2  Improvements with Addition of Custom Wordlists and Manually Generated Rules for NE Task
Eighteen rules were written for the NE task of organization tagging; see Appendix
B3 for the complete list. The construction of these rules is discussed in the last
section of the methods chapter. They achieved significantly higher results: by the
Alembic scorer, the f-measure for organizations was 73.97%, and scoring with the
program scorify produced an f-measure of 74.47%. The report summary showed all the
correct, missed, and spurious tags. There were many spurious tags, which kept the
f-measure low. However, linguists who examined the spurious tags determined that at
least 30% of them are actually legitimate. This is a case of disagreement over the NE
guidelines with the People's Daily annotated data, which seemed inconsistent in many
respects. The following table shows the kind of organization that the annotators would
tag in one article but ignore in another.
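For reference, the f-measures quoted throughout can be derived from the scorer's correct, missed, and spurious counts. A minimal sketch, with hypothetical counts rather than the actual scorer output:

```python
def f_measure(correct, missed, spurious):
    """Balanced f-measure from a scorer's correct / missed / spurious counts."""
    precision = correct / (correct + spurious)
    recall = correct / (correct + missed)
    return 2 * precision * recall / (precision + recall)

# Spurious tags depress precision, and hence the f-measure, even when
# recall is good; the counts here are hypothetical.
score = f_measure(correct=100, missed=30, spurious=40)   # 20/27, about 0.7407
```

This is why re-judging a portion of the spurious tags as legitimate would raise the reported f-measure noticeably.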
Organizations (the Chinese originals are not reproducible in this electronic copy); translations:
Waste treatment plant
Banyanpu College
Representative Body for Committee on Peace
Beijing Military Unit
Literature Studies Society
Central Military Task Force
Figure 22  A partial list of tags marked as errors by the scorer because they are not tagged as
organizations in the PKU's People's Daily data.
The result shows clearly that we can achieve high performance on NE tasks in
Chinese using the Alembic rules environment with the right wordlists and rule sequences.
Noticeable improvements come with every addition to the still-limited wordlist. To
avoid spurious tagging, a list of non-organizations containing organization morphemes
was generated to aid the rule-writing process.
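The interplay between the organization-morpheme wordlist and the negative list can be sketched as follows. This is an illustrative Python model, not the Alembic rules; the English suffixes stand in for Chinese morphemes and are purely hypothetical.

```python
# Stand-ins for the Chinese organizational-morpheme wordlist and the
# negative list of non-organizations; both lists are hypothetical examples.
ORG_SUFFIXES = ("plant", "college", "society", "committee")
NON_ORGS = {"electoral college"}

def org_candidate(phrase):
    """Flag a phrase ending in an organization morpheme, unless it appears
    on the negative list of known non-organizations."""
    return phrase.endswith(ORG_SUFFIXES) and phrase not in NON_ORGS

org_candidate("waste treatment plant")   # True
org_candidate("electoral college")       # False: suppressed to avoid a spurious tag
```

The negative list is what keeps the morpheme wordlist from over-flagging, mirroring the role it plays in the manually written rules of Appendix B3.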
6. Discussion
6.2  Separation of phase 1 and phase 2
When named entity detection was attempted in the Alembic LISP-based system, a
correctly tagged corpus from the People's Daily was fed into it. This is a point of
controversy, because the corpus is perfectly tagged: how can we know the real
performance of the NE task? The reason perfect data was passed to the Alembic system
for training is a combination of time constraints and the effort to train a pure named
entity tagger, without error correction for segmentation and part-of-speech tagging.
We wanted a modular system. A post-processing module is envisioned to make
corrections to the segmentation and part-of-speech tagging output. Our NE rules will
run after the correction module, so they remain pure to the NE task; whatever error
correction could be done in the LISP rule-based environment would have been done
already. To build a modular NE phraser, the best data to pass it is correctly tagged
data.
So what is the real performance of the system up to this point? There is no
quantitative answer to that question yet; it will be answered as part of our future
work, but it does not belong in this project report.
6.3  Segmentation
Can segmentation be done using a rule-based system? The initial results are not good.
The instinctive answer would be no, because rules are applied over the whole corpus, and
they do not facilitate scanning strings in a localized area. However, there may be a
curious way to implement a rule-based segmenter that would actually use frequency
information and have a strange form of "look ahead".
The corpus to be segmented would be passed to the segmenter completely chunked
into individual morphemes (the same format we are using to pass to the Alembic
segmenter right now). We can arrange the wordlist according to the frequency of words,
starting from the most frequent and ending with the least frequent. We would then have
one rule that goes down the wordlist and scans the corpus for each word. If it finds
words or morphemes that can be merged to form the word from the wordlist, it merges
them. It runs through each word in the list and attempts to find a match, but it only
declares a match if the new word can be formed without tearing an old bond apart. This
way we will always be tagging the most frequently occurring words whenever there are
conflicts in segmentation. We can also try the same segmentation algorithm but arrange
the wordlists according to length. This will be explored in future work.
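Under the stated assumptions (input fully chunked into single morphemes, wordlist ordered by descending frequency), the proposed segmenter can be sketched as follows. This is a hypothetical Python rendering of the idea, not an implemented Alembic rule; Latin letters stand in for Chinese characters.

```python
def segment(morphemes, wordlist_by_freq):
    """Greedy rule-based segmenter sketch: merge adjacent chunks into the
    most frequent words first, never tearing an existing bond apart."""
    chunks = list(morphemes)                       # start fully chunked
    for word in wordlist_by_freq:                  # most frequent first
        i = 0
        while i < len(chunks):
            # try to realize `word` as a run of whole chunks starting at i
            j, built = i, ""
            while j < len(chunks) and len(built) < len(word):
                built += chunks[j]
                j += 1
            if built == word and j - i > 1:
                chunks[i:j] = [word]               # merge; old bonds stay intact
            i += 1
    return chunks

# Hypothetical example, with the wordlist ordered most-frequent-first:
segment(list("abcd"), ["ab", "bc", "cd"])   # -> ["ab", "cd"]
# "ab" merges first; "bc" then cannot form without tearing the "ab"
# bond apart; "cd" merges.
```

The length-ordered variant mentioned above would simply pass a wordlist sorted by word length instead of frequency.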
6.4  Post-process for OTS module
In testing the Beijing University software we noticed some systematic errors that it
makes. These errors can be corrected using a module in Alembic. One of the common
errors is mis-tagging, resulting from oversights when the lexicon was built. One
example of this is presented in figure 21. A rule that can correct this particular
mistake is:
(def-phraser-rule
  :anchor     :lexeme
  :conditions (:wd :pos :|v|)
              (:lwd-1 :pos :|u|)
  :actions    (:set :wd :pos :|vn|))
This is just one example of many errors that we can attempt to correct in Alembic.
Such corrections should boost the performance of our segmentation and POS tagging.
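The correction this kind of rule performs can be restated in Python for clarity. This is an illustrative sketch, not Alembic code, and the example words are hypothetical romanizations.

```python
def correct_v_after_u(tagged):
    """Re-tag a verb (v) as a verbal noun (vn) when it follows a particle (u),
    one systematic error pattern observed in the OTS tagger's output.

    tagged is a list of (word, pos) pairs; returns a corrected copy."""
    out = list(tagged)
    for i in range(1, len(out)):
        if out[i][1] == "v" and out[i - 1][1] == "u":
            out[i] = (out[i][0], "vn")
    return out

correct_v_after_u([("de", "u"), ("fazhan", "v")])
# -> [("de", "u"), ("fazhan", "vn")]
```

A post-processing module would chain many such corrections, one per systematic error found in the OTS output.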
7.  Conclusion
The mass of electronic information grows exponentially as more of the world
becomes wired. Intelligent content extraction software that can help individuals and
organizations make mission critical decisions based on this growing maze of information
is becoming a necessity. A part of this effort focuses on building a multilingual
information extraction system. This Master's project is a comprehensive study of the
preliminary requirements to outfitting MITRE's Alembic Natural Language Processing
tool with Chinese Automatic Content Extraction capabilities. It included the study of
segmentation, part-of-speech tagging, and named entity phrasing tasks.
This project has successfully explored the essential tasks of segmentation and part-of-speech
tagging for Chinese. We can draw the following conclusions from our results:
1. Existing segmentation algorithms have been implemented for the pre-Alembic process, based on code written and modified through this project. Results are on par with industry standards.
2. Replacement of the wordlists and resources within the existing segmentation/POS system can result in significant improvements.
3. State-of-the-art software from external sources is shown to work well with the Alembic environment. Should we choose not to develop our own segmenter, licenses for OTS software can be readily obtained.
This project has also successfully explored the Named Entity task. We can draw the
following conclusions from our efforts:
1. High performance can be achieved through a combined effort between segmentation, POS tagging, and rule-based phrasing, despite the unique problems facing Chinese text processing.
2. Manually generated rules and wordlists, based on scientific analysis of named entities in Chinese, can perform significantly better than simple machine-learned rules.
8.  Bibliography, References, and Resources
8.2 Web-based resources
http://www.mitre.org/resources/centers/it/063/alembic.html
http://www.mitre.org/resouces/centers/it/gO63/workbench.html
http://www.icl.pku.edu.cn/nlp-tools/segtagtest.htm
http://www.icl.pku.edu.cn/Introduction/corpustagging.htm
http://www.MandarinTools.com
http://www.geocities.com/hao510/mmseg/
http://www.ldc.upenn.edu/ctb/
http://www.chinesecomputing.com/
8.3 Academic papers and publications
Benjamin K. Tsou, Hing-cheung Ho ..., A Knowledge-based Machine-aided System for
Chinese Text Abstraction, In Proceedings of the COLING 1992, Page 1039-1042
Benjamin L. Chen and Von-Wun Soo, An Acquisition Model for both Choosing and
Resolving Anaphora in Conjoined Mandarin Chinese Sentences, In Proceedings of the
COLING 1992, Page 274-357
Chang, Chao-Huang, and Cheng-Der Chen, A study on integrating Chinese word
segmentation and part-of-speech tagging. Communications of COLIPS, 3(2), December
1993
Chao, Yuen Ren. 1968. A Grammar of Spoken Chinese. University of California Press,
1968
Chen, Aitao, Jianzhang Hu, and Liangjie Xu. 1997. Chinese text retrieval without using
a dictionary. Proceedings of the 20th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval. Philadelphia, Pennsylvania, July
1997
Chu-Ren Huang, Keh-Jiann Chen, A Chinese Corpus for Linguistic Research, In
Proceedings of the COLING 1992, Page 1214-1217
Huang, Chu-Ren; Chen, Keh-Jiann; Chang, Lili; and Chen, Feng-yi. 1997. Segmentation
standard for Chinese natural language processing. International Journal of
Computational Linguistics and Chinese Language Processing, 2(2), 47-62
Jerry Norman, Chinese, Cambridge University Press, 1998, Page 152-180
Jerome L. Packard. The Morphology of Chinese: A Linguistic and Cognitive Approach.
Cambridge University Press, 2000
Jian-Yun Nie, Marie-Louise Hannan, Wanying Jin, Unknown Word Detection and
Segmentation of Chinese using Statistical and Heuristic Knowledge, October 1995
Jones, Russell. 1997. Chinese Names: The Traditions Surrounding the Use of Chinese
Surnames and Personal Names. Pelanduk Publications, Selangor Darul Ehsan, Malaysia,
1997
Jyun-Sheng Chang and Andrew Chang, A Statistical Approach to Machine Aided
Translation of Terminology Banks, In Proceedings of the COLING 1992, Page 921-925
Jyun-Shen Chang, Shun-De Chen, Ying Zheng, Xian Zhong Liu, and Shu-Jin Ke, Large
Corpus Based Methods for Chinese Personal Name Recognition, Journal of Chinese
Information Processing, 6(3):7-15
Karen Sparck Jones, Automatic Summarization: Factors and Directions, In Advances in
Automatic Text Summarization, The MIT Press 1999: Page 1-14
Keh-Jiann Chen, Chinese Sentence Parsing, Presentation Slides for COLING 92.
Keh-Jiann Chen and Shing-Huan Liu, Word Identification for Mandarin Chinese
Sentences, In Proceedings of the COLING 1992, Page 101-107
Kwok, K.L. 1997. Comparing representation of Chinese information retrieval.
Proceedings of the 20th Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, Philadelphia, Pennsylvania, July 1997
Liang-Jyh Wang, Wei-Chuan Li, and Chao-Huang Chang, Recognizing Unregistered
Names for Mandarin Word Identification, In Proceedings of the COLING 1992, Page
1239-1243
Liu, Shing-Huan. 1995. An Automatic Translator between Traditional Chinese and
Simplified Chinese in Unicode. Proceedings of the 7th International Unicode Conference,
San Jose, California, September 1995
Richard Sproat, Nancy Chang, A Stochastic Finite-State Word Segmentation Algorithm
for Chinese, 5 May 1994
Packard, Jerome L., editor. 1997. New Approaches to Chinese Word Formation:
Morphology, Phonology and the Lexicon in Modern and Ancient Chinese. Mouton de
Gruyter, Berlin, 1997
Thomas Emerson, Segmenting Chinese in Unicode, Technical report for Basis
Technology Corporation, In Proceedings of the 16th International Unicode Conference.
Wu, Zimin and Gwyneth Tseng. 1995. ACTS: An automatic Chinese text segmentation
system for full text retrieval. Journal of the American Society for Information Science,
46(2):83-96, March 1995
Yuan Chunfa, Huang Changning and Pan Shimei, Knowledge Acquisition and Chinese
Parsing Based on Corpus, In Proceedings of the COLING 1992, Page 1300-1304
8.4 MITRE reports
David D. Palmer, A Trainable Rule-based Algorithm for Word Segmentation, In
Proceedings of ACL-1997, 1997
David D. Palmer, Tokenisation and Sentence Segmentation, A Handbook for Natural
Language Processing, 1997
David D. Palmer and John D. Burger, Chinese Word Segmentation and Information
Retrieval, 1997
John Aberdeen, John Burger, David Day, Lynette Hirschman, Patricia Robinson, and
Marc Vilain, Description of the Alembic System Used for MUC-6, Released for
publication at the 6th Message Understanding Conference (MUC-6), 1997
Lisa Ferro, Marc Vilain, Alexander Yeh, Learning Transformation Rules to Find
Grammatical Relations, Released to Computational Natural Language Learning
workshop 1999
Marc Vilain, Inferential Information Extraction, Internal Report and Guidelines.
8.5 Operating Manuals and ACE/MUC Guidelines
ACE Team Members and Participants, EDT Metonymy Annotation Guidelines Version
2.1, Draft Released at ACE meeting 2001
ACE Team Members and Participants, Entity Detection and Tracking - Phase 1, ACE
Pilot Study Task Definition, 2000
Ada Brunstein and Lance Ramshaw, Preliminary Guidelines for Relation Detection and
Characterization (RDC), Draft Version 2.2, Released at ACE meeting 2001
Beatrice Santorini, Part-of-Speech Tagging Guidelines for Penn Treebank Project,
University of Pennsylvania, June 1990
Fei Xia, The Segmentation Guidelines for the Penn Chinese Treebank (3.0), University of
Pennsylvania, October 2000
Multilingual Entity Task (MET) Definition for Chinese, 01 March 1996
Shi-Wen Yu, Modern Mandarin Part of Speech Lexicon Guidelines, Beijing University
Institute for Computational Linguistics, 1999
Appendix A Sample Segmentation Results
A1  Sample Sentence Tagged output
<DOC>
<METID> 001 </METID>
<ID> raw/950715/tg/gnxw0718.15 </ID>
<DATE> 1995 07 18 </DATE>
<HL> Vf4-] *t </HL>
<AU> Domestic News Dept. </AU>
<TEXT>
<p>
[The SGML-tagged Chinese sample sentence output is not reproducible in this electronic copy.]
</TEXT>
</DOC>
A2  Sample segmentation output from MMS
<DOC><TXT>
<LEX>001 </LEX>
<LEX>raw/950715/tg/gnxw0718.15</LEX>
<LEX>1995</LEX><LEX>07</LEX><LEX>18</LEX>
<LEX>fT-[J-</LEX><LEX>*-t</LEX>
<LEX>Domestic</LEX><LEX>News</LEX><LEX>Dept.</LEX>
[The remainder of the MMS segmentation sample, SGML-tagged Chinese text with <LEX> and <ENAMEX> markup, is not reproducible in this electronic copy.]
</TXT></DOC>

A3
Sample segmentation output from three word model
<LEX>001</LEX>
<LEX>raw/950715/tg/gnxw0718.15</LEX>
<LEX>1995</LEX><LEX>07</LEX><LEX>18</LEX>
<LEX>Domestic</LEX><LEX>News</LEX><LEX>Dept.</LEX>
[The remainder of the three-word-model segmentation sample, SGML-tagged Chinese text, is not reproducible in this electronic copy.]

A4
Sample segmentation output from Unigram model
<DOC><TXT>
<LEX>001</LEX>
<LEX>raw/950715/tg/gnxw0718.15</LEX>
<LEX>1995</LEX><LEX>07</LEX><LEX>18</LEX>
<LEX>Domestic</LEX><LEX>News</LEX><LEX>Dept.</LEX>
[The remainder of the Unigram-model segmentation sample, SGML-tagged Chinese text, is not reproducible in this electronic copy.]
</TXT></DOC>
A5
Sample segmentation output from PKU segmenter
<DOC>
<METID> 001 </METID>
<ID> raw/950715/tg/gnxw0718.15 </ID>
<DATE> 1995 07 18 </DATE>
<HL> fT[-j-t </HL>
<AU> Domestic News Dept. </AU>
<TEXT>
<p>
[The POS-tagged sample output from the PKU segmenter, Chinese text, is not reproducible in this electronic copy.]
</TEXT>
</DOC>
Appendix B Selected Code
B1  Sample spec file
INCLUDE  general-specs.spec

*.collection                      CHINESE-LDC
*.language                        ZH

########################################################################
###  Preprocess                                                      ###
########################################################################

prelembic.process?                NIL
lisp.phases                       "NE"
ne.language=ZH.passes             "ML-NE"
ne.language=ZH.pass-order         "ML-NE"
ne.language=ZH.pass=ML-NE.rules   rules/danhu-learned-2##rules.lisp
ne.language=ZH.pass=ML-NE.rules   /afs/rcf/project/corporal/chinese/danhu/traininglog/rules-exp-44235.lisp

lisp.phase=NE.language=ZH.out-tags            "NR NT NS"
lisp.phase=NE.language=ZH.tag=NR.start-tag    "<ENAMEX\ TYPE=\"PERSON\">"
lisp.phase=NE.language=ZH.tag=NT.start-tag    "<ENAMEX\ TYPE=\"ORGANIZATION\">"
lisp.phase=NE.language=ZH.tag=NS.start-tag    "<ENAMEX\ TYPE=\"LOCATION\">"
lisp.phase=NE.language=ZH.end-tag             "</ENAMEX>"

B2  Sample learned rules file for phrasing
The following rules were learned for Organization Tagging
;; Rule #1
(def-phraser-rule
:LEXEME
:anchor
(:NOT (:WD
:conditions
(:LWD-1
:actions
(:CREATE
:EVAL #I'PART-OF-SOME-PHRASE?))
:IntI)
:LEX
:WD
:WD
:NT)
Rule #2
(def-phraser-rule
:conditions
(:PHRASE
:PHRASE-LABEL
(:LWD-1 :LEX-SUFFIX
:actions
"4"
:NT)
)
(:SET-LABEL :NT)
(:EXPAND :LEFT-1)
)
Rule #3
;;
(def-phraser-rule
(:PHRASE :PHRASE-LABEL :NT)
:conditions
(:LEFT-1 :P-O-S :InI)
:actions
(:SET-LABEL :NT)
(:EXPAND :LEFT-1)
)
;;
Rule #4
(def-phraser-rule
:LEXEME
:anchor
:conditions
(:NOT (:WD :EVAL #'PART-OF-SOME-PHRASE?))
Author: Daniel Hu
MIT-EECS
Page 84
Development of Part of Speech Tagging and Syntactic Analysis Software for Chinese Text
(:RIGHT-1
:actions
(:CREATE
:LEX-SUFFIX
:WD
:WD
5/29/01
"k")
:NT)
Rule #5
(def-phraser-rule
(:PHRASE :PHRASE-LABEL :NT)
:conditions
(:RIGHT-1 :LEX-SUFFIX "Eli")
:actions
(:SET-LABEL :NT)
(:EXPAND :RIGHT-1)
Rule #6
(def-phraser-rule
:conditions
:actions
:PHRASE-LABEL
:LEX :InsI)
(:SET-LABEL :NT)
(:PHRASE
(:LEFT-1
(:EXPAND
:NT)
:LEFT-1)
Rule #7
(def-phraser-rule
:conditions
(:PHRASE
(:RWD-1
:actions
:PHRASE-LABEL
:NT)
:LEX "ODN&')
(:SET-LABEL :NT)
(:EXPAND :RIGHT-1)
Rule #8
(def-phraser-rule
(:PHRASE :PHRASE-LABEL :NT)
:conditions
(:LWD-2 :LEX-SUFFIX "2')
(:SET-LABEL :NT)
:actions
(:EXPAND
:LEFT-1)
)
Rule #9
(def-phraser-rule
(:PHRASE
:conditions
(:LEFT-1
:actions
:PHRASE-LABEL :NT)
:P-O-S :InsI)
(:SET-LABEL
(:EXPAND
:NT)
:LEFT-1)
Rule #10
(def-phraser-rule
:conditions
(:PHRASE :PHRASE-LABEL
(:RWD-1 :LEX :InrI)
:actions
(:SET-LABEL
:NT)
:IGNORE)
)
;;
Rule
#11
(def-phraser-rule
(:PHRASE :PHRASE-LABEL :NT)
:conditions
(:LWD-1 :LEX :011)
(:SET-LABEL :IGNORE)
:actions
Rule #12
(def-phraser-rule
:anchor
:LEXEME
:conditions
(:NOT (:WD :EVAL #'PART-OF-SOME-PHRASE?))
(:RIGHT-1 :LEX "Y~IA")
Author: Daniel Hu
MIT-EECS
Page 85
Development of Part of Speech Tagging and Syntactic Analysis Software for Chinese Text
:actions
(:CREATE
:WD
:WD
5/29/01
:NT)
Rule #13
(def-phraser-rule
:conditions
(:PHRASE :PHRASE-LABEL :NT)
(:RIGHT-1 :LEX :ZHORGSUFFIX)
:actions
(:SET-LABEL :NT)
(:EXPAND
:RIGHT-1)
Rule #14
(def-phraser-rule
:conditions
:actions
(:PHRASE
(:LEFT-1
:PHRASE-LABEL
(:EXPAND
:LEFT-1)
:NT)
:InzI)
(:SET-LABEL :NT)
:LEX
Rule #15
(def-phraser-rule
:conditions
(:PHRASE
(:LEFT-1
:actions
(:SET-LABEL :NT)
(:EXPAND
:PHRASE-LABEL
:LEX :InsI)
:NT)
:LEFT-1)
Rule #16
(def-phraser-rule
:conditions
:actions
(:PHRASE
:PHRASE-LABEL
(:RIGHT-1
:LEX-SUFFIX
:NT)
"3/")
(:SET-LABEL :NT)
(:EXPAND
:RIGHT-1)
Rule #17
(def-phraser-rule
:conditions
(:PHRASE :PHRASE-LABEL
(:LWD-1
:LEX :IvI)
:actions
(:SET-LABEL
(:CONTRACT
:NT)
:NT)
:LWD-2)
Rule #18
(def-phraser-rule
:PHRASE-LABEL
:conditions
(:PHRASE
:actions
:LEX "i-6±>4")
(:RWD-1
(:SET-LABEL :NT)
(:EXPAND :LEFT-1)
:NT)
Rule #19
(def-phraser-rule
:conditions
(:PHRASE :PHRASE-LABEL :NT)
(:RIGHT-1 :P-O-S :|j|)
:actions
(:SET-LABEL :NT)
(:EXPAND :RIGHT-1))
Rule #20
(def-phraser-rule
:anchor :LEXEME
:conditions
(:NOT (:WD :EVAL #'PART-OF-SOME-PHRASE?))
(:RIGHT-1 :LEX "EDi~")
:actions
(:CREATE :WD :WD :NT))

B3
Manually Written Rules for Organization Tagging
(def-phraser-rule
:anchor :lexeme
:conditions
(:wd :lex-suffix :ZHORGSUFFIX)
(:wd :p-o-s (:|n| :|j| :|l|))
(:NOT (:wd :eval #'part-of-some-phrase?))
:actions
(:create :wd :wd :ORG-HEAD)
:traced? NIL)
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
(:wd-any :lex-suffix :ZHORG_NEGSUFFIX)
:actions
(:set-label :IGNORE))
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
(:wd-any :p-o-s :|n|)
(:left-1 :p-o-s :|n|)
(:left-2 :p-o-s :|nr|)
(:left-3 :p-o-s :|ns|)
:actions
(:expand :left-3)
(:set-label :ORGANIZATION))
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
(:wd-any :p-o-s :|n|)
(:left-1 :p-o-s :|n|)
(:left-2 :p-o-s :|ns|)
(:left-3 :p-o-s :|ns|)
:actions
(:expand :left-3)
(:set-label :ORGANIZATION))
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
(:wd-any :p-o-s :|n|)
(:left-1 :p-o-s :|vn|)
(:left-2 :p-o-s :|n|)
(:left-3 :p-o-s :|n|)
:actions
(:expand :left-3)
(:set-label :ORGANIZATION))
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
(:wd-any :p-o-s :|n|)
(:left-1 :p-o-s :|n|)
(:left-2 :p-o-s :|nz|)
(:left-3 :p-o-s :|ns|)
:actions
(:expand :left-3)
(:set-label :ORGANIZATION))
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
(:wd-any :p-o-s :|n|)
(:left-1 :p-o-s (:|ns| :|nz|))
(:left-2 :p-o-s :|ns|)
:actions
(:expand :left-2)
(:set-label :ORGANIZATION))
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
(:wd-any :p-o-s :|n|)
(:left-1 :p-o-s :|n|)
(:left-2 :p-o-s :|ns|)
:actions
(:expand :left-2)
(:set-label :ORGANIZATION))
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
(:wd-any :p-o-s :|n|)
(:left-1 :p-o-s :|nz|)
(:left-2 :p-o-s :|ns|)
:actions
(:expand :left-2)
(:set-label :ORGANIZATION))
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
(:wd-any :p-o-s :|n|)
(:left-1 :p-o-s (:|vn| :|b|))
(:left-2 :p-o-s :|ns|)
:actions
(:expand :left-2)
(:set-label :ORGANIZATION))
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
(:wd-any :p-o-s :|n|)
(:left-1 :p-o-s :|n|)
(:left-2 :p-o-s :|nz|)
:actions
(:expand :left-2)
(:set-label :ORGANIZATION))
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
(:wd-any :p-o-s :|n|)
(:left-1 :p-o-s :|n|)
(:left-2 :p-o-s :|n|)
(:left-3 :p-o-s (:|n| :|vn|))
:actions
(:expand :left-3)
(:set-label :ORGANIZATION))
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
(:wd-any :p-o-s :|n|)
(:left-1 :p-o-s :|vn|)
(:left-2 :p-o-s :|n|)
(:left-3 :p-o-s :|ns|)
:actions
(:expand :left-3)
(:set-label :ORGANIZATION))
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
(:wd-any :p-o-s :|n|)
(:left-1 :p-o-s :|n|)
(:left-2 :p-o-s :|n|)
:actions
(:expand :left-2)
(:set-label :ORGANIZATION))
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
(:wd-any :p-o-s :|n|)
(:left-1 :p-o-s (:|n| :|nt|))
:actions
(:expand :left-1)
(:set-label :ORGANIZATION))
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
(:wd-any :p-o-s :|n|)
(:left-1 :p-o-s (:|nz| :|ns| :|j|))
:actions
(:expand :left-1)
(:set-label :ORGANIZATION))
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
(:wd-any :p-o-s :|j|)
(:left-1 :p-o-s (:|nt| :|ns| :|n|))
:actions
(:expand :left-1)
(:set-label :ORGANIZATION))
(def-phraser-rule
:anchor :lexeme
:conditions
(:NOT (:wd :eval #'part-of-some-phrase?))
(:wd :p-o-s :|nt|)
:actions
(:create :wd :wd :ORGANIZATION)
:traced? NIL)
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORGANIZATION)
(:left-1 :p-o-s :|ns|)
:actions
(:expand :left-1))
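The hand-written rules above all follow one pattern: anchor on a phrase already labeled ORG-HEAD, inspect the part-of-speech tags of the words to its left, then grow the phrase leftward and relabel the result ORGANIZATION. The following is a toy sketch of that pattern, not Alembic's actual rule engine; the token layout, tag names, and function name here are invented for the illustration:

```python
# Toy illustration of an "expand ORG-HEAD leftward" phraser rule.
# A phrase is a (start, end, label) span over a list of (word, pos) tokens.

def expand_org_head(tokens, phrase, left_pos_allowed=("ns", "nz", "n")):
    """If the token just left of an ORG-HEAD phrase carries an allowed
    POS tag, grow the phrase over it and relabel it ORGANIZATION."""
    start, end, label = phrase
    if label != "ORG-HEAD":
        return phrase                      # rule only fires on ORG-HEAD anchors
    if start > 0 and tokens[start - 1][1] in left_pos_allowed:
        return (start - 1, end, "ORGANIZATION")
    return phrase

tokens = [("Beijing", "ns"), ("University", "n")]  # place name + org head word
phrase = (1, 2, "ORG-HEAD")                        # covers "University" only
print(expand_org_head(tokens, phrase))             # -> (0, 2, 'ORGANIZATION')
```

A real rule set, as above, chains many such conditions (left-1 through left-3) so multi-word modifiers are absorbed in one firing.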
(def-phraser-rule
:anchor :phrase
:conditions
(:phrase :phrase-label :ORG-HEAD)
:actions
(:set-label :IGNORE))

B4
MMS Segmenter
#!/usr/bin/perl
# Read in the lexicon
open(WRDS, "/afs/rcf/project/corporal/chinese/danhu/wordlist.txt") or die "Can't open wordlist\n";
while (<WRDS>) {
    chomp;
    next if /^\#/;
    $cwords{$_} = 1;
    if (length($_) == 6) {
        if (!exists($cwords{substr($_, 0, 4)})) {
            $cwords{substr($_, 0, 4)} = 2;
        }
    }
    if (length($_) == 8) {
        if (!exists($cwords{substr($_, 0, 4)})) {
            $cwords{substr($_, 0, 4)} = 2;
        }
        if (!exists($cwords{substr($_, 0, 6)})) {
            $cwords{substr($_, 0, 6)} = 2;
        }
    }
    if (length($_) == 10) {
        if (!exists($cwords{substr($_, 0, 4)})) {
            $cwords{substr($_, 0, 4)} = 2;
        }
        if (!exists($cwords{substr($_, 0, 6)})) {
            $cwords{substr($_, 0, 6)} = 2;
        }
        if (!exists($cwords{substr($_, 0, 8)})) {
            $cwords{substr($_, 0, 8)} = 2;
        }
    }
}
close(WRDS);
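The lexicon loop stores every entry with value 1 and every shorter even-byte prefix of a longer entry with value 2, so the segmenter can later tell "is a word" apart from "merely starts a word". A compact sketch of the same bookkeeping (Python stand-in for the Perl; the original works on 2-byte GB2312 units, so ordinary characters here play the role of byte pairs):

```python
def add_word(table, word):
    """Mark word as a full lexicon entry (1); mark its proper prefixes
    as continuation hints (2) unless they are already full words."""
    table[word] = 1
    for k in range(2, len(word)):      # proper prefixes of length 2..len-1
        prefix = word[:k]
        if prefix not in table:
            table[prefix] = 2

table = {}
add_word(table, "abcd")
print(table)   # {'abcd': 1, 'ab': 2, 'abc': 2}
```

Because full entries overwrite prefix marks (but never the reverse), adding a short word after a longer one that contains it still leaves the short word with value 1.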
sub addsegword {
    ($_) = shift;
    return if /^\#/;
    $cwords{$_} = 1;
    if (length($_) == 6) {
        if (!exists($cwords{substr($_, 0, 4)})) {
            $cwords{substr($_, 0, 4)} = 2;
        }
    }
    if (length($_) == 8) {
        if (!exists($cwords{substr($_, 0, 4)})) {
            $cwords{substr($_, 0, 4)} = 2;
        }
        if (!exists($cwords{substr($_, 0, 6)})) {
            $cwords{substr($_, 0, 6)} = 2;
        }
    }
    if (length($_) == 10) {
        if (!exists($cwords{substr($_, 0, 4)})) {
            $cwords{substr($_, 0, 4)} = 2;
        }
        if (!exists($cwords{substr($_, 0, 6)})) {
            $cwords{substr($_, 0, 6)} = 2;
        }
        if (!exists($cwords{substr($_, 0, 8)})) {
            $cwords{substr($_, 0, 8)} = 2;
        }
    }
}
# Numbers
$numbers = "O AbAi:d 9-7-O 1 2 3 4 5 6 7 8 9 AM";
for ($n = 0; $n < length($numbers); $n += 2) {
    $cnumbers{substr($numbers, $n, 2)} = 1;
}
# Classifiers (to follow numbers)
$classifier = "- i-, ,LV EfA M WV, 11";
$classifier .= "TT-pq A A-*-tR Mk, R X4Id qmft-IFil Jji -9f ti%";
for ($n = 0; $n < length($classifier); $n += 2) {
    $classifiers{substr($classifier, $n, 2)} = 1;
}
# Wide ASCII words
$wascii = "a b c d e f g h i j k l m n o p q r s t u v w x y z";
$wascii .= "ABCDEFGH I JKLMNOPQRSTUVWXYZ-";
for ($n = 0; $n < length($wascii); $n += 2) {
    $cascii{substr($wascii, $n, 2)} = 1;
}
# Foreign name transliteration characters
$foreign = "iA E R? g gf gr M %I~j4 U Vr 2%3 tg f @ M RDtfH";
$foreign .= "b I;N3-T-31 ga g MWM 1 J+ f gg - g f g eg";
$foreign .= "leng ii iEft EpftL g g";
for ($n = 0; $n < length($foreign); $n += 2) {
    $cforeign{substr($foreign, $n, 2)} = 1;
}
# Chinese surnames
$surname = "?, 1- '";
$surname .= "-Egf, T 'tMM 1 - E fitE9}%UItll5UZ'5 4NME";
$surname .= "V MIJF -R,EM ct fA - *t 0 JTE:";
$surname .= "31-01,r M *P-MQM& UV,AJ5";
$surname .= "-T P H44QAFBW WQR-W";
$surname .= "tM flf *24fENft~Ae VA- 1150";
$uncommonsurname = "4Ji*4J~fl@ 4r";
for ($n = 0; $n < length($surname); $n += 2) {
    $csurname{substr($surname, $n, 2)} = 1;
}
for ($n = 0; $n < length($uncommonsurname); $n += 2) {
    $uncommoncsurname{substr($uncommonsurname, $n, 2)} = 1;
}
# Add in 2 character surnames; also add to lexicon so they'll be segmented as one unit
$csurname{"'tt"} = 1;  $cwords{"tEI1"} = 1;
$csurname{"TJj"} = 1;  $cwords{"TNh"} = 1;
$csurname{"jM"} = 1;   $cwords{"t4Mf"} = 1;
$csurname{"*4"} = 1;   $cwords{"y 4"} = 1;
$csurname{"VXB"} = 1;  $cwords{"RfPB"} = 1;
$csurname{" T"} = 1;   $cwords{"tt±"} = 1;
$csurname{"Ifjt"} = 1; $cwords{"Jr"} = 1;
$csurname{"Tfl"} = 1;  $cwords{"fJ1"} = 1;
$csurname{"=ift"} = 1; $cwords{"=i4M"} = 1;
$csurname{"P e"} = 1;  $cwords{"4 a"} = 1;
$csurname{"itg"} = 1;  $cwords{"iMgg"} = 1;
$punctuation = ", * [ 0 ® [I H -?";
# Not in name
$notname = "";
$notname .= $punctuation;
for ($n = 0; $n < length($notname); $n += 2) {
    $cnotname{substr($notname, $n, 2)} = 1;
}
sub addChineseNames {
    ($tmpline) = @_;
    $tlen = length($tmpline);
    $newline = "";
    for ($m = 0; $m < $tlen; $m++) {
        $tchar = substr($tmpline, $m, 1);
        $currtoken = "";
        if ($tchar =~ /^\s$/) {
            $newline .= $tchar;
        } else {
            while ($tchar !~ /\s/ and $m < $tlen) {
                $currtoken .= $tchar;
                $m++;
                $tchar = substr($tmpline, $m, 1);
            }
            if (defined($csurname{$currtoken}) or
                defined($uncommoncsurname{$currtoken})) { # found a surname, see what follows
                # go past following spaces
                $spaces = "";
                while ($tchar =~ /\s/ and $m < $tlen) {
                    $spaces .= $tchar;
                    $m++;
                    $tchar = substr($tmpline, $m, 1);
                }
                # Get next token
                $currtoken2 = "";
                while ($tchar !~ /\s/ and $m < $tlen) {
                    $currtoken2 .= $tchar;
                    $m++;
                    $tchar = substr($tmpline, $m, 1);
                }
                # go past following spaces
                $spaces2 = "";
                while ($tchar =~ /\s/ and $m < $tlen) {
                    $spaces2 .= $tchar;
                    $m++;
                    $tchar = substr($tmpline, $m, 1);
                }
                # Get next token
                $currtoken3 = "";
                while ($tchar !~ /\s/ and $m < $tlen) {
                    $currtoken3 .= $tchar;
                    $m++;
                    $tchar = substr($tmpline, $m, 1);
                }
                if (isChinese($currtoken2) and (length($currtoken2) == 2) and
                    !defined($cnotname{$currtoken2}) and
                    isChinese($currtoken3) and (length($currtoken3) == 2) and
                    !defined($cnotname{$currtoken3})) {
                    $newline .= $cname[0] . $currtoken . $currtoken2 . $currtoken3 . $cname[1];
                    #$cwords{$currtoken . $currtoken2 . $currtoken3} = 1;
                    #$cwords{$currtoken . $currtoken2} = 2; # short version for checking
                } elsif (isChinese($currtoken2) and (length($currtoken2) == 2)
                         and (!defined($cnotname{$currtoken2}))) {
                    $newline .= $currtoken . $currtoken2 . $spaces2 . $currtoken3;
                    $cwords{$currtoken . $currtoken2} = 1;
                } elsif (defined($csurname{$currtoken}) and
                         isChinese($currtoken2) and (length($currtoken2) == 4) and
                         ($cwords{$currtoken2} != 1) and
                         (!defined($cnotname{$currtoken2}))) {
                    $newline .= $cname[0] . $currtoken . $currtoken2 . $cname[1] . $spaces2 . $currtoken3;
                    $cwords{$currtoken . $currtoken2} = 1;
                    $cwords{$currtoken . substr($currtoken2, 0, 2)} = 2; # short version to check
                } elsif (defined($uncommoncsurname{$currtoken}) and
                         isChinese($currtoken2) and (length($currtoken2) == 4)
                         and (!defined($cnotname{$currtoken2}))
                         and ($cwords{$currtoken2} != 1)) {
                    $newline .= $cname[0] . $currtoken . $currtoken2 . $cname[1] . $spaces2 . $currtoken3;
                    $cwords{$currtoken . $currtoken2} = 1;
                    $cwords{$currtoken . substr($currtoken2, 0, 2)} = 2; # short version to check
                } else {
                    $newline .= $currtoken . $spaces . $currtoken2 . $spaces2 . $currtoken3;
                }
            } else {
                $newline .= $currtoken;
            }
            $m--; # reset so won't skip space
        }
    }
    $newline;
}
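addChineseNames implements a simple heuristic: when a token is a known surname, look at the following token, and if it looks like a given name (Chinese, short, and not on the "not in a name" list) glue it onto the surname as one personal-name unit. A minimal sketch of that decision, with invented stand-in tokens (the real code works on GB2312 byte pairs and also handles the two-token given-name case):

```python
def join_name(tokens, surnames, notname):
    """Merge a known surname with a short following token into one
    name token; leave everything else unchanged."""
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in surnames and i + 1 < len(tokens):
            given = tokens[i + 1]
            # "short" stands in for the 1-2 character given-name test
            if len(given) <= 2 and given not in notname:
                out.append(tok + given)
                i += 2
                continue
        out.append(tok)
        i += 1
    return out

print(join_name(["Wang", "xy", "said"], {"Wang"}, set()))
# -> ['Wangxy', 'said']
```

The notname list is what keeps function words that happen to follow a surname from being absorbed into it.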
#sub cwordstart {
#    my($tword) = @_;
#    if (grep(/^$tword/, @cwordlist) > 0) {
#        return 1;
#    } else {
#        return 0;
#    }
#}
sub isChinese {
    my($cchar) = @_;
    for ($b = 0; $b < length($cchar); $b++) {
        if (unpack("C", substr($cchar, $b, 1)) < 128) {
            return 0;
        }
    }
    return 1;
}
sub allnum {
    ($localnum) = @_;
    # Need this if?
    if ($localnum =~ m/[0-9][0-9]*(\.[0-9]+)?/) {
        return 1;
    }
    for ($k = 0; $k < length($localnum); $k += 2) {
        if (!defined($cnumbers{substr($localnum, $k, 2)})) {
            return 0;
        }
    }
    return 1;
}
sub allnumbers {
    my($localnum) = @_;
    if ($localnum =~ m/[0-9][0-9]*(\.[0-9]+)?/) {
        return 1;
    }
    for ($k = 0; $k < length($localnum); $k += 2) {
        if (!defined($cnumbers{substr($localnum, $k, 2)})) {
            return 0;
        }
    }
    return 1;
}
sub allwascii {
    ($localstr) = @_;
    for ($k = 0; $k < length($localstr); $k += 2) {
        if (!defined($cascii{substr($localstr, $k, 2)})) {
            return 0;
        }
    }
    return 1;
}
sub allforeign {
    ($localstr) = @_;
    for ($k = 0; $k < length($localstr); $k += 2) {
        if (!defined($cforeign{substr($localstr, $k, 2)})) {
            return 0;
        }
    }
    return 1;
}
sub segmentline {
    my($line) = @_;
    $chinaccum = "";
    $outline = "";
    $linelen = length($line);
    for ($i = 0; $i <= $linelen; $i++) {
        $char1 = substr($line, $i, 1);
        if (unpack("C", $char1) > 127) {
            $chinchar = substr($line, $i, 2);
            if ($chinaccum eq "") {
                $outline .= " " unless $i == 0;
                $chinaccum = $chinchar;
            } else {
                if (defined($cwords{$chinaccum . $chinchar}) and
                    $cwords{$chinaccum . $chinchar} == 1) { # is in lexicon
                    $chinaccum .= $chinchar;
                } elsif (allnum($chinaccum) and (defined($cnumbers{$chinchar})
                         or defined($classifiers{$chinchar}))) {
                    $chinaccum .= $chinchar;
                } elsif (allwascii($chinaccum) and defined($cascii{$chinchar})) {
                    $chinaccum .= $chinchar;
                } elsif (allforeign($chinaccum) and defined($cforeign{$chinchar}) and
                         defined($cwords{substr($line, $i, 4)}) and
                         $cwords{substr($line, $i, 4)} != 1 and
                         $cwords{substr($line, $i, 4)} != 2) {
                    $chinaccum .= $chinchar;
                } elsif (defined($cwords{$chinaccum . $chinchar}) and
                         ($cwords{$chinaccum . $chinchar} == 2) and
                         defined($cwords{$chinaccum . $chinchar . substr($line, $i+2, 2)}) and
                         (($cwords{$chinaccum . $chinchar . substr($line, $i+2, 2)} == 1) or
                          ($cwords{$chinaccum . $chinchar . substr($line, $i+2, 2)} == 2))) {
                    # starts a word in the lexicon
                    $chinaccum .= $chinchar;
                } else {
                    $outline .= $chinaccum . " ";
                    $chinaccum = $chinchar; # start anew
                }
            }
            $i++;
        } else { # Plain ascii text, attach any accumulated Chinese and then ascii
            if ($chinaccum ne "") {
                $outline .= $chinaccum . " ";
                $chinaccum = "";
            }
            $outline .= $char1;
        }
    }
    $chinline = addChineseNames($outline);
    $chinline;
}
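segmentline grows a run of Chinese text greedily, extending the current word whenever the next character pair keeps it a lexicon word or a known prefix of one, and emitting a break otherwise. The same maximum-matching idea in a compact forward form (a sketch over ordinary strings with a toy lexicon, not the thesis wordlist or the exact Perl control flow):

```python
def mms_segment(text, lexicon, max_len=8):
    """Greedy forward maximum matching: at each position take the
    longest lexicon word; fall back to a single character."""
    out, i = [], 0
    while i < len(text):
        for k in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + k] in lexicon or k == 1:
                out.append(text[i:i + k])
                i += k
                break
    return out

lex = {"abc", "ab", "cd"}
print(mms_segment("abcd", lex))   # -> ['abc', 'd']
```

Note the characteristic MMS failure mode visible even here: taking the longest match "abc" blocks the alternative segmentation "ab" + "cd", which is exactly why the thesis compares several segmentation algorithms.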
B5
Translating wordlists from Web based dictionary
#!/afs/rcf/lang/bin/perl
srand;
# Build arrays of the words: [english word] [type] [chinese word]
open(INFILE, "< /afs/rcf/project/empty/data/dictionary/chinese/online/ntw.types")
    or die "can't open file";
$i = 0;
while (<INFILE>) {
    chomp;
    if ($_ !~ m/^\;/) {
        $list1{$_} = rand;
        $list2[$i] = $_;
        $i++;
    }
}
@list3 = sort { $list1{$a} <=> $list1{$b} } @list2;
foreach $enword (@list3) {
    print $enword . "\n";
}
# Prompt ciba-online for the Chinese word
sub pipecmd {
    my $cmd = shift(@_);
    if (open(SENDOUT, "|$cmd")) {
        foreach $line (@_) {
            print SENDOUT $line . "\n";
        }
        close SENDOUT;
    } else {
        die;
    }
}
#exit(0);
open(OUT, "> /afs/rcf/project/corporal/chinese/danhu/zh-geo-world-provinces.txt")
    or die "can't open output file";
foreach $enword (@list3) {
    $enword =~ s/^\s+$//;
    $wordlen = length($enword) + 6;
    pipecmd(
        "/afs/rcf/user/jhndrsn/bin/nc webproxyl.mitre.org 80 > tempword.html",
        "POST http://ciba.kingsoft.net/cgi-bin/ecdetail.cgi HTTP/1.0",
        "Referer: http://ciba.kingsoft.net/online/main.html",
        "User-Agent: Mozilla/4.75 [en] (X11; U; SunOS 5.7 sun4u)",
        "Host: ciba.kingsoft.net:80",
        "Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*",
        "Accept-Encoding: gzip",
        "Accept-Language: en-US, en, en-GB, de, zh-CN, zh-TW, cs",
        "Accept-Charset: iso-8859-1,*,utf-8",
        "Content-type: application/x-www-form-urlencoded",
        "Content-length: $wordlen",
        "",
        "terms=$enword");
    $newhtmlfile =
        "/afs/rcf/project/corporal/chinese/danhu/ntwhtmlfiles/" . $enword . ".05082001.html";
    print $newhtmlfile . "\n";
    pipecmd("cp tempword.html $newhtmlfile");
    print $enword . "\n";
    undef($/);
    open(HTML, "< tempword.html") or die "unable to find tempword.html";
    $htmltext = <HTML>;
    close(HTML);
    $chword = extractmeaning($htmltext);
    $wordhash{$enword} = ["$chword", "countries"];
    print $enword . "\n";
    sleep 1;
}
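The script speaks raw HTTP/1.0 through nc rather than using an HTTP library, which is why it computes `$wordlen = length($enword) + 6`: the POST body is the string `terms=<word>`, and `terms=` is exactly 6 characters. A small sketch that builds the equivalent request text (it only constructs the string, it does not open a connection; the URL and header values are copied from the listing):

```python
def build_post_request(word):
    """Build the raw HTTP/1.0 POST the Perl script pipes through nc.
    Content-length is len(word) + 6 because the body is 'terms=<word>'."""
    body = "terms=" + word
    lines = [
        "POST http://ciba.kingsoft.net/cgi-bin/ecdetail.cgi HTTP/1.0",
        "Host: ciba.kingsoft.net:80",
        "Content-type: application/x-www-form-urlencoded",
        "Content-length: %d" % len(body),
        "",                 # blank line separates headers from the body
        body,
    ]
    return "\r\n".join(lines)

print(build_post_request("Beijing").splitlines()[3])
# Content-length: 13
```

Getting Content-length wrong by even one byte makes the server truncate or hang on the body, which is presumably why the script hard-codes the +6 rather than deriving it later.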
# Extract the Chinese word from the junk that is passed back and place it in the array
sub extractmeaning {
    my($mytext) = @_;
    undef($/);
    $mytext =~ s/^.*\[f"dk\]//s;
    if ($mytext =~ m/\(\*5166\|'2ia-2,'e6T\)/) {
        #print "(*5[66|'P6 >2'&6t)\n";
        return 0;
    } else {
        $mytext =~ s/^.*n\.//s;
        $i = 0;
        $mychword = "";
        while ((substr($mytext, $i, 1) ne ",") and
               (substr($mytext, $i, 1) ne ")") and
               (substr($mytext, $i, 1) ne "<")) {
            $mychword .= substr($mytext, $i, 1);
            #print "$mychword\n";
            $i++;
            #print substr($mytext, $i, 1)."\n";
        }
        $mychword =~ s/^\S+//;
        #print "$mychword\n";
        return $mychword;
        #print $mytext;
    }
}
# Print the resulting array however deemed most useful
foreach $enword (keys %wordhash) {
    @array = @{$wordhash{$enword}};
    print $array[0] . " " . $array[1] . "\n";
    printf("%15s%15s%20s\n", $enword, $array[0], $array[1]);
}
Appendix C Sample Score File
C1
Evaluation output of the trained phraser
Parameters presumably in effect when rules were derived:
;;; Host machine on which learner executed (excl::*current-site-name*) = delphi.mitre.org
;;; *RELEVANT-LOCS* = LEFT-2 LEFT-1 WD-ANY WD-SPAN RIGHT-1 RIGHT-2 LWD-1 LWD-2 RWD-1 RWD-2
;;; *EXTERNAL-LOCS* = LEFT-1 LEFT-2 RIGHT-1 RIGHT-2 RIGHT-3 LEFT-3
;;; *COMPLEX-LOC-RESTRICTIONS* = (2 RIGHT-1 RIGHT-2) (2 RIGHT-2 RIGHT-1) (2 LEFT-1 LEFT-2) (2 LEFT-2 LEFT-1) (2 LEFT-1 RIGHT-1) (2 RIGHT-1 LEFT-1) (2 LWD-1 LEFT-1) (2 RWD-1 RIGHT-1)
;;; *COMPLEX-LOC-INCLUSION-PCTS* = (1 1.0) (2 0.5) (3 0.2) (4 0.1) (5 0.1) (6 0.1)
;;; *RELEVANT-KEYS* = NT-CONTRACT-RIGHT-1 NT-CONTRACT-LEFT-1 NT-EXPAND-LEFT-1 NT-EXPAND-RIGHT-1 NT-SEED NT
;;; *IRRELEVANT-KEYS* = NIL
;;; *RELEVANT-ANSWERS* = (:NT-CONTRACT-RIGHT-1 :NT-CONTRACT-LEFT-1 :NT-EXPAND-LEFT-1 :NT-EXPAND-RIGHT-1 :NT-SEED :NT)
;;; *IRRELEVANT-ANSWERS* = NIL
;;; *ACTIVE-RULE-PREDICATES* = VB-CLASS? NN-CLASS? NNP-CLASS? ADV-CLASS? JJ-CLASS? PART-OF-GEO-PHRASE? PART-OF-HUMAN-PHRASE? PART-OF-ORG-PHRASE? PART-OF-CORP-PHRASE? ELSEWHERE-AS-ORG? ELSEWHERE-AS-GEO? ELSEWHERE-AS-HUMAN? ELSEWHERE-AS-CORP? PERSON-WORD? CITIZENSHIP-WORD? DATE-WORD? CARDINAL-WORD? STATE-WORD? GEO-CONTINENT-WORD? GEO-THE-WORD? GEO-MAJOR? GEO-ABBREV? COUNTRY-WORD? ORG-WORD? TITLE-WORD? GAZ-WORD? SPOKESPERSON-WORD? SUBSIDIARY-WORD? CORP-WORD?
;;; *TRAINING-REGIME-NAME* = NIL
;;; *INTERNAL-LOCS* = (:LWD-1 :LWD-2 :RWD-1 :RWD-2)
;;; *RELEVANT-OBJ-TYPES* = (:LEX :POS :PREDICATE :WORD-LIST)
;;; *ALIGNMENT-LOCS* = (:LEFT-1 :LEFT-2 :RIGHT-1 :RIGHT-2)
;;; *SCORING-FUNCTION* = :F-MEASURE
;;; *REFINEMENT-SCORING-FUNCTION* = :YIELD-MINUS-SACRIFICE
;;; *F-MEASURE-BETA* = 0.8
;;; *MAX-RULE-CONDITIONALITY* = 1
;;; *USE-RULE-PREDICATE-INVERSES?* = NIL
;;; *SET-COMPLEX-LOC-RESTRICTIONS-RANDOMLY?* = NIL
;;; *SIMPLE-LOC-RESTRICTIONS* = :*
;;; *PHRASER-LANGUAGE* = NIL
;;; *ALIGNMENT-YIELD-OUTLIERS-FACTOR* = 2
;;; *MISALIGNMENT-RESOLUTION* = 3
;;; *TRAINING-SET-NAME* = "Unspecified"
;;; *UNIVERSE-SIZE* = 300
;;; *INITIAL-PHRASING-TYPE* = :NO-PHRASING
;;; *INITIAL-TEST-SENT* = NIL
;;; *TESTING-SIZE* = 100
;;; *TRAINING-SIZE* = 200
;;; *CACHE-RULE-PREDICATES?* = T
;;; *RESOLVE-MISALIGNMENTS?* = T
;;; *RESOLVE-MISLABELINGS?* = T
;;; *RESOLVE-INITIAL-PHRASING?* = T
;;; *REJECT-REPEATED-RULES?* = 6
;;; *AFFIX-LENGTHS* = (2)
;;; *GENERATE-IGNORE-RULES?* = T
;;; *NORMALIZE-ALL-STRINGS?* = T
;;; *GENERATE-NUMERIC-COMPLEX-OBJECTS?* = NIL
;;; *ALWAYS-PURSUE-REFINEMENT* = T
;;; *ENABLE-GROWING-IN-EXPAND-PHRASE* = NIL
;;; *PHRASES-FROM-PREPROCESSING* = NIL
;;; *EPOCH-DEPENDENT-CODE* = NIL
;;; *RANDOM-TIE-ARBITRATION?* = 50
;;; *RANDOM-LEXEME-SCORING-RATE* = NIL
;;; *RANDOM-SENTENCE-SELECTION-RATE* = 50
;;; *ACTIVE-RULE-PREDICATES-SOURCE* = *ACTIVE-RULE-PREDICATES-FOR-N*
;;; *ORIGINAL-TESTING-SENTS-SOURCE* = "Unspecified"
;;; *ORIGINAL-TRAINING-SENTS-SOURCE* = "Unspecified"
;;; (length *all-sentences*) = 0
;;; (length *training-sentences*) = 16227
;;; (length *original-testing-sents*) = 2641
;;; (length *original-training-sents*) = 16227
Data generated on Thu 3-May-01 at 12:05:23 PM
Applying learned rules to TRAINING SET sentences.

No.   Key   Ans  Corr  Miss  Sprr  R(%)[Lbl]  P(%)[Lbl]  F-measur  Fired
------------------------------------------------------------------------
  0  3665     0     0  3665     0   0.0                             0
  1  3665  3080  1010  2655  2070  27.557981  32.792206    30.53   3080
  2  3665  3033  1065  2600  1968  29.058662  35.11375     32.47    372
  3  3665  3007  1134  2531  1873  30.941338  37.712006    34.74    346
  4  3665  3597  1134  2531  2463  30.941338  31.52627     31.30    590
  5  3665  3591  1225  2440  2366  33.424286  34.11306     33.84    597
  6  3665  3591  1422  2243  2169  38.799458  39.599       39.28    319
  7  3665  3583  1507  2158  2076  41.11869   42.059727    41.69    245
  8  3665  3576  1478  2187  2098  40.32742   41.331097    40.93    502
  9  3665  3575  1556  2109  2019  42.45566   43.524475    43.10    104
 10  3665  3180  1555  2110  1625  42.42838   48.899372    46.15    395
 11  3665  2837  1536  2129  1301  41.90996   54.1417      48.61    343
 12  3665  2977  1536  2129  1441  41.90996   51.59557     47.33    140
 13  3665  2966  1582  2083  1384  43.165077  53.33783     48.85    170
 14  3665  2962  1605  2060  1357  43.792633  54.186363    49.59     55
 15  3665  2962  1659  2006  1303  45.26603   56.009453    51.26     66
 16  3665  2962  1705  1960  1257  46.521145  57.562458    52.68     47
 17  3665  2962  1745  1920  1217  47.612553  58.912895    53.92    178
 18  3665  2952  1742  1923  1210  47.530697  59.01084     53.93    118
 19  3665  2952  1764  1901  1188  48.13097   59.7561      54.61     34
 20  3665  3018  1764  1901  1254  48.13097   58.449303    53.94     66
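The F-measur column combines the R(%) and P(%) columns using the *F-MEASURE-BETA* = 0.8 setting listed in the parameters, i.e. the standard F-beta formula that weights recall slightly below precision. Checking it against iteration 1 and iteration 19 of the training run (values taken from the table):

```python
def f_beta(precision, recall, beta=0.8):
    """F-beta score: weighted harmonic mean of precision and recall.
    beta < 1 gives precision more weight than recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(round(f_beta(32.792206, 27.557981), 2))   # 30.53 (iteration 1)
print(round(f_beta(59.7561, 48.13097), 2))      # 54.61 (iteration 19)
```

Both reproduce the tabulated F-measur values exactly, confirming that the learner's internal score is this beta-weighted F rather than the balanced F1.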
Testing learned rules against independent test set of sentences.

No.   Key   Ans  Corr  Miss  Sprr  R(%)[Lbl]  P(%)[Lbl]  F-measur  Fired
------------------------------------------------------------------------
  0   728     0     0   728     0   0.0                             0
  1   728   523   168   560   355  23.076923  32.12237     27.86    523
  2   728   517   171   557   346  23.48901   33.075436    28.53     53
  3   728   510   180   548   330  24.725275  35.294117    30.25     52
  4   728   608   180   548   428  24.725275  29.605263    27.49     98
  5   728   608   203   525   405  27.884615  33.388157    31.00    109
  6   728   608   231   497   377  31.730768  37.993423    35.28     57
  7   728   608   245   483   363  33.653847  40.296055    37.41     46
  8   728   608   231   497   377  31.730768  37.993423    35.28     99
  9   728   608   246   482   362  33.791206  40.460526    37.57     16
 10   728   554   246   482   308  33.791206  44.40433     39.56     54
 11   728   523   246   482   277  33.791206  47.036327    40.80     31
 12   728   542   246   482   296  33.791206  45.38745     40.03     19
 13   728   540   253   475   287  34.752747  46.851852    41.25     20
 14   728   540   254   474   286  34.89011   47.037037    41.41      2
 15   728   540   257   471   283  35.302197  47.59259     41.90      4
 16   728   540   260   468   280  35.714287  48.148148    42.39      6
 17   728   540   265   463   275  36.4011    49.074074    43.20     40
 18   728   539   263   465   276  36.126373  48.794064    42.92     25
 19   728   539   268   460   271  36.813187  49.721706    43.74      6
 20   728   549   268   460   281  36.813187  48.81603     43.31     20
C2
Evaluation output of the manual phraser

* SUMMARY SCORES *

                 POS   ACT |  COR  PAR  INC |  MIS  SPU  NON |  REC  PRE  UND  OVG  SUB  ERR
enamex
  organization   801   934 |  684    0    0 |  117  250    0 |   85   73   15   27    0   35
  person           0     0 |    0    0    0 |    0    0    0 |    0    0    0    0    0    0
  location         0     0 |    0    0    0 |    0    0    0 |    0    0    0    0    0    0
  other            0     0 |    0    0    0 |    0    0    0 |    0    0    0    0    0    0
timex
  date             0     0 |    0    0    0 |    0    0    0 |    0    0    0    0    0    0
  time             0     0 |    0    0    0 |    0    0    0 |    0    0    0    0    0    0
  other            0     0 |    0    0    0 |    0    0    0 |    0    0    0    0    0    0
numex
  money            0     0 |    0    0    0 |    0    0    0 |    0    0    0    0    0    0
  percent          0     0 |    0    0    0 |    0    0    0 |    0    0    0    0    0    0
  other            0     0 |    0    0    0 |    0    0    0 |    0    0    0    0    0    0

OBJ SCORES
enamex           801   934 |  684    0    0 |  117  250    0 |   85   73   15   27    0   35
numex              0     0 |    0    0    0 |    0    0    0 |    0    0    0    0    0    0
timex              0     0 |    0    0    0 |    0    0    0 |    0    0    0    0    0    0

SLOT SCORES
enamex
  text           801   934 |  608    0   76 |  117  250    0 |   76   65   15   27   11   42
  type           801   934 |  684    0    0 |  117  250    0 |   85   73   15   27    0   35
  status           0     0 |    0    0    0 |    0    0    0 |    0    0    0    0    0    0
  alt              0     0 |    0    0    0 |    0    0    0 |    0    0    0    0    0    0
numex (text, type, status, alt)    all 0
timex (text, type, status, alt)    all 0

SECT SCORES
doc               10

ALL SLOTS       1602  1868 | 1292    0   76 |  234  500    0 |   81   69   15   27    6   39

SUBTASK SCORES / F-MEASURES
P&R 74.47    2P&R 71.19    P&2R 78.06
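The C2 summary columns follow the usual MUC scoring definitions: REC = COR/POS, PRE = COR/ACT, UND = MIS/POS, OVG = SPU/ACT, SUB = INC/(COR+INC), and ERR = (INC+MIS+SPU)/(COR+INC+MIS+SPU). Recomputing the enamex rows from the counts that survive in the report, as a cross-check (a sketch of the metric arithmetic only, not MITRE's scorer):

```python
def muc_scores(pos, act, cor, inc, mis, spu):
    """Per-row MUC metrics, rounded to whole percent as in the report."""
    rec = 100.0 * cor / pos
    pre = 100.0 * cor / act
    und = 100.0 * mis / pos
    ovg = 100.0 * spu / act
    sub = 100.0 * inc / (cor + inc) if cor + inc else 0.0
    err = 100.0 * (inc + mis + spu) / (cor + inc + mis + spu)
    return [round(x) for x in (rec, pre, und, ovg, sub, err)]

# enamex type slot: POS 801, ACT 934, COR 684, INC 0, MIS 117, SPU 250
print(muc_scores(801, 934, 684, 0, 117, 250))   # [85, 73, 15, 27, 0, 35]
```

The same function applied to the text slot (COR 608, INC 76) reproduces the 76/65/15/27/11/42 row, so the two slot rows and the ALL SLOTS totals are mutually consistent.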
Appendix D Supplemental Wordlists for Phraser
D1
Surname list
[surname characters not legible in this copy]
D2
Foreign phonetic translation characters
[transliteration characters not legible in this copy]
D3
Organization and Location Suffixes
Org    Org    Org    Loc    Loc    Loc
[suffix characters not legible in this copy]
Appendix E Sample Training Data and Other Pre-tagged Corpus
E1
University of Pennsylvania Chinese Treebank project
<DOC>
<DOCID> CB302002.BFY 1094 </DOCID>
<HEADER>
<DATE> 1994-09-02 09:31:23 3B:spcOl 1 </DATE>
</HEADER>
<BODY>
<HEADLINE>
((IP-HLN (NP-SBJ (NP-PN (NR rP)) (NP (NN RK)) (NN Wf) (NN *Ak)))
</HEADLINE>
<HEADLINE>
((FRAG (NR
$1j4P±)(NR 1-i) (NT 9 ,q) (NT 2 H) (NN ,) (NN 9) (PU :)
*R) (VV %
</HEADLINE>
<HEADLINE>
h Pa))
(PU , ) (NN MJZ-)
(NR -EfJ,)
))
(NP (NN
RNA)
(PU,
(ADVP (AD E)) (VP (VV
EM4)))
(DEG Vj)) (NP (NN *11)))
1K5r))
(CC
(NN
Wf) (NN
$R))) (VP (VP (VV
M()) (QP (CD T) (CLP (M 4))) (NP (NN
2))) (PU. ))) </S> <S> ( (IP (NP-SBJ (DNP
%(
(NP-PN (NR
jj
(NR rPM) (NN
) (NN
N))) (NP
(VV
3(*2))
RA
) (NN
W)
(NN
)
( (FRAG (NR V%$$1) (NN i2t)
</HEADLINE>
<TEDNPT>
<P>
<S>
((IP (NP-SBJ (NP-PN (NR rf))
(NN rj)
(VP (VV
&)
(NP (NP (NN -M)) (NP-PN (NR
(PU,
) (AS -) (NP-OBJ (NP-APP
Fr&)))) (PU,
) (VP (NP-TMP (NT
(NP-PN (QP (OD MA) (CLP (M
Tii-)) (NP (NN 1[K)))) (NP (NN
) (VP (ADVP (AD )j)) (ADVP (AD i))
RK))
(NN
(VP (PP-BNF (P -) (NP (PN 4t))) (VP
A) (NP-OBJ (CP (WHNP-1 (-NONE- *OP*)) (CP (IP (NP-SBJ (-NONE- *T*-1)) (VP (VA f))) (DEC Ij))) (NP (NN
CIM)))) (PU .
</P>
) </S>
<P>
<S>
((IP (DP-SBJ (DNP (NP (NN
RK))
(NN fV) (NN *2I)
(NN
RM)) (DEG
Rj)) (DP (DT t) (QP (CD -) (CLP (M t)))))
(VP (ADVP (AD 9)) (VP (VV ?-FTf) (NP-OBJ (DNP (NP (NN JZ) (CC f[a) (NN KJf)) (DEG V)) (NP (NN
j V)))) (PU ,
</P>
ML)
(NN
))</S>
<P>
<S>
((IP (LCP-TMP (IP (NP-PN-SBJ (NR P$) (NN AR) (NN AJf[])) (VP (VV
%))
p)~Qj))
(LC ,)7}))
(NP (NN lip))) (VP (ADVP (AD qN)fl) (VP (VV L)) (NP-OBJ (NP (NN ?±h)(NN
?All)
(NN
(PU,
(NN -zlm) (PU, ) (NN *X)) (NP (NN
)
(PU, )) ) </S>
<S>
((IP (IP (NP-SBJ (NN jgfil) (NN TE-)) (VP (VV f)F-) (NP-OBJ (NP-APP (NN y-M) (NN f4J) (ETC
i4))))) (PU,
) (IP (NP-SBJ (NN
S4t)) (VP (VV
9Ti) (NP-OBJ (NN f:lt))))
(PU,
) (NP-SBJ (DP (DT
51)) (CC
[) (NP (NP
%)) (NP
(NN
)) ) </S>
<S>
((IP (PP (P f yt) (NP (DNP (NP (NN
RNA) (NN f4z)
RAy)
(NN -$R))(VP (VV
(NN *2N))
(DEG C,)) (QP (CD -))
#f4M) (VP (VV 4W)))
(PU,
(NP (NN nf4)))) (PU , ) (NP-SBJ (NN
)) ) </S>
</P>
<P>
<S>
((IP (NP-SBJ (NN r*))
(VP (DP-TMP (DT Z) (CLP (M *)))
(ADVP (AD a)) (VP (VV
SBJ (-NONE- -PRO*)) (VP (VV Mf) (PP (P -f)(IP (NP-SBJ (-NONE- *pro*)) (VP (VV Jff)
M)
(NP-OBJ (NN
(NP-OBJ (NN
#;))(IP (NPRWA) (NN
</S>
<S>
((IP (NP-SBJ (DNP (ADJP (JJ k~tI)) (DEG 1J)) (NP (NN
Et4 (It) (NN /fjg))) (VP (VV #Fe)
HMA) (NN
(NP-OBJ (NN *,))))(PU,
ff)
(NN jt)))
(VP (PP-TMP (P -T) (NP (NT
)))
</S>
<S>
((IP (NP-SBJ (NN
</S>
RM)) (VP (VP (VV fi)
) (VP (VV IRh) (NP-OBJ (NN -JW
(NP-OBJ (NN 1%j%))) (PU,
)))) (PU,
)))
<S>
((IP (NP-SBJ (NN R&X) (NN f$)
(NN WI)) (VP (ADVP (AD AJ)) (VP (VV
iRK)))
(PU.
)) ) </S>
</P>
</TEDNPT>
</BODY>
</DOC>
E2
Beijing University Institute of Computational Linguistics People's Daily project
19980101-01-001-001/m 32 t)/v 3fj/v tT/n 1'A/u V/a ft&2/n
-/w -tNtRTE/t
fj*f/t i/t4/n
(/w [J/v R]1)t/n
1 /m I/q ) /w
19980101-01-001-002/m *rpj$%/nt jip,
iF/n , /w M*/n
/n
/r
/r
(/w -)L)uL-L*/t +---^ j /t ---- H/t ) /w
19980101-01-001-004/m 1 2fl/t 3 1 H/t , /w tMr%nt ,4 jE/n , /w M
/n Mj:/n E/nr fMi/nr "Q/v
1 9 9 8*/t f */t f~tij/n (/w )E j/v 3fr/v tM/n
j/u f/a ft2/n ) /w , /w (/w WjRp/nt jE2/n M/nr
fEf/nr fiNg ) /w
19980101-01-001-005/m Fvl 2/n fiD/k , /w NM /n fl/k , /w ±in ffl/k , /w tt/n
flF/k : /w
19980101-01-001-006/m 7±/p 1 9 9 8 -/t *1/v 2VTh/f , /w V/r +±4/m Aj'/a M/u 3Aj4/p [rpq/n Af/n
yf/vn *9/n]nt , /w [tp/ns E[j/n T-0/vn t-/n]nt ff/c [tI/n 4tWP/n]nt, /w
rj/p A [/n R/r AR/n
/w Frq/p [WA/ns J IJ/a fiMI/n]ns rvn/n , /w irt-/ns fI/c
I-/ns r[ll /n
/w if-I/s inh/n , /w r',/p
ftW/n &-f/r ro/u M /n i1/k , /w RJ;A/v i)J/a 0j/u FnoR/vn fa/c af/a nJ/u $5-/vn ! /w
19980101-01-001-007/m 1 9 9 7 F/t , /w i/v 't
/ns RI/vn M t/n I/f 4F,/d _I 7/a ftj/u JR/d T/d TYL/a
1Yj/u -/m */q , /w tF[/ns A
/n ZlC/d J*/v y$/nr /lA/nr MrIC/n OJ/u Mi$/n , /w g/v fp/p EAJ/v ;/v
19980101-01-001-003/m
1P/ns J$1t/n
±-1_X/n *L/n *Fr /v (i3 /v
/w [r'F/ns i[jj-/n]nt Rpftj/ad Fkk/v Xf/p Wt/ns fif*/v it7/n
"/w -MM$1i/j
"/w
, /w "/w iAA/I "/w , /w A. q/d lfJ/v j/u #J-/n Rflf/v W /ns jR/u
/w #/c ,M/p
%50/an Mt/an , /w [t[/ns JUM/n]nt frfj/a it/u Bf+/v -/u
±+H/m R/q -M,-/n
, /w A#/v
E/J\TM/n f1 /a )54P/n , /w ,
f/v -i/m gjt/n , /w
RT/v fj/a 0J/u t22/n , /w $ij]/v T/u 4ifins
,
P-/v ft 2/n Q-)/v /u ffj /vn %q))/n . /w
19980101-01-001-008/m r/p Z/r -/m */q r/f , /w r11g/ns Cj/u Ekg/vn f){/vn fo/c 3JftJL/vn lti/vn 9Vq/v
Fjglij/v
LA/v, /w NRfQV/n fM
/v T/u "/w A/a it-K/vn , /w f/a AM/j"/w WJ/u AR/a &M/vn 3!/n , /w
'K/n tfj,/vn XR/d RJf/v f/a 0j/u 4}(y /n , /w LVn 2kT*/vn ?[AK/v RjL/v , /w )
/n tit/vn 3A-t/d
21*/v . /w x4$l/vn /,Mj/n 1J*/n 6-fF/vn 4/c Zjp{/vn TF/d If'-/v . /w R]5/a f~iJ/n Ej/vn
, /w
M$$ZAn-/n xLt/vn fU/c ft/r
-Ji/r k/n X/d ;/v Wj/a 0/u MA/vn , /w
fill/r +4/m MjI/v fRA/t -t/m
RTfl/n
- /m *%/n fo/c Jtf[/n
t/v In/u
&M/n 3/n , /w
ffl/r tfff/v AA/p A /r M*/n @f/c *Il-/n
j/u rhf/an J;) !/c qM/v aj/u MjT,/n 6ff/vn , /w Mt /n -/v At'/d 4IJ/v WV(/vn , /w M j~iM/c, /w
rP/ns A*/v fn/c &)F/v( ,/u
JP/ E /v Rf4/v Tf/u V, /an . /w
19980101-01-001-009/m t/p A/r -/m */q r/f , /w tg/ns fj/u +/n i -/vn
W& /v T /u *3/a jj%/n , /w
A Lf/p A.jG/n T-hj/v , /w 't'H/ns 4/p M/ns , /w fRf/ns
/w AM/ns , /w H;/ns N/u kI/n Tf /v
T/u yRj/n y ./-/n k*/t &jR/v 0j/u
& 4RPM
/
O/u
i&)/vn , /w
A't/r
fi2/n 8
*-/a 6fF/vn iiT- /d
#jt/n , /w rPM/ns 4/p ThLJi/n M*/n fn/c r'/b
h)nd/v. 1w *R/ns VUR/ad *-/v []E/j /j
M/j /n]nt 0j/u
VT,/n f5/c Th:±/vn
lhP/v T/u -9/ns -/w r/j H/j 0f/j Wf/c 4M/ns -/w ,:/ns -o/n 4IE-/b -u4/vn , /w
/vn , /w
f-v fT/n 4/c Rt/v f/u EtIft/n MN/n , /w 1Ills/v ftW/n j rq/v : ,tRJE/v In/u
+M/n , /w xI4P/p
QjA/v NBjT/n
$±,/n fj/u &cf/a -ftff/vn 1/c )MF/b l/vn ffh/v
-f/u
111R/a 1Y/u
af/n
/w
1t11/u Rffj/v f/a @j/u )tkR/n . /w
19980101-01-001-010/m 1 9 9 8 */t , /w 4i /ns )AN/n V/d jf{/I
ffTlt/n /u
3v
/ /c 1fl/r :'F/p 4/3/n 4k/n c)/v 4i/f a/d Aj;[,/v Tj.//m WE/an , /w fR/c I{1/r
2M/v 9+4k/v Ti/a 2 0/m */q */f VM/v fjj/u {Mt/a p9±/n fo/c 17/v WJ/u 4 /a H/n
11,/vn , /w /v
, /w j/d W/v -+9t1/r j/u ;ft/r fIll/a *{4/n , /w ,jf/r -Z/d f /v -Jl/v ALt/r fI/an , /w lgg/v
/
/w 9111/v tKA1/n , /w ffCAV/1 , /w L/v
+/i,/w
g3I/I , /w Rr/c Rfj-l/r A-9/d E4
4Yn h1-tn jlu JM/n A/c l/v W/d W/v X/d ?tr/a . /w
t'ns
19980101-01-001-011/m 30N/v RX/n 1Y/u lj/a Y-/vn , /w j/v r$in/s ±/ /n 4i/ns A/n
W/v
j/u
Mi4j/b
{
/n /w
AiA/p q/j X/j 9V)/n W/u 9Tfh/vn 1/c 4R/an , /w Vj5/p "/w -FM$1j/j "/w
7ft/n f/c frK]/ns
/d %, /v PrVi/ad 5WJN/v . /w
(/w 4 :A/n ) /w , /w 1 9 9 9 4/t 1 2 fl/t jq /ns Wj/u H )-/vn 19980101-01-001-012/m
gns i/v 4i[/ns fi±/n T+5)-Wj/ O/u -/m %f/n . /w -/v ?fl/n p-/vn ,w
M/v AMTrg/i
G-L/ns
/w "/w
tY/fn
/c
,
A 4/n
/j "/w
$111+/v "/w A/m -'j/q rtR/ns "/w ,w "/w --4f4/n
I4/ns
fi/n
J7)/p
,9R/v
.
/w
tT/v
W5/d
jjt/v
3/v
"/w
0j/u
Ni*/n
,
/w
Jt/v
/w PiL'fr-Ij'/I 1w /w -/r cii f/v
/v , /w */v fh/v 11k/n , /w 4Mf/v
f4*/vn , /w QA/v At/n EAf/ad
I1/v Oj/u
R-/d Ft //v ThIjF/r
4;/a C/u ffj-i/vn , /w *4 1/v AjM/n fiP/n 4Th/n 4Zjif/vn
A/v , /w AM/v , /w AA/v j/u * H/d 4 !/v , /w #/c
/n/m rti/ns fj/u W_)Q]/n -F/f MjW/n Aff/v 4$1j/vn In/u /P1V/a ufSF/vn
E/p --
/w
19980101-01-001-013/m
4fi/v -A4Z/n , /w H /d
jtj9/a 0j/u ffn/n
jj/n I$*/vn
,
Wjl/n @/u tR/vn -N19±/v 9194/n f)IA/n . /w /A/c,
/w IEI/d M/p 6jf/r
/w pJL/f 14'1/f (9,j/u 2;9±/n ?Vpp/n 4914/v #t1/v ; /w 9.1 1/n 9145/vn t9±//
A 4/vn,
WT/a
4R/n
/w HV1r)]JP/i O/u
/w tfW/n AI/d T/d
/w T/d 'iIE/a T/d
rt/p *711/a 0/u
; /w &RFVt1+9/
t/a /u H6/a Wj/u MPTl,/n i&k9/n /M1/n #F1/n a/d &3/Wv Vr/a /vn
Rj/vn r1/f 0/d
-T/v 15y/n 1iib9±/n ; /w Aj,/n Wj/u h9/vn 4/c &H/vna/d 1ilr/v ##f1/q
MFT1/n P±9M/n
9 19/vn 1/c V1a/vn . /w f 111/n 4/c k)H/vn W/u 4r9/n I/v 41//a Wj/u , /w 2 1/m ft9//n /i /d I/v 9±V/v
tT/n O/u
±/H/n . /w f9/c 49±/v 4/u
&M/n T/d W/v t/d T/d 4/v -4*1LF)A/i , /w 491 /n 9±/v 9±/n
FT1/n i6/n
N /v /IE/a Ag/a 4/u
- /r
Ak/n f./v A-1/d R/ /a La*/v , /w J±M19/d 4)j/v * j/d
9±9h/n 9/a #F/n , /w
19980101-01-001-014/m [liM/ns RIf/n]nt M/d /914/v 9±4/v #ff/v jtA1 /i W/u 114/n A34/n 1+M/n , /w
H/p 1149/I
E/m
j /q W+31111/n 4/u
Ma/n
_f/f
9±,/i/ad QM/v 91/p (±W/n
Wl/r 4/u
f/a 44/n . /w
rPM/ns 9±9±/v huN9/v .1/p J(9+9±/nt f/c 44/r 9T/n /l94/n 4/u t),ii1/vn , /w QA/v 9E/p /t)/v 9±9±/j f4/n
3PL/vn , /w R4P/v qfj/n , /w r [;,/v ,±F/an , /w /44/v +11+/n J1W1/vn 4/u X+1/n 4/u +11+/n -6fM/vn . /w
rli9/ns 49±/d 9±/v /91P/v ftW/n f 111/n 4/c V±9±/an 4/u 9±9±/a 92/n , /w r+1/ns Ak/n 9±/v 4/p ftw/n
K+1/r \,/n -9/d , /w Y1/p 9f+aj/v rrX/a fqT/n , /w 4911/d 9M/v 4/u 9±/a 9±/H/n i/i/c Ttv 9±/
! /w
19980101-01-001-015/m 9±/p A/r n±194r// 4/u /ff/a 64/ilJ/n , /w R/r ?R/v t*/r V4:-/t 1t$19/a , /w *bl/n
4/ 9/a
! /w
19980101-01-001-016/m 9i+9/v
E3
! /w (/w T±P/$/nt JL9±/ns 1 2f-/t 3 1 H/t */n ) /w
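The sample above follows the PKU convention of suffixing every token with a slash and a part-of-speech tag, and the frequency tables E4 and E5 later in this appendix tally such tokens by tag and word length. A minimal sketch of both steps; the sample line is invented for illustration, since the Chinese text of the excerpt did not survive scanning:

```python
from collections import Counter, defaultdict

def parse_pku_line(line):
    """Split a PKU-style tagged line into (word, tag) pairs.

    Each token has the form word/tag; rsplit on the last '/' so a
    word that itself contains '/' keeps its text intact.
    """
    pairs = []
    for token in line.split():
        if "/" not in token:
            continue  # skip stray fragments
        word, tag = token.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs

def length_by_pos(pairs):
    """Tally word length (in characters) per POS tag, as in the
    frequency tables E4 and E5 below."""
    table = defaultdict(Counter)
    for word, tag in pairs:
        table[tag][len(word)] += 1
    return table

# Hypothetical line in the same word/tag format as the appendix data.
pairs = parse_pku_line("19980101-01-001-009/m one/m country/n ,/w two/m systems/n ./w")
print(length_by_pos(pairs)["n"])
```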
MET 2 and BBN data
<DOC>
<METID> 001 </METID>
<ID> raw/950715/tg/gnxw0718.15 </ID>
<DATE> 1995 07 18 </DATE>
<HL> [headline illegible in scan] </HL>
<AU> Domestic News Dept. </AU>
<TEXT>
<p>
[Body text: a Chinese news story annotated with ENAMEX (TYPE "PERSON", "LOCATION", "ORGANIZATION"), TIMEX (TYPE "DATE"), and NUMEX (TYPE "MONEY", "PERCENT") elements. The Chinese characters of this example did not survive scanning and are omitted here.]
</TEXT>
</DOC>
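The MET-2 markup above encloses each name in an ENAMEX, TIMEX, or NUMEX element whose TYPE attribute carries the category (PERSON, LOCATION, ORGANIZATION, DATE, MONEY, PERCENT). A short sketch of recovering those spans with a regular expression; the sample string is invented, since the real data is Chinese:

```python
import re

# Matches <ENAMEX TYPE="...">...</ENAMEX> and the TIMEX/NUMEX variants;
# the backreference \1 keeps open and close tags paired.
TAG_RE = re.compile(
    r'<(ENAMEX|TIMEX|NUMEX)\s+TYPE="([^"]+)">(.*?)</\1>', re.S
)

def extract_names(sgml):
    """Return (element, TYPE, text) triples for every tagged span."""
    return [(m.group(1), m.group(2), m.group(3)) for m in TAG_RE.finditer(sgml)]

sample = ('<ENAMEX TYPE="LOCATION">Beijing</ENAMEX> hosted '
          '<ENAMEX TYPE="ORGANIZATION">Xinhua</ENAMEX> on '
          '<TIMEX TYPE="DATE">July 18</TIMEX>.')
print(extract_names(sample))
```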
E4
Frequency data from PKU People's Daily project
There are 337724 words of the length 1.
There are 492879 words of the length 2.
There are 51116 words of the length 3.
There are 19375 words of the length 4.
There are 5059 words of the length 5.

[One row label in the table below was illegible in the scan and appears as "?".]

POS     Total   1 Char  2 Char  3 Char  4 Char  Other
Ag        321     321       0       0       0      0
na          1       0       0       1       0      0
Mg          7       7       0       0       0      0
?           1       0       1       0       0      0
vd        528      17     501      10       0      0
nr      35262   18537   13927    1415     623    760
Yg          1       1       0       0       0      0
a       34473   12834   21577      54       8      0
ns      27890       0   18584    7297    1354    655
nt       3573       0       7    3118     416     32
b        8715    1913    6120     646      26     10
c       25479   18923    6228     176     152      0
d       47811   29845   16846    1117       3      0
vn      42734     127   41747     857       3      0
e          27      23       4       0       0      0
ad       5933     496    5435       0       2      0
nx        452     112      49     133      50    108
f       17201   12627    4499      75       0      0
h          47      47       0       0       0      0
nz       3700       0    2090    1034     437    139
i        5033       0       0       8    4966     59
Bg          8       8       0       0       0      0
j       10323    3723    3896    2169     518     17
k         928     928       0       0       0      0
Dg        135     135       0       0       0      0
l        6015       0      26     686    5104    199
n      236810   16846  186882   26800    5258   1024
an       2837      68    2769       0       0      0
o          70       9      51       1       9      0
p       39925   36339    3585       1       0      0
q       24244   22048    1909     235      40     12
r       32327   13800   18395     132       0      0
s        3836       0    3655     179       2      0
t       20680       0   14019    3579     177   2905
Ng       4562    4562       0       0       0      0
u       74829   74363     466       0       0      0
v      184764   64953  118616    1176      17      2
y        1889    1857      32       0       0      0
Rg         10      10       0       0       0      0
z        1391       0     963     217     210      1
Tg        496     496       0       0       0      0
Vg       1749    1749       0       0       0      0

E5
Frequency data from U. of Penn. Treebank project
There are 27187 words of the length 1.
There are 47216 words of the length 2.
There are 6330 words of the length 3.
There are 857 words of the length 4.
There are 468 words of the length 5.

POS     Total   1 Char  2 Char  3 Char  4 Char  Other
NR       7664     558    4555    2077     314    160
NT       2651      14    1783     378     121    355
SP         17      15       2       0       0      0
BA        112     112       0       0       0      0
AD       4980    2668    2120     182      10      0
CC       1974    1852     122       0       0      0
CD        239       4     177      27       1     30
DEC      2252    2252       0       0       0      0
VV      12170    2424    9516      90     137      3
M        3226    2692     494      25      15      0
LB         23      23       0       0       0      0
LC       1365    1095     270       0       0      0
DEG      2227    2227       0       0       0      0
MSP       150     142       8       0       0      0
P        3879    3493     382       4       0      0
OD         29      29       0       0       0      0
JJ       3271     788    2289     189       5      0
AS        961     961       0       0       0      0
SB         43      43       0       0       0      0
CS         65      14      51       0       0      0
VA       1155     345     754       6      50      0
DT       1372    1219     153       0       0      0
VC        731     731       0       0       0      0
VE        391     379      12       0       0      0
DER         4       4       0       0       0      0
ETC       343     341       2       0       0      0
NN      29890    2099   24226    3351     204     10
FW          8       4       3       1       0      0
PN        864     567     297       0       0      0
DEV        92      92       0       0       0      0

E6
Organization part-of-speech results from People's Daily

Count   POS sequence
337     nz n
423     n
158     ns n n n
130     ns ns n
216     ns j
117     n n n
81      nt j
92      ns n vn n
1364    ns n
57      n n n n
160     nt n
89      ns nt
60      ns ns n n
130     j n
64      ns vn n
70      ns nz n n
2487    nt
135     ns nz n
576     ns n n
342     n j
69      nz n n

E7
Phrase Tagged People's Daily Data (using manually generated rules)
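The phrase-tagged data below was produced by manually generated rules, and Table E6 above lists the part-of-speech sequences that most often make up organization names (a place name ns followed by nouns, a lone nt, and so on). A toy version of one such rule as a scan over (word, tag) pairs; the pattern set and sample tokens are illustrative assumptions, not the actual rules used:

```python
def chunk_orgs(tokens):
    """Group a leading ns/nz/j token followed by n/vn tokens, or a
    lone nt token, into candidate ORGANIZATION chunks (a toy version
    of the manually generated rules)."""
    chunks, i = [], 0
    while i < len(tokens):
        word, tag = tokens[i]
        if tag == "nt":                      # full org name as one token
            chunks.append([word])
            i += 1
        elif tag in ("ns", "nz", "j"):       # place name / proper noun / abbreviation head
            j = i + 1
            while j < len(tokens) and tokens[j][1] in ("n", "vn"):
                j += 1
            if j > i + 1:                    # require at least one trailing noun
                chunks.append([w for w, _ in tokens[i:j]])
            i = j if j > i + 1 else i + 1
        else:
            i += 1
    return chunks

# Hypothetical English stand-ins for Chinese tokens.
tokens = [("Beijing", "ns"), ("University", "n"), ("said", "v"),
          ("StateCouncil", "nt")]
print(chunk_orgs(tokens))  # -> [['Beijing', 'University'], ['StateCouncil']]
```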
<DOC>
<HEADER>
<DATE>19980101-01-001</DATE>
Text originally from People's Daily 1998
Segmentation and POS tagging done by PKU ICL
Modification and LEX tagging done by Daniel Hu at MITRE NLP
April 2001
TAGSET USED: PKU
</HEADER>
<BODY>
<TXT>
[Body text: the 1 January 1998 People's Daily editorial, divided into paragraph and sentence units, with ENAMEX TYPE="ORGANIZATION" phrase tags inserted by the manually generated rules. The Chinese characters of this example did not survive scanning and are omitted here.]
</TXT>
</BODY>
</DOC>
E8
Correctly tagged People's Daily data
<DOC>
<HEADER>
<DATE>19980101-01-001</DATE>
Text originally from People's Daily 1998
Segmentation and POS tagging done by PKU ICL
Modification and LEX tagging done by Daniel Hu at MITRE NLP
April 2001
TAGSET USED: PKU
</HEADER>
<BODY>
<TXT>
[Body text: the same 1 January 1998 People's Daily editorial with hand-corrected ENAMEX TYPE="ORGANIZATION" tags. The Chinese characters of this example did not survive scanning and are omitted here.]
</TXT>
</BODY>
</DOC>
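Hand-corrected data such as E8 serves as a gold standard against which rule-tagged output such as E7 can be scored. A minimal precision/recall/F1 sketch over sets of (TYPE, text) name spans; the spans below are invented for illustration:

```python
def prf(gold, predicted):
    """Precision, recall, and F1 over sets of (TYPE, text) name spans,
    comparing rule output against hand-corrected reference data."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Invented spans: one of two predictions is correct, one of two gold
# spans is found.
gold = [("ORGANIZATION", "StateCouncil"), ("ORGANIZATION", "PeoplesDaily")]
pred = [("ORGANIZATION", "StateCouncil"), ("LOCATION", "Beijing")]
print(prf(gold, pred))  # -> (0.5, 0.5, 0.5)
```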