Minimally Supervised Named Entity Recognition in French

Yu CHEN and Jennifer PLANUL
Course Project for “NLP Applications”
Submitted to: Dr. Claire Gardent
Supervised by: Dr. Claire Gardent, PD Dr. Guenter Neumann (DFKI), Christian Spurk (DFKI)
Contents

1 Introduction
2 System Architecture
3 Implementation
  3.1 Preprocessing
  3.2 Chunking and Transformation Interface
    3.2.1 Chunking
    3.2.2 Noun Group Transformation
    3.2.3 Transformation Rules
  3.3 Named Entity Classifier
    3.3.1 The Seed-driven Classifier
    3.3.2 Seed lists
4 Evaluation
  4.1 Experiment Setup
  4.2 Corpus Selection
  4.3 Evaluation Strategy
  4.4 Results
5 Discussion
  5.1 Problems
  5.2 Future work
1 Introduction
Named entities refer to those entities “for which one or many rigid designators stands for the referent” (cf. Wikipedia). Named entity recognition (NER) involves identifying and classifying the NEs. It is essential for many other natural language processing tasks. NEs not only carry interesting semantic information but also facilitate later processing, since a multi-word NE can be treated as a whole. In practice, getting NEs right can help a statistical machine translation system improve by up to 10% in terms of BLEU score. Although NER is only a subtask of information extraction (IE), it is not trivial to solve. Several difficulties have been discussed in the course. One important cause is that names are an open class, so that any word can be used to name something.
There are two main approaches to the NER problem: hand-crafted rules and machine learning. Rule-based systems may be very efficient and accurate, especially for specific domains, but they are expensive to build. Among the machine learning methods, the supervised approach requires a large amount of annotated data, which is also expensive. Much recent research has focused on unsupervised or semi-supervised learning methods, e.g. the seed-driven bootstrapping approaches introduced in [1, 13, 2]. An English NER system has been developed by [10] and made available online. However, most research in this area targets English and, as far as we know, no semi-supervised method has been studied for French NER tasks. Even the other kinds of approaches are not so well studied for French. Both Sprout and Tagen [8] apply finite-state techniques to NER.
We aim to adapt the English NER system kindly provided by Christian Spurk and Dr. Günter Neumann from DFKI to French and to carry out evaluations on the new system. The adaptation includes data selection, reconstruction of the architecture and adaptation of the existing code. As for the evaluation, we need to develop our own evaluation method. We are grateful for the support of our supervisors, Dr. Claire Gardent, Dr. Günter Neumann and Christian Spurk, throughout the whole project. We have learnt a lot through it.
In this report, we first present the architecture of our system in Section 2. The implementation details are
described in Section 3. Section 4 introduces our evaluation strategy together with all the results from the
experiments. The report finishes with a brief discussion, including the problems and future directions, in
Section 5.
2 System Architecture
[Figure 1: Architecture of the English Named Entity Recognition System — raw text passes through a tokenizer, a noun group chunker and a noun group transformer (driven by NG transformation rules) to the GN classifier (with its models and seeds), which outputs classified names.]
Fig. 1 shows the minimally supervised English NER system introduced by [10] (referred to as “the English system” below). The system is composed of two main components: the rule-based noun group chunker, which acts as the NE recognizer, and the trie-based classifier trained with bootstrapping techniques. As for input and output, there are several interfaces designed for different formats of existing text resources, including some existing test suites.
The architecture of our system is largely based on this system. However, there are still some differences between our system and the English system. Unlike the original self-contained solution, our final system includes several external modules, as we aim at adapting the English NER system in an economical way. The interfaces between modules are then built according to the input/output format and encoding of each module. In addition, a new test suite is developed as part of the system due to the lack of suitable evaluation resources.
The main building blocks of the architecture (illustrated in Fig. 2) are:
Preprocessing The original PASSAGE [4] corpus is made of XML documents. The next module takes raw
texts as input, hence the XML tags must be removed. Besides, it is important to sample the corpora,
as any of the corpora we can access is actually larger than necessary. Section 3.1 gives more details
on this module.
TreeTagger Instead of developing an internal NP chunker, we take TreeTagger [9] as the chunker [6] in the system. Note that TreeTagger is the only module that works on Latin-1 encoding (the other modules use UTF-8), so that encoding conversions are needed before and after applying it. The description of this module can be found in Section 3.2.1.
Noun Group Transformer The rule-based noun group transformer is used to modify the noun phrases produced by the chunker into smaller noun groups that are closer to NEs. In the original design, the transformer is an optional component inside the classifier. It is made separate from the classifier in order to reduce the computation time. This module is mandatory in our system, as it can also eliminate the chunker’s errors to a certain extent. Section 3.2.2 describes this module; the details of the applied rules are introduced in Section 3.2.3.
Generalized Name Classifier The classifier in our system remains almost the same as the one implemented
in the English system. More details can be found in Section 3.3.1. We also develop the seed lists
for different types of NE’s used during the training of the classifier. The seeds are described in
Section 3.3.2.
Evaluation As mentioned, we have developed our own test suite for evaluation. Further information is included in Section 4.3.
3 Implementation

3.1 Preprocessing
Many of the corpora are XML documents or documents in other formats. Cleaning up a corpus is essential for later processing. It includes removing the XML markup and noisy characters, e.g. extra spaces, and sometimes identifying sentence boundaries. Moreover, a corpus may consist of multiple files. The texts from different files are saved into one single file to facilitate the following procedures.
[Figure 2: Architecture of the French Named Entity Recognition System — XML documents are cleaned up into raw text, chunked by TreeTagger, transformed by the noun group transformer (with the transformation rules) and classified by the GN classifier (with the seed list); the classified names feed the evaluation.]
For each of the corpora, a clean-up script is implemented in Python with the re (regular expression operations) module. Different operations are required for different corpora. For the XML files in the ESTER [5] corpus, we remove the XML markup and also some noisy characters: “^^” before unannotated names and “*” before oral corpus problems (repetition, stammering, missed words, ...). In the “Le Monde Diplomatique” [4] corpus, there are underscore line separators that need to be replaced and reference numbers between brackets to be removed. As for the XML files of the WikiNews [4] corpus, we need to remove many useless XML nodes, various types of markup and noisy characters.
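As an illustration, the following is a minimal sketch of such a clean-up step for an ESTER-style file (in modern Python); the file names and the exact patterns are simplified stand-ins, not the ones used by the actual scripts.

```python
import re

def clean_ester_line(line):
    """Strip markup and noisy characters from one line of an ESTER-style file."""
    line = re.sub(r"<[^>]+>", " ", line)       # remove XML/SGML markup
    line = line.replace("^^", " ")             # marker before unannotated names
    line = line.replace("*", " ")              # marker before oral-corpus problems
    return re.sub(r"\s+", " ", line).strip()   # collapse extra whitespace

# Concatenate the cleaned lines of one corpus file into a single raw-text file.
with open("ester_sample.xml", encoding="utf-8") as src, \
     open("ester_clean.txt", "w", encoding="utf-8") as dst:
    for raw_line in src:
        cleaned = clean_ester_line(raw_line)
        if cleaned:
            dst.write(cleaned + "\n")
```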
It is not feasible to experiment over the entire set of all accessible corpora. How the corpora are selected for the experiments is described in Section 4.2. Even if we narrow down to a single corpus of smaller size, there are still many problems. First, except for the ESTER corpus, none of the other corpora is designed for NER tasks. Named entities are much less frequent than in the CoNLL corpus used for the English system. Sentences without any NEs may bring too much noise into the training process. Second, if we want to study the performance of training over data of variable size, it is necessary to ensure that we can draw conclusions over different subsets of the corpus just as confidently as if the entire corpus had been used.

The solution to the first problem is filtering. NEs in French usually contain at least one capitalized word, thus capitalized words that are not at the beginning of a sentence are good indicators of sentences with some NEs. These sentences are certainly kept for later use. This does not mean we simply remove all the others. It is still possible that the first word belongs to an NE, which cannot be detected so simply. A small amount of sentences without any NEs can still give some negative evidence on the behavior of NEs. Therefore, we only filter out a portion, e.g. 80%, of the sentences without any capitalized words inside.
As for the second problem, we employ a random sampling method to extract data from a corpus. In practice, the filtering and the sampling are performed on chunked text rather than raw text in order to avoid unnecessary repetition of chunking. The modules after the chunker do not consider contexts across sentence boundaries, therefore the samples do not have to be contiguous. We first extract 500 sentences randomly from each (filtered) corpus as the test sets. Among the remaining sentences, we randomly select, in the same manner, a training set of a given size.
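A possible implementation of this filtering and sampling step is sketched below; the 80% drop rate is the one mentioned above, while the function and parameter names are ours.

```python
import random

def has_inner_capital(tokens):
    """True if any token after the first one starts with an upper-case letter."""
    return any(tok[:1].isupper() for tok in tokens[1:])

def filter_and_sample(sentences, test_size=500, train_size=2000, drop_rate=0.8):
    """Keep all sentences with inner capitalized tokens, keep only a fraction of
    the others, then draw disjoint random test and training samples."""
    kept = [s for s in sentences
            if has_inner_capital(s) or random.random() > drop_rate]
    random.shuffle(kept)
    test_set = kept[:test_size]
    training_set = kept[test_size:test_size + train_size]
    return training_set, test_set
```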
3.2 Chunking and Transformation Interface

3.2.1 Chunking
The recognition task usually starts with identifying the noun phrases. In the English system, this is accomplished by a set of regular expression rules that split a sentence into noun groups (more general than noun phrases). Due to the time limit, we chose to use an existing French chunker included in the TreeTagger program. The other components of the system work with UTF-8 encoding, but the TreeTagger chunker only accepts plain Latin-1 text as input and outputs the tagged text in the same encoding. We need to convert the encoding before and after the application of TreeTagger. This is simply done using the Unix program iconv.
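The whole round trip can be scripted as in the following sketch; the path of the TreeTagger chunker command is an assumption, and the Latin-1 conversion shown here in Python is equivalent to the iconv calls we actually use.

```python
import subprocess

def chunk_french(utf8_in, utf8_out, chunker_cmd="cmd/tagger-chunker-french"):
    """Convert UTF-8 input to Latin-1, run the TreeTagger chunker on it and
    convert its output back to UTF-8 (chunker_cmd is an assumed path)."""
    with open(utf8_in, "r", encoding="utf-8") as f:
        latin1_input = f.read().encode("latin-1", "replace")
    result = subprocess.run([chunker_cmd], input=latin1_input,
                            stdout=subprocess.PIPE, check=True)
    with open(utf8_out, "w", encoding="utf-8") as f:
        f.write(result.stdout.decode("latin-1"))
```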
3.2.2 Noun Group Transformation
The noun phrases given by the chunker are far from what we call named entities. In many cases, an NE is only a subgroup of the words inside the respective noun phrase. To discover these word groups, we adapt the noun group transformation technique implemented in the English system.

The noun group transformation component, a.k.a. the “transformer”, transforms noun groups, i.e. noun phrases and their subgroups, into smaller ones. It consists of a set of transformation rules represented by regular expressions. Eventually, the chunks from the input are transformed into a set of noun groups, some of which are NEs.
During the transformation process, all the rules are applied to each given noun group in a sequence according to the assigned priorities, and the number of noun groups increases through the process. If there are too many rules or too many noun groups, the computation of this module delays the whole system significantly. When the system follows the original design and embeds this process in the training cycle, it takes over 26 hours to train on the French “JRC-ACQUIS” corpus, which is composed of extremely long sentences.
However, we cannot omit the transformation, because this process not only locates true NEs but also helps to correct many errors from the chunker. It is not practical to redesign the component all over again, either. In order to minimize the impact, we isolate the transformer from the classifier to the maximal extent. The chunked texts are transformed into the same texts annotated with different subgroups simply based on the rules. The resulting texts are saved directly to the hard disk so that they can be reused as needed. With the present settings, training over a set of 200,000 sentences can be finished in around 15 minutes.
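The application loop itself can be pictured roughly as follows; the three rules shown are simplified placeholders for the real rule set described in Section 3.2.3, and the use of “|” as a split marker is our own convention for the sketch.

```python
import re

# (priority, pattern, replacement); a "|" in the replacement marks a split point.
RULES = [
    (1, re.compile(r"^[\s\-,;.]+|[\s\-,;.]+$"), ""),            # trim edge punctuation
    (2, re.compile(r"\s*;\s*"), "|"),                            # split at inner punctuation
    (3, re.compile(r"^(les?|la|une?)\s+", re.IGNORECASE), ""),   # drop a leading article
]

def transform(noun_group, rules=RULES):
    """Apply all rules in priority order; a rule may split a noun group,
    so the result is a list of possibly smaller noun groups."""
    groups = [noun_group]
    for _priority, pattern, repl in sorted(rules, key=lambda r: r[0]):
        next_groups = []
        for ng in groups:
            ng = pattern.sub(repl, ng)
            next_groups.extend(part.strip() for part in ng.split("|") if part.strip())
        groups = next_groups
    return groups

# transform("les autres États ; néanmoins") -> ["autres États", "néanmoins"]
```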
3.2.3 Transformation Rules
The construction of the transformation rules starts with the guidelines given for the English system. Many changes are made because of differences between the languages or between the two systems.

Many English constructions are not seen in French. For instance, there is no genitive marker in French. Compound noun phrases in English are expressed with noun phrases with post-modifiers in French. There are no phrases like the "ORG coach/editor/chief/... PER" constructs to split.

The following operations either have been done by the TreeTagger chunker or have never been seen in the corpora, so that it is unnecessary to repeat them in the transformation process:
• to remove the number tokens following/preceding tokens with a capital letter.
• to split conjunctions that actually are two separate NGs. For example, “Abou Ghraib et Guantanamo” is already annotated as two NPs by the chunker.
• to remove dates from the end of any NG; such dates are never seen in the data.
Many new rules are added particularly for French phenomena. We also try to use such rules to recover
some of the errors the chunker has made. The rules are sorted by priority. The final rules are listed with
examples as follows:
1. Removing the punctuation and spaces remaining at the beginning and the end of the NG:
   (Tiers Monde -)NG → (Tiers Monde)NG

2. Removing the punctuation inside the NG and splitting it into two NGs:
   (un redressement spectaculaire ; néanmoins)NG
   → (un redressement spectaculaire)NG (néanmoins)NG

3. Removing determiners from the beginning of any NG, which includes three rules:
   • one for “tout(e)”, “tou(te)s”:
     (tous les autres États)NG → (les autres États)NG
   • another for articles (“le”, “un”) and demonstrative and possessive adjectives (“cet”, “votre”):
     (les autres États)NG → (autres États)NG
   • and the last one for “autre(s)”:
     (autres États)NG → (États)NG

4. Removing titles which begin with an uppercase letter and do not belong to person names:
   (M. Blair)NG → (Blair)NG
   (Président Bush)NG → (Bush)NG

5. Tearing apart the lower-case tokens that precede upper-case tokens:
   (gouvernement Bush)NG → (gouvernement)NG (Bush)NG
   (co-président d’Alcatel)NG → (co-président)NG (Alcatel)NG
   (Chine de l’ après-Mao)NG → (Chine)NG (après-Mao)NG

6. Tearing apart location adjectives from country names (sometimes this is already done by the chunker):
   (Corée du Nord)NG → (Corée)NG (Nord)NG
   (Europe de l’ Est)NG → (Europe)NG (Est)NG
   (Guinée-équatoriale)NG → (Guinée)NG (équatoriale)NG
Note that some rules may break NEs into pieces. When the first word in the phrase “mer du Nord” is mistakenly capitalized, it will be split into two parts: “Mer” and “Nord”. There are many special location names beginning with a lower-case word, such as “golfe Persique” and “océan Atlantique”, which will be broken by the fifth rule. The rules may work differently on different corpora.
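For illustration, a possible regular-expression form of the fifth rule might look like the following sketch; the character class and the behaviour on elided forms such as “d’Alcatel” are simplified compared with the rule we actually use.

```python
import re

# Sketch of rule 5: split leading lower-case tokens off from a following capitalized token.
LOWER_THEN_UPPER = re.compile(
    r"^([a-zàâçéèêëîïôöûùüÿœ'’\- ]+?)\s+([A-ZÀÂÇÉÈÊËÎÏÔÖÛÙÜŸŒ].*)$")

def split_lower_upper(noun_group):
    """Return [lower-case part, capitalized part] if the rule applies."""
    match = LOWER_THEN_UPPER.match(noun_group)
    if match:
        return [match.group(1).strip(), match.group(2).strip()]
    return [noun_group]

# split_lower_upper("gouvernement Bush") -> ["gouvernement", "Bush"]
# split_lower_upper("Corée du Nord")     -> ["Corée du Nord"]  (no leading lower-case part)
```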
3.3 Named Entity Classifier

3.3.1 The Seed-driven Classifier
The Generalized Name (GN) classifier in the English system is implemented with a seed-based bootstrapping algorithm. Given a tokenized sentence, the classifier annotates each marked noun group with an estimated category according to the model trained with unannotated texts and a small list of classified seeds.
The basic assumption of this classifier is that the spelling (internal) features and the contextual (external) features of a noun group, a potential NE, provide evidence for the category it belongs to [3]. This assumption still holds for French. For instance, if most all-caps words fall into the category ORG, then “EDF” is more likely to be an organization name. Also, if more person names start with “Jean-”, then both “Jean-Marie Lehn” and “Jean-Pierre Koenig” tend to be used as person names. There are surely some negative examples, such as “Jean-Coutu” as an organization name, hence we also need the contextual features. If more location names follow the word “en”, the noun “Bosnie” after “en” is also likely to be a location. As with the spelling features, many constructions can be seen with multiple categories. The two types of features should therefore be considered interactively.
The features of a noun group are encoded with four tries for an arbitrary number of categories. There are two morphological tries to keep the spelling features, one for prefixes and another for suffixes. The contextual features are stored in another two contextual tries, one for the suffix of the left context and another for the prefix of the right context. Each of the nodes in the tries contains structured frequency information with respect to the categories. The leaf nodes from different tries can be connected via links to indicate the relations between the internal patterns and the external patterns. The frequencies of such links are also counted.
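The following sketch shows the flavour of such a trie, reduced to a single morphological trie with per-category counts; the class and method names are ours and much simpler than the actual Java implementation.

```python
from collections import defaultdict

class TrieNode:
    """One character node; counts how often the path down to this node has been
    observed together with each NE category (LOC, PER, ORG, ...)."""
    def __init__(self):
        self.children = {}
        self.category_counts = defaultdict(int)

class FrequencyTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, text, category, weight=1):
        """Insert a prefix (or a reversed suffix) and update the category
        frequencies along the whole path."""
        node = self.root
        for char in text:
            node = node.children.setdefault(char, TrieNode())
            node.category_counts[category] += weight

    def best_category(self, text):
        """Follow the longest matching path and return its most frequent category."""
        node = self.root
        for char in text:
            if char not in node.children:
                break
            node = node.children[char]
        if not node.category_counts:
            return None
        return max(node.category_counts, key=node.category_counts.get)
```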
The tries are first built from the unannotated training corpus. When seeds are entered into the tries, the node frequency information is adjusted by bootstrapping. If a path of a seed in the morphological tries reaches a sufficiently high threshold probability for a certain category, then all the contexts linked to this path are re-estimated for the category. In return, if one of the contexts reaches the threshold, all linked morphological paths need to be re-estimated. This recursive procedure continues until no more adjustments are possible. The modified tries are used as the model for classification. Given an NG from the target corpus, the classifier then assigns the most probable category, i.e. the most frequent one for the node corresponding to the NG in the final tries.
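The cycle can be summarized by the following sketch, which strips the four tries down to plain pattern/category count dictionaries; the data layout and the single fixed threshold are simplifications of the real algorithm.

```python
def bootstrap(spelling_counts, context_counts, links, seeds, threshold=0.9):
    """spelling_counts / context_counts: {pattern: {category: count}};
    links: {pattern: [linked pattern, ...]} connecting spelling and context patterns;
    seeds: [(spelling pattern, category), ...].
    Returns the set of (pattern, category) pairs accepted by the bootstrapping."""
    def probability(counts, pattern, category):
        freqs = counts.get(pattern, {})
        total = sum(freqs.values())
        return freqs.get(category, 0) / total if total else 0.0

    accepted = set()
    queue = list(seeds)
    while queue:
        pattern, category = queue.pop()
        if (pattern, category) in accepted:
            continue
        accepted.add((pattern, category))
        # re-estimate every pattern linked to the newly accepted one; the linked
        # patterns of a spelling pattern live in the context counts and vice versa
        for linked in links.get(pattern, []):
            other = context_counts if pattern in spelling_counts else spelling_counts
            if probability(other, linked, category) >= threshold:
                queue.append((linked, category))
    return accepted
```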
The core algorithm has been implemented in a language-independent manner. When adapting the system to French, we restrict our modifications to a minimum in order to avoid losing its function as an English NER system. As mentioned in previous sections, we do not use the internal chunker. In order to connect the existing training component with the external modules, we create a new interface (cf. FrenchReader in the Java code) for chunked texts, so that the chunks contained in the input source texts can be passed together with the tokenized sentences to the trie building block. Accordingly, a new output module is designed for French text. We also add an option to the training module that allows it to build tries without transforming the noun groups; the setting of this option depends on whether the input has already been transformed or not. Apart from these three changes, the program works perfectly with French input. We also added a pair of functions to save a trained model (the set of tries) into a file and to retrieve the model from the file (cf. GeneralizedNameClassifier).
In addition to the basic functions of the system, we provide a new extension to the training module. The input for this module in our system is the output from the TreeTagger chunker. It contains not only annotations of chunks but also part-of-speech (PoS) tags for all the tokens. Even if the PoS tags do not provide evidence for NE classification as strong as the spelling features and the contextual features, they can still be a backup reference when the decision is difficult to make. For example, it is less probable for a noun group with the PoS pattern “NAM PRP:det NOM” (Corée du Nord) to be a person name.

This is implemented with additional tries for PoS tags (cf. the PoSTrieBuilder class in the Java source). Each tag from the French TreeTagger tagset is encoded with one letter, since the information a tag holds is not changed by how it is represented. Four additional tries are built for the encoded PoS tag sequences, and the bootstrapping algorithm is applied to them in the same way. This is only a prototype implementation (cf. the PoSGeneralizedName class in the Java source). The PoS tag tries have not been linked to the original tries yet, and it is not clear yet how these tries should interact with the original ones. It is still an optional extension rather than an independent module. Some results from a preliminary experiment with this extension are discussed in Section 4.4.
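As a small illustration, the one-letter encoding could look like the following sketch; the letter assignments are ours, not the ones hard-coded in PoSTrieBuilder.

```python
# Illustrative single-letter codes for a few tags of the French TreeTagger tagset.
POS_CODES = {"NAM": "N", "NOM": "n", "ADJ": "a", "PRP": "p", "PRP:det": "d", "DET:ART": "t"}

def encode_pos_sequence(tags):
    """Map a PoS tag sequence to a compact string that can be stored in a trie,
    e.g. ["NAM", "PRP:det", "NOM"] -> "Ndn" (the pattern of "Corée du Nord")."""
    return "".join(POS_CODES.get(tag, "?") for tag in tags)
```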
3.3.2 Seed lists
The seed lists are built from the only annotated corpus, ESTER. We use a Python script to identify the annotated NEs, compute the frequency of each NE, and then build the entity-level lists. Although the ESTER annotation provides more fine-grained categories and subcategories, we only consider the three typical NE categories in the experiments: LOC (locations), PER (persons) and ORG (organisations). The NEs of the GSP (geo-socio-political groups) category in ESTER, in which country names are the majority, are also included in our LOC seeds. The other categories and the subcategories are ignored.

There are 46,167 annotated NE occurrences, which correspond to 6,770 different NEs, in the ESTER corpus. Only 11.8% of all the NE variants, 497 NEs, occur more than 10 times, but the occurrences of these NEs account for 72.3% of all occurrences. If ESTER were used for testing, a system trained with these NEs as seeds could easily achieve good performance which could not, in any way, reflect the real quality of the system. For this reason, we do not include ESTER for either training or testing but only use it to create the seed lists. We select the NEs that appear more than ten times as our seeds. Supposedly, these NEs can still be seen in the other news corpora, although they will not be as frequent as in ESTER.
In addition to the automatic extraction, we also manually clean up the lists. Some NEs were not correctly classified in ESTER. Some are included in the lists by the script several times when followed by different modifiers. We also remove ambiguous NEs. Finally, we obtain three lists with the following numbers of seeds: 266 in LOC, 272 in PER and 82 in ORG. The full cleaned lists, including NEs that appear fewer than 10 times, are also used in one of our experiments.
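The extraction itself can be sketched as follows; the tag syntax assumed for the annotated ESTER text and the category prefixes are simplifications of the real annotation scheme.

```python
import re
from collections import Counter

# Map the ESTER categories we keep onto our three seed categories.
CATEGORY_MAP = {"loc": "LOC", "gsp": "LOC", "pers": "PER", "org": "ORG"}

def extract_seeds(annotated_text, min_count=10):
    """Count the annotated NEs per category and keep the frequent ones as seeds."""
    counts = {"LOC": Counter(), "PER": Counter(), "ORG": Counter()}
    # assumes annotations of the form <pers> Jacques Chirac </pers>
    for tag, name in re.findall(r"<([\w.]+)[^>]*>([^<]+)</\1>", annotated_text):
        category = CATEGORY_MAP.get(tag.split(".")[0].lower())
        if category:
            counts[category][name.strip()] += 1
    return {cat: sorted(ne for ne, c in counter.items() if c > min_count)
            for cat, counter in counts.items()}
```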
4 Evaluation

4.1 Experiment Setup
All the components have been tested on an Intel Xeon machine (CPU: 2.66GHz×8, RAM: 16GB) running 64-bit Fedora Linux. Our programs are supposed to be platform-independent, but they have never been tested on Windows or MacOS.

The entire system (from the XML documents to the classified outputs) requires approximately 3.5GB of free hard disk space, not including the installation space of external software. The faster the CPU (1GHz recommended) and the larger the memory (2GB recommended), the better the system is going to perform. It takes 258MB of RAM to train on 5,000 sentences of moderate length (fewer than 30 tokens on average) and 2.26GB to train on 200,000 sentences. The required memory is almost linear in the size of the training data.

The system minimally requires JDK version 1.5 or later and Python version 2.4 or later. An installation of TreeTagger is necessary. Moreover, Tagen [8] is needed for a part of the evaluation.
4.2 Corpus Selection
In this project, we seek corpora with more named entities to allow not only effective but also intensive training of the classifier. After manual inspection of the PASSAGE corpus and many other corpora, we chose to investigate 6 French corpora more thoroughly: the French section of the “JRC-ACQUIS Multilingual Parallel Corpus” (Acquis) [11], the French section of the “European Parliament Proceedings Parallel Corpus 1996-2006” (Europarl) [7], the French section of the “News Commentary Corpus” (NC) [12], the “Évaluation des Systèmes de Transcription d’Émissions Radiophoniques” corpus (ESTER) [5], “Le Monde Diplomatique” (MD) and the French Wikinews corpus (Wikinews), the latter two from the PASSAGE [4] corpus.
Table 1 lists several parameters for each corpus, including the overall number of sentences (S), the number of tokens (T), the number of capitalized tokens that are not at the beginning of a sentence (T_C), the number of sentences containing such tokens (S_C), the number of noun phrase chunks (C_NP) and the number of tokens tagged with the “NAM” (name) tag (T_NAM). All these parameters are based on the outputs from TreeTagger.
Corpus     S        T         S_C     T_C      C_NP     T_NAM
Aquis      151356   8050576   86227   463249   2104925  266212
Ester      40306    1150133   24772   71457    283807   65529
Europarl   1323414  41087716  686862  1460232  9500316  889720
MD         938535   24426013  552019  1593612  6116565  1200536
NC         42159    1214523   23525   55416    293302   44101
Wikinews   18917    592394    12594   47273    153152   38194

Table 1: Corpus Parameters
The parameters together give some implicit evidence of how suitable a corpus is for our tasks. This evidence becomes clearer from the combinations of the parameters shown in Table 2.

Corpus     T/S     T_C/S   T_C/T   T_NAM/S  T_NAM/T  C_NP/S  C_NP/T  T_NAM/C_NP  T_C/C_NP
Aquis      53.190  3.061   0.058   1.759    0.033    13.907  0.261   0.126       0.220
Ester      28.535  1.773   0.062   1.626    0.057    7.041   0.247   0.231       0.252
Europarl   31.047  1.103   0.036   0.672    0.022    7.179   0.231   0.094       0.154
MD         26.026  1.698   0.065   1.279    0.049    6.517   0.250   0.196       0.261
NC         28.808  1.314   0.046   1.046    0.036    6.957   0.241   0.150       0.189
Wikinews   31.315  2.499   0.080   2.019    0.064    8.096   0.259   0.249       0.309

Table 2: Combinations of the Parameters

The depth of the tries built
during the training process is linear in the length of the sentences. On the other hand, only contexts of a maximal length of 3 are considered. Extremely long sentences may therefore increase the complexity unnecessarily. Hence, corpora with a moderate average sentence length (T/S) are more desirable. Capitalized tokens inside a sentence and tokens tagged with a “NAM” tag are more likely to be NEs. The ratio of these tokens over all the sentences/chunks/tokens reflects the estimated density of NEs in a corpus. In order to train the classifier more effectively, a higher ratio is better.

According to the data shown in the tables, Acquis and Europarl are not suitable because of their low NE density (per token) and their long average sentences. The ESTER corpus, a speech transcription, and Wikinews, with its unrecoverable noise, are removed from the list even though both of them seem promising in terms of the ratios. The annotated ESTER corpus is used for extracting the seed lists. Finally, the MD corpus and the NC corpus are chosen for both training and testing.
4.3 Evaluation Strategy
We randomly extract a set of 500 sentences from each of the two selected corpora as the test documents for the evaluation. The two sets are manually annotated with the three categories of NEs. We adopt a simplified MUC-like annotation scheme in the form of SGML text markup, in which an identified NE is marked up with inserted tags enclosed in angled brackets. The markup has the following form:

<enamex="CAT-VALUE"> text-string </enamex>,

where “CAT-VALUE” can be one of “LOC”, “PER” and “ORG”. The system also gives its output in this form. We also supply a script to transform the annotation in Tagen’s output into this form and to remove uninteresting tags, such as those for temporal expressions.

The uniform annotation scheme in the files from different sources allows us to apply the same script (cf. evaluate.py) for the evaluation task. The manually annotated test documents are used as references. There are 457 NEs annotated in the MD test document, 346 of which are not in the seed lists. As for the NC test document, 323 out of 662 NEs are not seeds.

We compute the precision, the recall and the f-score of a system output against the reference as usual. The scores are also computed for each of the NE categories. Furthermore, we zoom in on the NEs that do not appear in the seed lists so as to assess the learning ability of the seed-driven system.
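The scoring step of evaluate.py essentially reduces to comparing the two sets of annotated entities; a minimal sketch of the idea (ignoring positions and duplicate occurrences, which the real script has to handle) is given below.

```python
import re

ENAMEX = re.compile(r'<enamex="(LOC|PER|ORG)">\s*(.*?)\s*</enamex>')

def extract_entities(annotated_text):
    """Return the set of (category, entity string) pairs found in the markup."""
    return {(cat, ent) for cat, ent in ENAMEX.findall(annotated_text)}

def score(system_text, reference_text):
    """Precision, recall and f-score of the system annotation against the reference."""
    system = extract_entities(system_text)
    reference = extract_entities(reference_text)
    correct = len(system & reference)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(reference) if reference else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```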
4.4 Results
Several experiments are performed on the selected data so as to evaluate the system from different aspects. The results obtained from the system are compared with those of an existing system, Tagen, which is built with finite-state techniques. It is sometimes impossible to train on data from the same source as the target document; hence, we carry out an “out-of-domain” test with the data to get a rough view of this situation. In addition, we investigate the impact of two parameters on the performance. Finally, we perform a preliminary evaluation of the proposed PoS extension to the classifier. The results of these experiments are discussed in the rest of this section.
Comparison with Tagen We train the system with 2000 sentences and test it with another 500 sentences from the same source. The seeds used are the most frequent ones. Table 3 and Table 4 compare the results from our system to those from Tagen. Tagen does not depend on any particular frequent NEs, so its performance on the less frequent NEs is rather close to its overall scores. The difference shown by our system is much more pronounced. In particular, the recall may decrease by up to 90% (NC data) on the NEs not included in the seed lists. The system outperforms Tagen on NC data, but gives worse results on MD data. We cannot draw an absolute conclusion on which is better, but our system is promising to eventually beat Tagen with further improvements. Our system has low recall in general. This is mostly the result of unreliable recognition: many NEs never even reach the classifier. The system is more capable of handling location names and person names than organization names, as there are far fewer ORG seeds than seeds of the other categories.
         FrenchNER                                       Tagen
         Overall              Out-of-seed-lists          Overall              Out-of-seed-lists
Type     Prec.  Rec.   f      Prec.  Rec.   f            Prec.  Rec.   f      Prec.  Rec.   f
LOC      0.905  0.637  0.748  0.190  0.033  0.056        0.841  0.634  0.723  0.680  0.579  0.625
PER      0.667  0.219  0.330  0.391  0.083  0.137        0.541  0.241  0.333  0.480  0.222  0.304
ORG      0.576  0.254  0.352  0.625  0.060  0.109        0.091  0.022  0.036  0.091  0.036  0.051
All      0.826  0.473  0.601  0.346  0.058  0.099        0.730  0.429  0.540  0.522  0.310  0.389

Table 3: FrenchNER vs Tagen on NC data
         FrenchNER                                       Tagen
         Overall              Out-of-seed-lists          Overall              Out-of-seed-lists
Type     Prec.  Rec.   f      Prec.  Rec.   f            Prec.  Rec.   f      Prec.  Rec.   f
LOC      0.704  0.393  0.504  0.200  0.065  0.098        0.767  0.545  0.637  0.649  0.387  0.485
PER      0.562  0.164  0.254  0.517  0.143  0.224        0.391  0.327  0.356  0.378  0.324  0.349
ORG      0.455  0.041  0.075  0.444  0.034  0.063        0.250  0.065  0.103  0.250  0.068  0.107
All      0.661  0.243  0.355  0.346  0.078  0.127        0.587  0.363  0.449  0.459  0.260  0.332

Table 4: FrenchNER vs Tagen on MD data
Out-of-domain tests The out-of-domain tests are also performed on the NC-MD combination. The left part of Table 5 lists the scores of the system trained on MD data but tested on NC data, and vice versa for the right part. Both sets are news. Ideally, it would be better to use data from more dissimilar sources, e.g. training on MD but testing on Europarl. When the system is applied to a document different from the ones it was trained on, the performance decreases as expected. However, the decreases are not really pronounced. In fact, the recalls are generally better when the system is trained on a different set. One reason could be that the system may have to consider more evidence when none of the evidence dominates, which should be expected when the targets are so different from what the system has seen in training.
         MD → NC                                         NC → MD
         Overall              Out-of-seed-lists          Overall              Out-of-seed-lists
Type     Prec.  Rec.   f      Prec.  Rec.   f            Prec.  Rec.   f      Prec.  Rec.   f
LOC      0.895  0.629  0.739  0.048  0.008  0.014        0.679  0.406  0.508  0.224  0.089  0.127
PER      0.717  0.241  0.361  0.500  0.111  0.182        0.525  0.191  0.280  0.486  0.171  0.254
ORG      0.562  0.269  0.364  0.538  0.083  0.144        0.143  0.049  0.073  0.125  0.043  0.064
All      0.818  0.476  0.602  0.345  0.064  0.108        0.546  0.258  0.351  0.270  0.098  0.144

Table 5: Training and Testing on different data
          LOC                     PER                     ORG                     ALL
Size      Prec.  Rec.   f        Prec.  Rec.   f        Prec.  Rec.   f        Prec.  Rec.   f
100       0.370  0.467  0.413    0.239  0.236  0.237    0.250  0.164  0.198    0.319  0.330  0.325
250       0.505  0.458  0.480    0.361  0.200  0.257    0.286  0.082  0.127    0.450  0.295  0.357
500       0.567  0.449  0.501    0.345  0.182  0.238    0.333  0.082  0.132    0.492  0.287  0.362
750       0.599  0.431  0.501    0.405  0.136  0.204    0.500  0.066  0.116    0.558  0.263  0.357
1000      0.617  0.409  0.492    0.441  0.136  0.208    0.444  0.066  0.114    0.572  0.252  0.350
2000      0.732  0.400  0.517    0.548  0.155  0.241    0.600  0.049  0.091    0.689  0.247  0.364
5000      0.845  0.387  0.530    0.722  0.118  0.203    0.600  0.049  0.091    0.809  0.232  0.361
10000     0.879  0.387  0.537    0.824  0.127  0.220    0.500  0.033  0.062    0.847  0.230  0.361
25000     0.897  0.387  0.540    0.933  0.127  0.224    0.600  0.025  0.047    0.889  0.228  0.362
50000     0.897  0.387  0.540    1.000  0.127  0.226    0.750  0.025  0.048    0.904  0.228  0.364
100000    0.897  0.387  0.540    0.933  0.127  0.224    0.750  0.025  0.048    0.897  0.228  0.363
200000    0.897  0.387  0.540    1.000  0.127  0.226    0.750  0.025  0.048    0.904  0.228  0.364

Table 6: Training on different sizes of corpus
Training of Variable Size We extract training data of different sizes from the MD corpus. The test data remains the same as in the previous experiment. The results are given in Table 6. We expect the system to perform better with a larger training set. This turns out not always to be true. Figure 3 shows the learning curve of our system. The f-score actually drops when the training data grows from roughly the size of the test document to twice that size. The performance levels off rather than increasing further once the training data is more than 10 times the size of the test document.
[Figure 3: Learning Curve of the System — overall f-score for different sizes of training corpora, plotted against the training corpus size (100 to 100,000 sentences, log scale).]
Furthermore, the curves of the precisions and the recalls in Figure 4 also become stable after a certain size. Unlike the precision and the f-score, the recalls decline when the training set gets larger. This is because more noise is introduced into the system with a larger training corpus: non-NE noun groups are, after all, the majority in the data set. Too many negative examples seen in training lead the system to consider an NE a normal noun group more confidently. Remarkably, the precision comes very close to 100%, and is indeed reached for person names. This indicates the potential of our system and the quality of the classifier.
Full seed lists In the other experiments, we always use the seed lists composed of the most frequent NEs found in the ESTER data; for this experiment, all the NEs we found in that corpus are included in the seed lists. Table 7 shows the results of this experiment. There is an improvement of approximately 3% on each data set. Compared with Table 3 and Table 4, the recalls increase by over 10%. The more seeds we have, the more the system can learn from the training data and the more confidently it can assign a category to an NE.
         NC                                              MD
         Overall              Out-of-seed-lists          Overall              Out-of-seed-lists
Type     Prec.  Rec.   f      Prec.  Rec.   f            Prec.  Rec.   f      Prec.  Rec.   f
LOC      0.706  0.798  0.749  0.356  0.554  0.434        0.524  0.545  0.534  0.284  0.339  0.309
PER      0.628  0.358  0.456  0.500  0.259  0.341        0.463  0.227  0.305  0.442  0.219  0.293
ORG      0.413  0.388  0.400  0.287  0.269  0.278        0.200  0.081  0.116  0.184  0.076  0.107
All      0.639  0.624  0.631  0.359  0.368  0.364        0.466  0.344  0.395  0.297  0.213  0.248

Table 7: Experiments with full-size seed lists
[Figure 4: Comparison of performance on different sizes of training sets — (a) precision and (b) recall for different types of NEs (LOC, PER, ORG, All, and the overall Tagen score for reference), plotted against the training corpus size (100 to 100,000 sentences, log scale).]

The PoS extension We only carry out experiments with the PoS extension on two sets of training data, both extracted from the MD corpus. The overall performance generally decreases; compare the results shown in Table 8 with those in Table 6. The f-score drops from 0.325 to 0.301 for 100 sentences, and from 0.364 to 0.342 for 50,000 sentences. However, the performance on the out-of-seed NEs increases consistently: the f-score climbs from 0.135 to 0.147 for 100 sentences and from 0.060 to 0.070 for 50,000 sentences. Although the newly introduced PoS features may have confused the system, they still provide evidence to a certain extent, especially when the other features are not sufficient for a confident estimation (e.g. with a small training set). Further development on integrating the PoS features should be able to bring improvements.
Size→    100                                             50,000
         Overall              Out-of-seed-lists          Overall              Out-of-seed-lists
Type     Prec.  Rec.   f      Prec.  Rec.   f            Prec.  Rec.   f      Prec.  Rec.   f
LOC      0.366  0.464  0.409  0.098  0.153  0.120        0.891  0.366  0.519  0.286  0.016  0.031
PER      0.239  0.236  0.237  0.210  0.210  0.210        1.000  0.118  0.211  1.000  0.095  0.174
ORG      0.125  0.081  0.099  0.103  0.067  0.081        0.400  0.016  0.031  0.250  0.008  0.016
ALL      0.296  0.306  0.301  0.130  0.141  0.135        0.882  0.212  0.342  0.619  0.037  0.070

Table 8: Experiments with the PoS extension
5 Discussion
In Section 1, we said that the goal of this project is to adapt the English NER system to French and to carry out an evaluation of the new system. We successfully built the system and performed a group of experiments. The major contributions of this project are the following:

• We study the accessible corpora statistically.
• A new architecture is designed based on the available resources.
• A set of French transformation rules is developed and tested.
• The Java library from the English system is adapted to French texts.
• We introduce a test bench, evaluate the system and also compare it to an existing system.
Additionally, we also propose and implement several things that are not included in the original requirements:

• The transformer is separated from the classifier.
• PoS features are included in the training.
• We can serialize a classifier model.

Meanwhile, there are still many problems to be solved in the future. They are discussed in the following sections.
5.1 Problems
Most of the problems in the system come from the recognition part, namely the chunker and the transformer. In addition, there are other issues due to the lack of manpower/time and resources.
Recognition There is no accessible well-annotated corpus for us to re-train the chunker; we only use it as a black box. The chunker turns out to be both the most crucial and the weakest link in the system. In most cases, the shortest possible noun phrases are returned, but in French many named entities are expressed as complex noun phrases, which are therefore split apart by the chunker. Unfortunately, this separation cannot easily be recovered by the other modules.
The following two frequent types of NEs are destroyed by the chunker:

• NEs that start with a capitalized noun followed by lower-case adjective(s). For instance, “Union soviétique” is annotated by the chunker as (Union_NOM)_NP (soviétique_ADJ)_AP, whereas the desirable output (for our task) would be (Union_NOM soviétique_ADJ)_NP.

• NEs containing an all-lowercase prepositional phrase as a modifier. Such NEs are very common among organisation names, like “Organisation mondiale du commerce”. The chunker splits this phrase into (Organisation_NOM)_NP (mondiale_ADJ)_AP (du_PRP:det commerce_NOM)_PP, instead of (Organisation_NOM mondiale_ADJ (du_PRP:det commerce_NOM)_PP)_NP. This is indeed a dilemma, as sometimes only the head of such a complex noun phrase is a name.
It is not a trivial task to identify potential NEs from noun chunks by transformation. One rule, or a particular ordering of a set of rules, may improve the recognition of some NEs while damaging others at the same time. We test the transformation rules iteratively on the NC corpus. The errors in the output are manually inspected in order to further improve the rules. Such development cycles were carried out multiple times during the project. We tried to include rules as general as possible, but the decisions about the partially-good rules are still too difficult to make. We can hardly estimate the performance of the final set of rules on any other corpus. This may also explain the better overall performance on NC data.
Resources and time The English system uses three seed lists of approximately 1,800 unambiguous seeds each. This size is hard to achieve given that ESTER is the only source from which we can obtain seeds. Besides, the seeds of the different categories are not balanced, which is in fact reflected in the results. Similarly, the rule-based classification models that have been implemented in the English system could not be included in this project. Adding these models should result in an improvement of the entire system.

As for the evaluation, the test documents we use are relatively small, but it would take too long to annotate a larger corpus, especially when only one of us knows French. These documents are automatically extracted. The distribution of NEs in them may not be dense enough and balanced enough to give a comprehensive evaluation. There are no more than 700 NEs in either test document. That is, classifying one NE into a wrong category may lead to a 1% decrease in the measured performance. On the other hand, we also considered other evaluation strategies, including building a gold standard and manually inspecting the results. The former is not realistic in terms of time, and the latter would make it difficult to perform multiple groups of experiments.
5.2 Future work
This project provides a starting point for further work on the development of a minimally supervised French named entity recognition system. Such a system is interesting for both research and applications, but it also requires much more effort. There are many possible directions.

The first possible direction is to fully adapt the architecture of the English system. Several components, such as the chunker and the classifier models, were either excluded or replaced with alternatives in this project. Implementing these components for French tasks will certainly help to improve the performance. It also involves solving the problems mentioned in the previous section.
It will be interesting to look for a more comprehensive experimental methodology to evaluate the system. A common, reliable benchmark is important. It should be developed separately from the system development, and the benchmarks should themselves be evaluated. It is also important to perform more experiments to find out the impact of different parameters on the system. Consider the seed lists: understanding more clearly the relation between the size, the distribution and the source of the seeds should help decision making for real applications.
We have proposed a prototype for integrating the PoS features into the system. This is another possible direction for the future. One way to complete the design is to weight the estimates from the different tries, particularly the PoS tries. Linking the PoS tries to the original ones may also be a solution. Instead of increasing the number of tries, we could also embed the PoS features directly in the original tries, for instance by using the PoS features as factors on the nodes.
References
[1] M. Collins and Y. Singer. Unsupervised models for named entity classification. In the Joint SIGDAT
Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages
100–110, 1999.
[2] S. Cucerzan and D. Yarowsky. Language independent named entity recognition combining morphological and contextual evidence. In Joint SIGDAT Conference on EMNLP and VLC, 1999.
[3] S. Cucerzan and D. Yarowsky. Language independent named entity recognition combining morphological and contextual evidence. In 1999 Joint SIGDAT Conference on EMNLP and VLC, 1999.
[4] E. de la Clergerie. PASSAGE: Large scale production of syntactic annotations to move forward. ANR MDCA project proposal, 9 May 2006. http://atoll.inria.fr/passage/proposition.toc.html.

[5] ELRA. Évaluation des Systèmes de Transcription d’émissions Radiophoniques. http://catalog.elra.info/product_info.php?products_id=999, 2006.

[6] M. Généreux. The parameter file for the French chunker. http://www-lipn.univ-paris13.fr/~genereux/french-chunker-par-linux-3.1.bin.gz.
[7] P. Koehn. Europarl: A parallel corpus for statistical machine translation. In MT Summit 2005, 2005.
[8] T. Poibeau. The multilingual named entity recognition framework. In EACL ’03: Proceedings of
the tenth conference on European chapter of the Association for Computational Linguistics, pages
155–158, Morristown, NJ, USA, 2003. Association for Computational Linguistics.
[9] H. Schmid. Probabilistic part-of-speech tagging using decision trees. In International Conference on
New Methods in Language Processing, Manchester, UK, 1994.
[10] C. Spurk. Ein minimal überwachtes Verfahren zur Erkennung generischer Eigennamen in freien Texten. Master’s thesis, Department of Computational Linguistics and Phonetics, Saarland University,
2006.
[11] R. Steinberger, B. Pouliquen, A. Widiger, C. Ignat, T. Erjavec, D. Tufis, and D. Varga. The jrc-acquis:
A multilingual aligned parallel corpus with 20+ languages. In the Fifth International Conference on
Language Resources and Evaluation, LREC’06, Genoa, Italy, 2006.
[12] Unknown. News Commentary corpus. http://www.statmt.org/wmt07/shared-task.html, 2007.
[13] R. Yangarber, W. Lin, and R. Grishman. Unsupervised learning of generalized names. In Proceedings
of the 19th international conference on Computational linguistics, pages 1–7, Morristown, NJ, USA,
2002. Association for Computational Linguistics.