Minimally Supervised Named Entity Recognition in French

Yu CHEN and Jennifer PLANUL

Course Project for "NLP Applications"
Submitted to: Dr. Claire Gardent
Supervised by: Dr. Claire Gardent, PD Dr. Guenter Neumann (DFKI), Christian Spurk (DFKI)

Contents

1 Introduction
2 System Architecture
3 Implementation
  3.1 Preprocessing
  3.2 Chunking and Transformation Interface
    3.2.1 Chunking
    3.2.2 Noun Group Transformation
    3.2.3 Transformation Rules
  3.3 Named Entity Classifier
    3.3.1 The Seed-driven Classifier
    3.3.2 Seed Lists
4 Evaluation
  4.1 Experiment Setup
  4.2 Corpus Selection
  4.3 Evaluation Strategy
  4.4 Results
5 Discussion
  5.1 Problems
  5.2 Future Work
References

1 Introduction

Named entities (NEs) are entities "for which one or many rigid designators stands for the referent" (cf. Wikipedia). Named entity recognition (NER) involves both identifying and classifying NEs. It is essential for many other natural language processing tasks: NEs not only carry interesting semantic information but also facilitate later processing, since a multi-word NE can be treated as a whole. In practice, getting NEs right can help a statistical machine translation system improve by up to 10% in terms of BLEU score. Although NER is only a subtask of information extraction (IE), it is not trivial to solve. Several of its difficulties have been discussed in the course; one important cause is that names form an open class, so that almost any word can be used to name something.

There are two main approaches to NER: hand-crafted rules and machine learning. Rule-based systems may be very efficient and accurate, especially for specific domains, but they are expensive to build. Among the machine learning methods, supervised approaches require a large amount of annotated data, which is also expensive. Much recent research has therefore focused on unsupervised or semi-supervised learning, e.g. the seed-driven bootstrapping approaches introduced in [1, 13, 2]. An English NER system of this kind has been developed by [10] and made available online. However, most research in this area targets English and, as far as we know, no semi-supervised method has been studied for French NER; even the other kinds of approaches are not well studied for French. Both Sprout and Tagen [8] apply finite-state techniques to NER.

We aim to adapt the English NER system kindly provided by Christian Spurk and Dr.
Günter Neumann from DFKI to French and to carry out an evaluation of the new system. The adaptation includes data selection, reconstruction of the architecture and adaptation of the existing code. For the evaluation, we need to develop our own evaluation method. We are grateful for the support of our supervisors, Dr. Claire Gardent, Dr. Günter Neumann and Christian Spurk, throughout the whole project; we have learnt a lot from it.

In this report, we first present the architecture of our system in Section 2. The implementation details are described in Section 3. Section 4 introduces our evaluation strategy together with all the results from the experiments. The report finishes with a brief discussion, including open problems and future directions, in Section 5.

2 System Architecture

[Figure 1: Architecture of the English Named Entity Recognition System. Raw text passes through sentence detection, a tokenizer and a noun group chunker; the noun group transformer applies the NG transformation rules, and the GN classifier, trained from seeds and models, produces the classified names.]

Fig. 1 shows the minimally supervised English NER system introduced by [10] (referred to as "the English system" below). The system is composed of two main components: a rule-based noun group chunker, which acts as the NE recognizer, and a trie-based classifier trained with bootstrapping techniques. For input and output, there are several interfaces designed for different formats of existing text resources, including some existing test suites.

The architecture of our system is largely based on this system, but there are still some differences. Unlike the original self-contained solution, our final system includes several external modules, as we aim at adapting the English NER system in an economical way. The interfaces between modules are built according to the input/output format and encoding of each module. In addition, a new test suite is developed as part of the system because of the lack of suitable evaluation resources. The main building blocks of the architecture (illustrated in Fig. 2) are:

Preprocessing: The original PASSAGE [4] corpus is made of XML documents. The next module takes raw text as input, hence the XML tags must be removed. It is also important to sample the corpora, as each of the corpora we can access is larger than necessary. Section 3.1 gives more details on this module.

TreeTagger: Instead of developing an internal NP chunker, we use TreeTagger [9] with a French chunker parameter file [6]. Note that TreeTagger is the only module working on Latin-1 encoding (the other modules use UTF-8), so encoding conversions are needed before and after applying it. This module is described in Section 3.2.1.

Noun Group Transformer: The rule-based noun group transformer is used to modify the noun phrases produced by the chunker into smaller noun groups that are closer to NEs. In the original design, the transformer is an optional component inside the classifier; we made it a separate module in order to reduce the computation time. This module is mandatory in our system, as it can also eliminate the chunker's errors to a certain extent. It is described in Section 3.2.2, and the applied rules are introduced in Section 3.2.3.

Generalized Name Classifier: The classifier in our system remains almost the same as the one implemented in the English system. More details can be found in Section 3.3.1.
We also develop the seed lists for the different types of NEs used during the training of the classifier. The seeds are described in Section 3.3.2.

Evaluation: As mentioned, we have developed our own test suite for evaluation. Further information is given in Section 4.3.

3 Implementation

3.1 Preprocessing

Many of the corpora are XML documents or documents in other formats. Cleaning up a corpus is essential for later processing. It includes removing the XML markup and noisy characters (e.g. extra spaces) and sometimes identifying sentence boundaries. Moreover, a corpus may consist of multiple files; the texts from different files are saved to one single file to facilitate the following procedures.

[Figure 2: Architecture of the French Named Entity Recognition System. The XML documents are cleaned up into raw text, chunked by TreeTagger, transformed by the noun group transformer using the transformation rules, classified by the GN classifier trained with the seed list, and finally evaluated.]

For each of the corpora, a clean-up script is implemented in Python with the re (regular expression operations) module. Different operations are required for different corpora. For the XML files of the ESTER [5] corpus, we removed the XML markup and also some noisy characters: ^^ before unannotated names and * before oral-corpus problems (repetition, stammering, missed words, ...). In the "Le Monde Diplomatique" [4] corpus, there are underscore line separators that need to be replaced and reference numbers between brackets to be removed. As for the XML files of the WikiNews [4] corpus, we need to remove many useless XML nodes, various types of markup and noisy characters.

It is not feasible to experiment over the entire set of all accessible corpora. How the corpora are selected for the experiments is described in Section 4.2. Even if we narrow down to a single corpus of a smaller size, there are still problems. First, except for the ESTER corpus, none of the other corpora is designated for NER tasks: named entities are much less frequent than in the CoNLL corpus used for the English system, and sentences without any NEs may bring too much noise into the training process. Second, if we want to study the performance of training over data of variable size, it is necessary to ensure that we can draw conclusions over different subsets of the corpus just as confidently as if the entire corpus had been used.

The solution to the first problem is filtering. NEs in French usually contain at least one capitalized word, so capitalized words that are not at the beginning of a sentence are good indicators of sentences containing NEs. These sentences are certainly kept for later use. This does not mean we simply remove all the others: it is still possible that the first word of a sentence belongs to an NE, which cannot be detected so simply, and a small number of sentences without any NEs can still give some negative evidence on the behaviour of NEs. Therefore, we only filter out a portion, e.g. 80%, of the sentences without any non-initial capitalized word. As for the second problem, we employ a random sampling method to extract data from a corpus. In practice, the filtering and the sampling are performed on chunked text rather than raw text in order to avoid unnecessary repetition of chunking. The modules after the chunker do not consider contexts across sentence boundaries, therefore the samples do not have to be continuous. We first extract 500 sentences randomly from each (filtered) corpus as the test set; among the remaining sentences, we randomly select, in the same manner, a training set of a given size.
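The filtering and sampling steps can be summarised in a few lines. The following Python sketch illustrates the heuristic described above; the function names, the 20% keep-ratio parameter and the fixed random seed are illustrative choices of ours, not the delivered scripts. Each sentence is assumed to be a list of tokens.

    import random

    def has_internal_capital(tokens):
        # True if any token after the first one starts with an uppercase letter,
        # which is a good indicator that the sentence contains an NE.
        return any(t[:1].isupper() for t in tokens[1:])

    def filter_sentences(sentences, keep_ratio=0.2, seed=0):
        # Keep every sentence with a non-initial capitalized token; keep only a
        # small random portion of the others as negative evidence.
        rng = random.Random(seed)
        return [s for s in sentences
                if has_internal_capital(s) or rng.random() < keep_ratio]

    def sample_split(sentences, test_size=500, train_size=2000, seed=0):
        # Randomly draw a test set, then a training set from the remaining sentences.
        rng = random.Random(seed)
        shuffled = list(sentences)
        rng.shuffle(shuffled)
        return shuffled[test_size:test_size + train_size], shuffled[:test_size]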
3.2 Chunking and Transformation Interface

3.2.1 Chunking

The recognition task usually starts with identifying the noun phrases. In the English system, this is accomplished by a set of regular-expression rules that split a sentence into noun groups (a more general notion than noun phrases). Due to the time limit, we chose to use the existing French chunker included in the TreeTagger distribution. The other components of the system work with UTF-8 encoding, but the TreeTagger chunker only accepts plain Latin-1 text as input and outputs the tagged text in the same encoding. We therefore convert the encoding before and after the application of TreeTagger, which is simply done with the Unix program iconv.

3.2.2 Noun Group Transformation

The noun phrases given by the chunker are far from what we call named entities. In many cases, an NE is only a subgroup of the words inside the respective noun phrase. To discover these word groups, we adapt the noun group transformation technique implemented in the English system.

The noun group transformation component, a.k.a. the "transformer", transforms noun groups, i.e. noun phrases and their subgroups, into smaller ones. It consists of a set of transformation rules represented by regular expressions. Eventually, the chunks from the input are transformed into a set of noun groups, some of which are NEs. During the transformation process, all the rules are applied to each noun group in sequence according to their assigned priorities, and the number of noun groups increases through the process. If there are too many rules or too many noun groups, the computation of this module delays the whole system significantly. When the system follows the original design and embeds this process in the training cycle, it takes over 26 hours to train on the French "JRC-ACQUIS" corpus, which is composed of extremely long sentences. However, we cannot omit the transformation, because this process not only locates true NEs but also helps to correct many errors from the chunker. It is not practical to redesign the component from scratch either. In order to minimize the impact, we isolate the transformer from the classifier to the maximal extent: the chunked texts are transformed into the same texts annotated with the additional subgroups, simply based on the rules, and the resulting texts are saved directly on disk so that they can be reused at will. With the present settings, training over a set of 200,000 sentences can be finished in around 15 minutes.

3.2.3 Transformation Rules

The construction of the transformation rules started from the guidelines given with the English system. Many changes were made because of differences between the languages or between the two systems. Many English constructions are not seen in French: for instance, there is no genitive marker in French, compound noun phrases in English are expressed with noun phrases with post-modifiers in French, and there are no "ORG coach/editor/chief/... PER" constructs to split. The following operations either have already been done by the TreeTagger chunker or concern phenomena never seen in the corpora, so it is unnecessary to repeat them in the transformation process:

• removing number tokens following/preceding tokens with a capital letter;

• splitting conjunctions that actually join two separate NGs.
  For example, "Abou Ghraib et Guantanamo" is already annotated as two NPs by the chunker.

• removing dates from the end of any NG, which never occurs in the data.

Many new rules are added specifically for French phenomena. We also try to use such rules to recover some of the errors the chunker has made. The rules are sorted by priority. The final rules are listed with examples as follows (a small sketch of how such rules can be expressed as prioritized regular expressions is given at the end of this subsection):

1. Removing the punctuation and spaces remaining at the beginning and the end of an NG:
   (Tiers Monde -)NG → (Tiers Monde)NG

2. Removing punctuation inside an NG and splitting it into two NGs:
   (un redressement spectaculaire ; néanmoins)NG → (un redressement spectaculaire)NG (néanmoins)NG

3. Removing determiners from the beginning of any NG, which involves three rules:
   • one for "tout(e)", "tou(te)s": (tous les autres États)NG → (les autres États)NG
   • another for articles ("le", "un", ...) and demonstrative and possessive adjectives ("cet", "votre", ...): (les autres États)NG → (autres États)NG
   • and the last one for "autre(s)": (autres États)NG → (États)NG

4. Removing titles which begin with an uppercase letter but do not belong to person names:
   (M. Blair)NG → (Blair)NG
   (Président Bush)NG → (Bush)NG

5. Tearing apart lower-case tokens that precede upper-case tokens:
   (gouvernement Bush)NG → (gouvernement)NG (Bush)NG
   (co-président d'Alcatel)NG → (co-président)NG (Alcatel)NG
   (Chine de l' après-Mao)NG → (Chine)NG (après-Mao)NG

6. Tearing apart location adjectives from country names (sometimes already done by the chunker):
   (Corée du Nord)NG → (Corée)NG (Nord)NG
   (Europe de l' Est)NG → (Europe)NG (Est)NG
   (Guinée-équatoriale)NG → (Guinée)NG (équatoriale)NG

Note that some rules may break NEs into pieces. When the first word of the phrase "mer du Nord" is mistakenly capitalized, it will be split into two parts, "Mer" and "Nord". There are also many special location names beginning with a lower-case word, such as "golfe Persique" and "océan Atlantique", which will be broken by the fifth rule. The rules may therefore work differently on different corpora.
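To make this concrete, the sketch below expresses simplified versions of rules 3 and 5 as prioritized regular-expression transformations over whitespace-joined noun groups. It is only an illustration in Python under our own naming and simplifications, not the actual rule set used by the transformer.

    import re

    # (priority, pattern, replacement); "#" marks a split point so that one
    # noun group can be transformed into several smaller ones.
    RULES = [
        # simplified rule 3: drop an article or demonstrative/possessive from the front
        (3, re.compile(r"^(le|la|les|l'|un|une|des|ce|cet|cette|ces|votre|notre)\s+"), ""),
        # simplified rule 5: split lower-case tokens off a following capitalized token
        (5, re.compile(r"^([a-zàâçéèêëîïôûù'’ -]+?)\s+([A-ZÀÂÇÉÈÊËÎÏÔÛ].*)$"), r"\1#\2"),
    ]

    def transform(noun_group):
        # Apply the rules in priority order and return the resulting noun groups.
        groups = [noun_group.strip()]
        for _priority, pattern, replacement in sorted(RULES, key=lambda r: r[0]):
            groups = [part
                      for g in groups
                      for part in pattern.sub(replacement, g).split("#")
                      if part]
        return groups

    print(transform("le gouvernement Bush"))  # -> ['gouvernement', 'Bush']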
3.3 Named Entity Classifier

3.3.1 The Seed-driven Classifier

The Generalized Name (GN) classifier in the English system is implemented with a seed-based bootstrapping algorithm. Given a tokenized sentence, the classifier annotates each marked noun group with an estimated category, according to a model trained from unannotated texts and a small list of classified seeds.

The basic assumption of this classifier is that the spelling (internal) features and the contextual (external) features of a noun group, a potential NE, provide evidence for the category it belongs to [3]. This assumption still holds for French. For instance, most all-caps words fall into the category ORG, so "EDF" is likely to be an organization name. Also, many person names start with "Jean-", so both "Jean-Marie Lehn" and "Jean-Pierre Koenig" tend to be used as person names. There are of course negative examples, such as "Jean-Coutu" as an organization name, hence we also need the contextual features: if many location names follow the word "en", the noun "Bosnie" after "en" is also likely to be a location. As with the spelling features, many contextual constructions can be seen with multiple categories. The two types of features should therefore be considered interactively.

The features of a noun group are encoded with four tries for an arbitrary number of categories. Two morphological tries keep the spelling features, one for prefixes and another for suffixes. The contextual features are stored in two further contextual tries, one for the suffix of the left context and another for the prefix of the right context. Each node in the tries contains structured frequency information with respect to the categories. The leaf nodes of different tries can be connected via links that indicate the relations between the internal patterns and the external patterns; the frequencies of such links are also counted.

The tries are first built from the unannotated training corpus. When seeds are entered into the tries, the node frequency information is adjusted by bootstrapping: if the path of a seed in the morphological tries reaches a sufficiently high threshold probability for a certain category, all the contexts linked to this path are re-estimated for the category; in return, if one of the contexts reaches the threshold, all linked morphological tries are re-estimated. This recursive procedure continues until no more adjustments are possible. The modified tries are used as the model for classification: given an NG from the target corpus, the classifier assigns the most probable category, i.e. the most frequent one at the node corresponding to the NG in the final tries.

The core algorithm has been implemented in a language-independent manner. When adapting the system to French, we restrict our modifications to the minimum in order to avoid losing its function as an English NER system. As mentioned in previous sections, we do not use the internal chunker. In order to connect the existing training component with the external modules, we create a new interface (cf. FrenchReader in the Java code) for chunked texts, so that the chunks contained in the input texts can be passed together with the tokenized sentences to the trie-building block. Accordingly, a new output module is designed for French text. We also add an option to the training module that allows it to build tries without transforming the noun groups; the setting of this option depends on whether the input has already been transformed or not. Apart from these three changes, the program works perfectly with French input. We also added a pair of functions to save a trained model (the set of tries) to a file and to retrieve the model from the file (cf. GeneralizedNameClassifier).

In addition to the basic functions of the system, we provide a new extension to the training module. The input of this module in our system is the output of the TreeTagger chunker, which contains not only annotations of chunks but also part-of-speech (PoS) tags for all the tokens. Even if the PoS tags do not provide as strong evidence for NE classification as the spelling and contextual features, they can still serve as a backup reference when the decision is difficult to make. For example, it is rather improbable for a noun group with the PoS pattern "NAM PRP:det NOM" (Corée du Nord) to be a person name. This is implemented with additional tries for PoS tags (cf. the PoSTrieBuilder class in the Java source). Each tag from the French TreeTagger tagset is encoded with one letter, as the information a tag holds is not changed by how it is represented. Four additional tries are built for the encoded PoS tag sequences, and the bootstrapping algorithm is applied to them in the same way. This is only a prototype implementation (cf. the PoSGeneralizedName class in the Java source): the PoS tag tries have not been linked to the original tries yet, and it is not clear how they should interact with the original ones. It is still an optional extension rather than an independent module. Some results from a preliminary experiment with this extension are discussed in Section 4.4.
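The following much-simplified Python sketch illustrates the kind of bookkeeping behind the classifier: a character trie whose nodes carry per-category frequencies, with the category read off the deepest node matched by a new name. It only shows the idea of the spelling tries; the real classifier additionally uses suffix and context tries, links between them and the bootstrapping re-estimation described above, and all names here are ours.

    from collections import defaultdict

    class TrieNode:
        def __init__(self):
            self.children = {}
            self.counts = defaultdict(int)   # category -> frequency at this node

    class PrefixTrie:
        # A character-level prefix trie; suffix and context tries can be built the
        # same way over reversed strings or over context tokens.
        def __init__(self):
            self.root = TrieNode()

        def add(self, word, category, weight=1):
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
                node.counts[category] += weight

        def classify(self, word, threshold=0.6):
            # Follow the word as deep as possible and return the dominant category
            # at the deepest reached node if it is frequent enough there.
            node, deepest = self.root, None
            for ch in word:
                if ch not in node.children:
                    break
                node = node.children[ch]
                deepest = node
            if deepest is None:
                return None
            total = sum(deepest.counts.values())
            category, freq = max(deepest.counts.items(), key=lambda kv: kv[1])
            return category if freq / total >= threshold else None

    trie = PrefixTrie()
    for name in ("Jean-Marie", "Jean-Pierre", "Jeanne"):
        trie.add(name, "PER")
    trie.add("Jean-Coutu", "ORG")
    print(trie.classify("Jean-Luc"))  # -> 'PER' (2 of the 3 names sharing "Jean-" are persons)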
3.3.2 Seed Lists

The seed lists are built from the only annotated corpus, ESTER. We use a Python script to identify the annotated NEs, compute the frequency of each NE, and then build the entry-level lists. Although the ESTER annotation provides more fine-grained categories and subcategories, we only consider the three typical NE categories in the experiments: LOC (locations), PER (persons) and ORG (organisations). The NEs of the ESTER category GSP (geo-socio-political groups), in which country names are the majority, are also included in our LOC seeds. The other categories and the subcategories are ignored.

There are 46167 annotated NE occurrences in the ESTER corpus, corresponding to 6770 different NEs. Only 11.8% of all the NE variants, 497 NEs, occur more than 10 times, but the occurrences of these NEs account for 72.3% of all occurrences. If ESTER were used for testing, a system trained with these NEs as seeds could easily achieve good performance that would not, in any way, reflect the real quality of the system. For this reason, we do not include ESTER in either training or testing, but only use it to create the seed lists.

We select the NEs that appear more than ten times as our seeds. Presumably, these NEs can still be found in the other news corpora, though not as frequently as in ESTER. In addition to the automatic extraction, we also manually clean up the lists: some NEs were not correctly classified in ESTER, some were included in the lists several times by the script because they were followed by different modifiers, and we also remove ambiguous NEs. Finally, we obtained three lists with the following numbers of seeds: 266 in LOC, 272 in PER and 82 in ORG. The full cleaned lists, including the NEs that appear fewer than 10 times, are also used in one of our experiments.
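For illustration, a seed extraction in the spirit of the script described above can be sketched as follows. The annotation pattern is a simplifying assumption (a normalized <CAT>name</CAT> form, not ESTER's actual markup), the file name is a placeholder, and the manual clean-up is not reproduced.

    import re
    from collections import Counter

    # Assumed, simplified annotation form <CAT>name</CAT>; ESTER's real markup is richer.
    NE_PATTERN = re.compile(r"<(LOC|PER|ORG|GSP)>(.+?)</\1>")
    CATEGORY_MAP = {"GSP": "LOC"}   # geo-socio-political names are folded into LOC

    def extract_seeds(text, min_count=10):
        counts = {"LOC": Counter(), "PER": Counter(), "ORG": Counter()}
        for category, name in NE_PATTERN.findall(text):
            category = CATEGORY_MAP.get(category, category)
            counts[category][" ".join(name.split())] += 1
        # keep only the NEs occurring more than ten times as seeds
        return {category: sorted(name for name, c in counter.items() if c > min_count)
                for category, counter in counts.items()}

    # "ester_normalized.txt" is a placeholder for the pre-processed ESTER corpus
    seeds = extract_seeds(open("ester_normalized.txt", encoding="utf-8").read())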
4 Evaluation

4.1 Experiment Setup

All the components have been tested on an Intel Xeon machine (CPU: 2.66 GHz x 8, RAM: 16 GB) running 64-bit Fedora Linux. Our programs are expected to be platform-independent, but they have never been tested on Windows or MacOS. The entire pipeline (from the XML documents to the classified outputs) requires approximately 3.5 GB of free hard disk space, not including the installation space of the external software. The faster the CPU (1 GHz recommended) and the larger the memory (2 GB recommended), the better the system will perform. It takes 258 MB of RAM to train on 5000 sentences of moderate length (less than 30 tokens on average) and 2.26 GB to train on 200,000 sentences; the required memory is almost linear in the size of the training data. The system minimally requires a JDK version 1.5 or later and a Python version 2.4 or later. An installation of TreeTagger is necessary, and Tagen [8] is needed for part of the evaluation.

4.2 Corpus Selection

In this project, we are seeking corpora with more named entities, to allow not only effective but also intensive training of the classifier. After manual inspection of the PASSAGE corpus and many other corpora, we chose to investigate six French corpora more thoroughly: the French section of the "JRC-ACQUIS Multilingual Parallel Corpus" (Acquis) [11], the French section of the "European Parliament Proceedings Parallel Corpus 1996-2006" (Europarl) [7], the French section of the "News Commentary Corpus" (NC) [12], the "Évaluation des Systèmes de Transcription d'Émissions Radiophoniques" corpus (ESTER) [5], and the "Le Monde Diplomatique" (MD) and French Wikinews (Wikinews) sections of the PASSAGE [4] corpus.

Table 1 lists several parameters for each corpus: the overall number of sentences (S), the number of tokens (T), the number of capitalized tokens that are not at the beginning of a sentence (T_C), the number of sentences containing such tokens (S_C), the number of noun phrase chunks (C_NP) and the number of tokens tagged with the "NAM" (name) tag (T_NAM). All these parameters are based on the output of TreeTagger.

    Corpus     S        T         S_C     T_C      C_NP     T_NAM
    Acquis     151356   8050576   86227   463249   2104925  266212
    Ester      40306    1150133   24772   71457    283807   65529
    Europarl   1323414  41087716  686862  1460232  9500316  889720
    MD         938535   24426013  552019  1593612  6116565  1200536
    NC         42159    1214523   23525   55416    293302   44101
    Wikinews   18917    592394    12594   47273    153152   38194

    Table 1: Corpus Parameters

The parameters together give some implicit evidence of how suitable a corpus is for our task. Such evidence is clearer from the combinations of the parameters in Table 2.

    Corpus     T/S     T_C/S  T_C/T  T_NAM/S  T_NAM/T  C_NP/S  C_NP/T  T_NAM/C_NP  T_C/C_NP
    Acquis     53.190  3.061  0.058  1.759    0.033    13.907  0.261   0.126       0.220
    Ester      28.535  1.773  0.062  1.626    0.057    7.041   0.247   0.231       0.252
    Europarl   31.047  1.103  0.036  0.672    0.022    7.179   0.231   0.094       0.154
    MD         26.026  1.698  0.065  1.279    0.049    6.517   0.250   0.196       0.261
    NC         28.808  1.314  0.046  1.046    0.036    6.957   0.241   0.150       0.189
    Wikinews   31.315  2.499  0.080  2.019    0.064    8.096   0.259   0.249       0.309

    Table 2: Combination of the Parameters

The depth of the tries built during the training process is linear in the length of the sentences, while only contexts of a maximal length of 3 are considered, so extremely long sentences may increase the complexity unnecessarily. Hence, corpora with a moderate average sentence length (T/S) are more desirable. The capitalized tokens inside a sentence and the tokens tagged with "NAM" are the ones most likely to be NEs, and the ratio of these tokens over all the sentences/chunks/tokens reflects the estimated density of NEs in a corpus; in order to train the classifier more effectively, a higher ratio is better.

According to the data shown in the tables, Acquis and Europarl are not suitable because of their low NE density (per token) and their long average sentences. The ESTER corpus, a speech transcription, and the Wikinews corpus, with unrecoverable noise, are removed from the list even though both of them seem promising in terms of the ratios. The annotated ESTER corpus is used for extracting the seed lists. Finally, the MD corpus and the NC corpus are chosen for both training and testing.

4.3 Evaluation Strategy

We randomly extract a set of 500 sentences from each of the two selected corpora as the test documents for the evaluation. The two sets are manually annotated with the three categories of NEs. We adopt a simplified MUC-like annotation scheme in the form of SGML text markup, in which an identified NE is marked up with inserted tags enclosed in angle brackets. The markup has the following form: <enamex="CAT-VALUE"> text-string </enamex>, where "CAT-VALUE" can be one of "LOC", "PER" and "ORG". The system also gives its output in this form. We also supply a script that transforms the annotation in Tagen's output into this form and removes uninteresting tags, such as those for temporal expressions. Having a uniform annotation scheme in the files from all sources allows us to apply the same script (cf. evaluate.py) to the evaluation task.

The manually annotated test documents are used as references. There are 457 NEs annotated in the MD test document, of which 346 are not in the seed lists. As for the NC test document, 323 out of 662 NEs are not seeds. We compute the precision, the recall and the f-score of a system output against the reference as usual. The scores are also computed for each of the NE categories. Furthermore, we zoom in on the results for NEs that do not appear in the seed lists, so as to assess the learning ability of the seed-driven system.
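The scoring itself follows the usual definitions; a reduced sketch of the idea behind evaluate.py is shown below. The helper names are ours, and entities are matched as surface string plus category, which simplifies the real alignment of annotated spans.

    import re
    from collections import Counter

    ENAMEX = re.compile(r'<enamex="(LOC|PER|ORG)">\s*(.*?)\s*</enamex>')

    def annotations(text):
        # Multiset of (category, entity string) pairs found in an annotated text.
        return Counter(ENAMEX.findall(text))

    def precision_recall_f(reference, output):
        # Score a system output against the manually annotated reference;
        # both are strings carrying <enamex="CAT"> markup.
        ref, out = annotations(reference), annotations(output)
        correct = sum((ref & out).values())          # multiset intersection
        precision = correct / max(sum(out.values()), 1)
        recall = correct / max(sum(ref.values()), 1)
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f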
4.4 Results

Several experiments are performed on the selected data in order to evaluate the system from different aspects. The results obtained from the system are compared with those of an existing system, Tagen, which is built with finite-state techniques. It is sometimes impossible to train on data from the same source as the target document, hence we carry out an "out-of-domain" test to get a rough view of this situation. In addition, we investigate the impact of two parameters on the performance. Finally, we present a preliminary evaluation of the proposed PoS extension to the classifier. The results of these experiments are discussed in the rest of this section.

Comparison with Tagen: We train the system with 2000 sentences and test it with another 500 sentences from the same source. The seeds used are the most frequent ones. Table 3 and Table 4 compare the results of our system with those of Tagen. Tagen does not depend on any particular frequent NEs, so its performance on the less frequent NEs is rather close to its overall scores. The difference shown by our system is much more pronounced; in particular, the recall may decrease by up to 90% (NC data) on the NEs not included in the seed lists. The system outperforms Tagen on the NC data, but gives worse results on the MD data. We cannot draw an absolute conclusion on which system is better, but ours looks promising to overtake Tagen with further improvements. Our system has low recall in general, which is mostly the result of unreliable recognition: many NEs never even reach the classifier. The system is more capable of handling location names and person names than organization names, as there are far fewer ORG seeds than seeds of the other categories.

            FrenchNER                                  Tagen
            Overall              Out-of-seed-lists     Overall              Out-of-seed-lists
    Type    Prec.  Rec.   f      Prec.  Rec.   f       Prec.  Rec.   f      Prec.  Rec.   f
    LOC     0.905  0.637  0.748  0.190  0.033  0.056   0.841  0.634  0.723  0.680  0.579  0.625
    PER     0.667  0.219  0.330  0.391  0.083  0.137   0.541  0.241  0.333  0.480  0.222  0.304
    ORG     0.576  0.254  0.352  0.625  0.060  0.109   0.091  0.022  0.036  0.091  0.036  0.051
    All     0.826  0.473  0.601  0.346  0.058  0.099   0.730  0.429  0.540  0.522  0.310  0.389

    Table 3: FrenchNER vs Tagen on NC data

            FrenchNER                                  Tagen
            Overall              Out-of-seed-lists     Overall              Out-of-seed-lists
    Type    Prec.  Rec.   f      Prec.  Rec.   f       Prec.  Rec.   f      Prec.  Rec.   f
    LOC     0.704  0.393  0.504  0.200  0.065  0.098   0.767  0.545  0.637  0.649  0.387  0.485
    PER     0.562  0.164  0.254  0.517  0.143  0.224   0.391  0.327  0.356  0.378  0.324  0.349
    ORG     0.455  0.041  0.075  0.444  0.034  0.063   0.250  0.065  0.103  0.250  0.068  0.107
    All     0.661  0.243  0.355  0.346  0.078  0.127   0.587  0.363  0.449  0.459  0.260  0.332

    Table 4: FrenchNER vs Tagen on MD data
Out-of-domain tests: The out-of-domain tests are also performed on the NC-MD combination. The left part of Table 5 lists the scores of the system trained on MD data but tested on NC data, and vice versa for the right part. Both sets are news; ideally, it would be better to use data from more distant sources, e.g. training on MD but testing on Europarl. When the system is applied to a document collection different from the one on which it was trained, the performance decreases as expected. However, the decreases are not dramatic. In fact, the recalls are generally better when the system is trained on a different set. One reason could be that the system has to consider more evidence when no single kind of evidence dominates, which is to be expected when the targets are very different from what the system has seen in training.

            MD -> NC                                   NC -> MD
            Overall              Out-of-seed-lists     Overall              Out-of-seed-lists
    Type    Prec.  Rec.   f      Prec.  Rec.   f       Prec.  Rec.   f      Prec.  Rec.   f
    LOC     0.895  0.629  0.739  0.048  0.008  0.014   0.679  0.406  0.508  0.224  0.089  0.127
    PER     0.717  0.241  0.361  0.500  0.111  0.182   0.525  0.191  0.280  0.486  0.171  0.254
    ORG     0.562  0.269  0.364  0.538  0.083  0.144   0.143  0.049  0.073  0.125  0.043  0.064
    All     0.818  0.476  0.602  0.345  0.064  0.108   0.546  0.258  0.351  0.270  0.098  0.144

    Table 5: Training and testing on different data

Training of Variable Size: We extract training sets of different sizes from the MD corpus; the test data remains the same as in the previous experiment. The results are given in Table 6. We expected the system to perform better with a larger training set, but this turns out not to be always true. Figure 3 shows the learning curve of our system: the f-score actually drops as the training data grows from roughly the size of the test document to twice that size, and the performance levels off rather than keeps increasing once the training data is more than about ten times the size of the test document.

    Size      LOC                    PER                    ORG                    ALL
              Prec.  Rec.   f        Prec.  Rec.   f        Prec.  Rec.   f        Prec.  Rec.   f
    100       0.370  0.467  0.413    0.239  0.236  0.237    0.250  0.164  0.198    0.319  0.330  0.325
    250       0.505  0.458  0.480    0.361  0.200  0.257    0.286  0.082  0.127    0.450  0.295  0.357
    500       0.567  0.449  0.501    0.345  0.182  0.238    0.333  0.082  0.132    0.492  0.287  0.362
    750       0.599  0.431  0.501    0.405  0.136  0.204    0.500  0.066  0.116    0.558  0.263  0.357
    1000      0.617  0.409  0.492    0.441  0.136  0.208    0.444  0.066  0.114    0.572  0.252  0.350
    2000      0.732  0.400  0.517    0.548  0.155  0.241    0.600  0.049  0.091    0.689  0.247  0.364
    5000      0.845  0.387  0.530    0.722  0.118  0.203    0.600  0.049  0.091    0.809  0.232  0.361
    10000     0.879  0.387  0.537    0.824  0.127  0.220    0.500  0.033  0.062    0.847  0.230  0.361
    25000     0.897  0.387  0.540    0.933  0.127  0.224    0.600  0.025  0.047    0.889  0.228  0.362
    50000     0.897  0.387  0.540    1.000  0.127  0.226    0.750  0.025  0.048    0.904  0.228  0.364
    100000    0.897  0.387  0.540    0.933  0.127  0.224    0.750  0.025  0.048    0.897  0.228  0.363
    200000    0.897  0.387  0.540    1.000  0.127  0.226    0.750  0.025  0.048    0.904  0.228  0.364

    Table 6: Training on different sizes of corpus

[Figure 3: Learning curve of the system, plotting the overall f-score against the training corpus size (100 to 200,000 sentences, logarithmic scale).]

Furthermore, the precision and recall curves in Figure 4 also become stable after a certain size. Unlike the precision and the f-score, the recall declines as the training set gets larger, because more noise is introduced into the system by a larger training corpus: non-NE noun groups are, after all, the majority in the data set.
Too many negative examples seen in training lead the system to treat an NE as a normal noun group more confidently. Remarkably, the precision comes very close to 100%, and indeed reaches it for person names. This indicates the potential of our system and the quality of the classifier.

[Figure 4: Comparison of performance on different sizes of training sets: (a) precision and (b) recall for LOC, PER, ORG and All, with Tagen's overall scores as a reference.]

Full seed lists: In the other experiments we always use the seed lists composed of the most frequent NEs found in the ESTER data; for this experiment, all the NEs we found in the corpus are included in the seed lists. Table 7 shows the results. There is an improvement of approximately 3% on each data set, and compared with Table 3 and Table 4, the recalls increase by over 10%. The more seeds we have, the more the system can learn from the training data and the more confidently it can assign a category to an NE.

            NC                                         MD
            Overall              Out-of-seed-lists     Overall              Out-of-seed-lists
    Type    Prec.  Rec.   f      Prec.  Rec.   f       Prec.  Rec.   f      Prec.  Rec.   f
    LOC     0.706  0.798  0.749  0.356  0.554  0.434   0.524  0.545  0.534  0.284  0.339  0.309
    PER     0.628  0.358  0.456  0.500  0.259  0.341   0.463  0.227  0.305  0.442  0.219  0.293
    ORG     0.413  0.388  0.400  0.287  0.269  0.278   0.200  0.081  0.116  0.184  0.076  0.107
    All     0.639  0.624  0.631  0.359  0.368  0.364   0.466  0.344  0.395  0.297  0.213  0.248

    Table 7: Experiments with full-size seed lists

The PoS extension: We only carried out experiments with the PoS extension on two sets of training data, both extracted from the MD corpus. The overall performance generally decreases compared with the results shown in Table 6: the f-score drops from 0.325 to 0.301 for 100 sentences and from 0.364 to 0.342 for 50,000 sentences (Table 8). However, the performance on the out-of-seed-list NEs increases consistently: the f-score climbs from 0.135 to 0.147 for 100 sentences and from 0.060 to 0.070 for 50,000 sentences. Although the newly introduced PoS features may have confused the system, they still provide evidence to a certain extent, especially when the other features are not sufficient for a confident estimation (e.g. with a small training set). Further work on integrating the PoS features should be able to bring improvements.

            100 sentences                              50,000 sentences
            Overall              Out-of-seed-lists     Overall              Out-of-seed-lists
    Type    Prec.  Rec.   f      Prec.  Rec.   f       Prec.  Rec.   f      Prec.  Rec.   f
    LOC     0.366  0.464  0.409  0.098  0.153  0.120   0.891  0.366  0.519  0.286  0.016  0.031
    PER     0.239  0.236  0.237  0.210  0.210  0.210   1.000  0.118  0.211  1.000  0.095  0.174
    ORG     0.125  0.081  0.099  0.103  0.067  0.081   0.400  0.016  0.031  0.250  0.008  0.016
    ALL     0.296  0.306  0.301  0.130  0.141  0.135   0.882  0.212  0.342  0.619  0.037  0.070

    Table 8: Experiments with the PoS extension

5 Discussion

In Section 1 we stated that the goal of this project is to adapt the English NER system to French and to carry out an evaluation of the resulting system. We have successfully built the system and performed a group of experiments. The major contributions of this project are the following:

• We study the accessible corpora through corpus statistics.

• A new architecture is designed based on the available resources.

• A set of French transformation rules is developed and tested.

• The Java library from the English system is adapted to French texts.
• We introduce a test bench, evaluate the system and compare it to an existing system.

Additionally, we propose and implement several things that were not part of the original requirements:

• The transformer is separated from the classifier.

• PoS features are included in the training.

• A classifier model can be serialized to disk and reloaded.

Meanwhile, there are still many problems to be solved in the future; they are discussed in the following sections.

5.1 Problems

Most of the problems of the system come from the recognition part, namely the chunker and the transformer. In addition, there are issues due to the lack of manpower/time and resources.

Recognition: There is no accessible well-annotated corpus with which we could re-train the chunker, so we only use it as a black box. The chunker turns out to be both the most crucial and the weakest link in the system. In most cases, the shortest possible noun phrases are returned, but in French many named entities are expressed as complex noun phrases, which are therefore split up by the chunker. Unfortunately, the separation cannot easily be recovered by the other modules. The following two frequent types of NEs are systematically destroyed by the chunker:

• NEs that start with a capitalized noun followed by lower-case adjective(s). For instance, "Union soviétique" is annotated by the chunker as (Union_NOM)NP (soviétique_ADJ)AP, whereas the desirable output (for our task) would be (Union_NOM soviétique_ADJ)NP.

• NEs containing an all-lowercase prepositional phrase as a modifier. Such NEs are very common among organisation names, like "Organisation mondiale du commerce". The chunker splits this phrase into (Organisation_NOM)NP (mondiale_ADJ)AP (du_PRP:det commerce_NOM)PP, instead of (Organisation_NOM mondiale_ADJ (du_PRP:det commerce_NOM)PP)NP.

It is indeed a dilemma, as sometimes only the head of such a complex noun phrase is a name. Identifying potential NEs from noun chunks by transformation is not a trivial task: one rule, or a particular ordering of a set of rules, may improve the recognition of some NEs while damaging others at the same time. We tested the transformation rules iteratively on the NC corpus; the errors in the output were manually inspected in order to further improve the rules, and such development cycles were carried out several times during the project. We tried to keep the rules as general as possible, but the decisions about partially-good rules are still very difficult to make, and we can hardly estimate the performance of the final set of rules on any other corpus. This may also explain the better overall performance on the NC data.

Resources and time: The English system uses three seed lists of approximately 1800 largely unambiguous seeds each. This size is hard to achieve given that ESTER is the only source from which we can obtain seeds. Besides, the seeds are not balanced across categories, which is in fact reflected in the results. Similarly, the rule-based classification models that have been implemented in the English system could not be included in this project; adding these models should result in an improvement of the entire system.

As for the evaluation, the test documents we used are relatively small, but it would take too long to annotate a larger corpus, especially since only one of us knows French. These documents are automatically extracted, so the distribution of NEs in them may not be dense enough and balanced enough to give a comprehensive evaluation. There are not more than 700 NEs in either test document.
That is, wrongly classifying a single NE may lead to roughly a 1% decrease in performance. We also considered other evaluation strategies, including building a gold standard and manually inspecting the results; the former is not realistic in terms of time, and the latter would make it difficult to perform multiple groups of experiments.

5.2 Future Work

This project provides a starting point for further work on minimally supervised French named entity recognition. Such a system is interesting for both research and applications, but it also requires considerably more effort. There are many possible directions.

The first possible direction is to fully adapt the architecture of the English system. Several components, such as the chunker and the rule-based classification models, were either excluded or replaced with alternatives in this project. Implementing these components for French will certainly help to improve the performance; it also involves solving the problems mentioned in the previous section.

It will also be interesting to look for a more comprehensive experimental methodology for evaluating the system. A common, reliable benchmark is important; it should be developed separately from the system development, and the benchmark itself should also be evaluated. It is likewise important to perform more experiments to find out the impact of different parameters on the system. Consider the seed lists: understanding more clearly the relation between the size, the distribution and the source of the seeds would help decision making for real applications.

We have proposed a prototype for integrating PoS features into the system; completing it is another possible direction. One way to finish the design is to weight the estimations from the different tries, particularly the PoS tries. Linking the PoS tries to the original ones may also be a solution. Instead of increasing the number of tries, we could also embed the PoS features directly in the original tries, for instance by using them as factors on the nodes.

References

[1] M. Collins and Y. Singer. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 100-110, 1999.

[2] S. Cucerzan and D. Yarowsky. Language independent named entity recognition combining morphological and contextual evidence. In Proceedings of the Joint SIGDAT Conference on EMNLP and VLC, 1999.

[3] S. Cucerzan and D. Yarowsky. Language independent named entity recognition combining morphological and contextual evidence. In Proceedings of the 1999 Joint SIGDAT Conference on EMNLP and VLC, 1999.

[4] E. de la Clergerie. PASSAGE: Large scale production of syntactic annotations to move forward. ANR MDCA project proposal, 9 May 2006. http://atoll.inria.fr/passage/proposition.toc.html.

[5] ELRA. Évaluation des Systèmes de Transcription d'Émissions Radiophoniques. http://catalog.elra.info/product_info.php?products_id=999, 2006.

[6] M. Généreux. The parameter file for the French chunker. http://www-lipn.univ-paris13.fr/~genereux/french-chunker-par-linux-3.1.bin.gz.

[7] P. Koehn. Europarl: A parallel corpus for statistical machine translation. In MT Summit 2005, 2005.

[8] T. Poibeau. The multilingual named entity recognition framework. In EACL '03: Proceedings of the Tenth Conference of the European Chapter of the Association for Computational Linguistics, pages 155-158, Morristown, NJ, USA, 2003. Association for Computational Linguistics.

[9] H. Schmid.
Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK, 1994.

[10] C. Spurk. Ein minimal überwachtes Verfahren zur Erkennung generischer Eigennamen in freien Texten. Master's thesis, Department of Computational Linguistics and Phonetics, Saarland University, 2006.

[11] R. Steinberger, B. Pouliquen, A. Widiger, C. Ignat, T. Erjavec, D. Tufis, and D. Varga. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the Fifth International Conference on Language Resources and Evaluation, LREC'06, Genoa, Italy, 2006.

[12] Unknown. News Commentary corpus. http://www.statmt.org/wmt07/shared-task.html, 2007.

[13] R. Yangarber, W. Lin, and R. Grishman. Unsupervised learning of generalized names. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1-7, Morristown, NJ, USA, 2002. Association for Computational Linguistics.