Information extraction from Chinese-English bitext

1. Introduction

Information extraction (IE) is a type of information retrieval (IR) technology that automatically maps natural-language text into structured relational data, i.e. categorized, contextually and semantically well-defined data. At the core of an IE system is an extractor, which processes text: it overlooks irrelevant words and phrases and attempts to home in on entities and the relationships between them. [1] While IR retrieves relevant documents from collections, IE retrieves relevant information from documents. The significance of information extraction has grown as the amount of information available in unstructured form experiences exponential growth; the Internet is a case in point. Through information extraction, knowledge can be made more accessible by transforming it into relational data, or by marking it up with XML tags. Existing IE techniques range from direct knowledge-based encoding (a human enters regular expressions or rules) to supervised learning (a human provides labeled training examples) to self-supervised learning (the system automatically finds and labels its own examples). [1]

The term Named Entity (NE) was first introduced at the Message Understanding Conferences (MUC); it is now a widely used term in Information Extraction (IE), Question Answering (QA) and other Natural Language Processing (NLP) applications. At the level of entity extraction, Named Entities were defined as proper names and quantities of interest: person, organization and location names were marked, as well as dates, times, percentages and monetary amounts. [2] Named entity recognition (NER), also known as entity identification or entity extraction, is a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories of named entities.
A bitext is a merged document composed of two versions of a given text, usually in two different languages. An aligned bitext is produced by an alignment tool, or aligner, which automatically aligns or matches the different versions of the same text, generally sentence by sentence. [3] Many tools for multilingual corpus processing exist, based on different approaches (statistical, linguistic, etc.). Unitex is a multi-platform system that relies on high-quality language resources such as electronic lexicons and grammars; it is one of the few systems in the world that include both corpus-processing and resource-management functionality. NooJ is another text-processing tool, based on large-coverage dictionaries as well as morphological and syntactic grammars described using graphs. This thesis is about information extraction from Chinese-English bitext using the Unitex system, with a brief comparison with NooJ.

2. The Characteristics of Chinese

Chinese comprises Pinyin (phonetics) and Hanzi (Chinese characters). Pinyin is the romanization system for Chinese; it literally means "spelling sound". Chinese characters are transcribed into the Roman alphabet to provide a visual representation of Chinese sounds. Nowadays pinyin is also a common typing method for entering Chinese characters on computers and cell phones. There are tens of thousands of Chinese characters; however, it is estimated that basic Chinese literacy can be achieved with knowledge of 2,000 to 3,500 characters. [7] Chinese characters are logographic symbols: each individual character represents an idea or thing, and combinations of characters express different meanings, which are usually, but not necessarily, the combinations of each character's meaning. Chinese syntax is, in a way, similar to English: sentences are often formed by stating a subject which is followed by a predicate.
The predicate can be an intransitive verb, a transitive verb followed by a direct object, a linking verb followed by a predicate nominative, etc. The most common sentence structure has SVO (subject + verb + object) word order. Some characteristics of Chinese that are relevant to our project are:

• Chinese does not have tenses. Tense is indicated by adverbs of time ('tomorrow', 'just now') or particles.
• Chinese does not use grammatical gender.
• There is no grammatical distinction between singular and plural; the distinction is accomplished by sentence structure.
• All words have only one grammatical form: there are no changes in the form of a word through inflection according to tense, mood or aspect.
• Chinese sentences are written as character strings with no spaces between words. Word is a vague concept in Chinese, being defined as one or more characters representing a linguistic token. [8] Words in Chinese are actually not well marked in sentences, and there does not exist a commonly accepted Chinese lexicon. [9]

3. Bitext - parallelized text

In our project, we took the Chinese and English translation texts of the famous Jules Verne novel Around the World in 80 Days for exploration. In general, bitext construction proceeds in two main steps: segmentation of the text into sentences, and alignment of the sentences.

1. Segmentation of text into sentences.

The common methods of alignment of a bitext usually assume that before alignment both texts have been marked up, which means that the elements of their logical layout were explicitly and unambiguously annotated. [3] Extensible Markup Language (XML) is used to tag the logical layout. The marked-up XML document can be viewed as a tree structure with labeled internal nodes and leaf nodes. In this structure, each node is labeled with its element name, and leaf nodes are either elementary character chunks containing no tags or empty elements.
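The segmentation step described above can be sketched in a few lines of Python. This is only a simplified illustration, splitting on sentence-final punctuation and wrapping each sentence in TEI-style tags; it is not the actual marking-up procedure used in the project, and a real segmenter must also handle abbreviations, quotation marks and ellipses.

```python
import re

def segment_to_tei(paragraphs):
    """Wrap each sentence of each paragraph in TEI-style <seg> tags.

    Sentences are split after Chinese sentence-final punctuation
    (。！？) or Western '.', '!', '?' (a deliberate simplification)."""
    out = ["<body>", "<div>"]
    for p in paragraphs:
        # zero-width split keeps the delimiter attached to its sentence
        parts = re.split(r"(?<=[。！？.!?])\s*", p.strip())
        segs = "".join(f"<seg>{s}</seg>" for s in parts if s)
        out.append(f"<p>{segs}</p>")
    out += ["</div>", "</body>"]
    return "\n".join(out)

# two <seg> elements inside one <p>
print(segment_to_tei(["福克先生就只是改良俱乐部的会员。仅此而已。"]))
```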
The body of a typical TEI (Text Encoding Initiative) document may be represented as shown in Figure 1. [4]

Figure 1. Tree representation of a structured document: nested <div> elements containing <p> elements, which in turn contain <s> elements.

<div>, <p> and <s> are the most common tags in the segmentation of the text, but <body> and <head> tags are also used under some circumstances. We tagged every chapter of the Jules Verne novel Around the World in 80 Days with a heading and a main text divided into paragraphs and segments, as illustrated in Figure 2.

Chinese version:

<body>
<div>
<head>第一章 斐利亚·福克和路路通建立主仆关系</head>
<p><seg>1872 年,白林敦花园坊赛微乐街七号(西锐登在 1814 年就死在这听住宅里),住着一位斐利亚·福克先生,这位福克先生似乎从来不做什么显以引人注目的事,可是他仍然是伦敦改良俱乐部里最特别、最引人注意的一个会员。</seg></p>
…
<p><seg>福克先生就只是改良俱乐部的会员,瞧,和盘托出,仅此而已。</seg><seg>如果有人以为象福克这样古怪的人,居然也能参加象改良俱乐部这样光荣的团体,因而感到惊讶的话,人们就会告诉他:福克是经巴林氏兄弟的介绍才被接纳入会的。</seg><seg>他在巴林兄弟银行存了一笔款子,因而获得了信誉,因为他的账面上永远有存款,他开的支票照例总是"凭票即付"。</seg></p>
…
<p><seg>现在赛微乐街的寓所里只剩下路路通一个人了。</seg></p>
</div>
</body>

English version:

<body>
<div>
<head>Chapter I IN WHICH PHILEAS FOGG AND PASSEPARTOUT ACCEPT EACH OTHER, THE ONE AS MASTER, THE OTHER AS MAN</head>
<p><seg>Mr. Phileas Fogg lived, in 1872, at No. 7, Saville Row, Burlington Gardens, the house in which Sheridan died in 1814. He was one of the most noticeable members of the Reform Club, though he seemed always to avoid attracting attention;</seg></p>
…
<p><seg>Phileas Fogg was a member of the Reform, and that was all.</seg><seg>The way in which he got admission to this exclusive club was simple enough. He was recommended by the Barings, with whom he had an open credit.</seg><seg>His cheques were regularly paid at sight from his account current, which was always flush.</seg></p>
…
<p><seg>Passepartout remained alone in the house in Saville Row.</seg></p>
</div>
</body>

Figure 2. The segmentation of the bitext of a chapter from the novel Around the World in 80 Days

The methods of segmentation are applied to each of the two texts separately.
The units are usually sentences, but they can also be larger, such as paragraphs, or smaller, such as words. Interestingly, one of the familiar circularities of computational linguistics, namely that sentences have to be marked before processing even though that processing itself determines what the sentences are, is present in the alignment problem as well. [3] Once sentences are tagged, segment alignment can be applied.

2. The alignment of the sentences.

Tagged texts can now be processed to align the segments using alignment systems, for instance XAlign (developed within LORIA), which is based on statistical methods. The goal of the alignment is to establish 1:1 relations at the segment level. In our project, we used the ACIDE system (Aligned Corpora Integrated Development Environment). It integrates the Loria alignment tools (XAlign and Concordancier) and tools for creating TMX and HTML formats of XML-aligned texts.

XAlign

XAlign is a common tool for multilingual text alignment, i.e. the mapping from a text to its translation in another language at a certain granularity level (paragraph, sentence or expression), which is one of the essential components of research in multilingual information extraction and answers the more industrial concerns of localization. [5] It is based on a statistical model and uses the hierarchical structure of documents. The texts are encoded in an XML format reflecting the hierarchy of divisions (recursively), paragraphs and sentences. Statistical models assume that blocks are approximately proportional in length to their equivalents (lengths being expressed in numbers of characters): a shorter sentence in the source text S tends to be translated into a shorter sentence in the target text T. This method originates with the Gale-Church alignment approach. [3] Among Western languages, one sentence in S usually corresponds exactly to one sentence in T, but 1:N, N:N and N:1 relations are also allowed.
For each character in a sentence P1 in text T1, let C be the expected number of characters that correspond to it in the sentence P2 in T2: C = l2/l1, where l1 and l2 are the lengths of P1 and P2, and let S² be the variance of this ratio. This means that one character in T1 is expected to be translated by C characters in T2; a sentence P2 in T2 corresponding to the translation of a sentence P1 in T1 will then have expected length l1·C with variance l1²·S². [6] An alignment algorithm developed on the basis of Dynamic Time Warping (DTW) [4] is used to find the best alignment pairs in our multilingual texts at the division, paragraph and sentence levels. DTW is a method that allows a computer to find an optimal match between two given sequences under certain restrictions: the sequences are "warped" non-linearly in the time dimension to determine a measure of their similarity independent of certain non-linear variations in that dimension. [Wikipedia] When alignment is done, the alignment information of the two texts is recorded using three types of tags:

• xptr: defines a pointer to an external document destination.
• link: defines a link between elements or groups of elements.
• linkGrp: defines a set of links.
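Under the length model above, alignment becomes a dynamic-programming search over pairings. The following Python sketch illustrates the idea; the values of c and s2 and the fixed skip penalty are illustrative assumptions, the search is restricted to 1:1 pairings plus skips, and this is not XAlign's actual implementation.

```python
import math

def length_cost(l1, l2, c=1.0, s2=0.09):
    """Cost of pairing segments of lengths l1 and l2 under the model
    above: l2 is expected to be near l1*c with variance l1^2 * s2."""
    if l1 == 0:
        return float(l2)
    return abs((l2 - l1 * c) / (l1 * math.sqrt(s2)))

def align(src, tgt, skip=3.0):
    """DTW-style search over 1:1 pairings plus 1:0/0:1 skips (a
    simplification of the 1:N / N:1 cases XAlign also allows)."""
    INF = float("inf")
    n, m = len(src), len(tgt)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0], back = 0.0, {}
    for i in range(n + 1):
        for j in range(m + 1):
            if i and j and D[i-1][j-1] + length_cost(src[i-1], tgt[j-1]) < D[i][j]:
                D[i][j] = D[i-1][j-1] + length_cost(src[i-1], tgt[j-1])
                back[i, j] = (i - 1, j - 1)
            if i and D[i-1][j] + skip < D[i][j]:
                D[i][j], back[i, j] = D[i-1][j] + skip, (i - 1, j)
            if j and D[i][j-1] + skip < D[i][j]:
                D[i][j], back[i, j] = D[i][j-1] + skip, (i, j - 1)
    # recover the optimal alignment path by backtracking
    path, ij = [], (n, m)
    while ij != (0, 0):
        path.append(ij)
        ij = back[ij]
    return path[::-1]

print(align([10, 20], [11, 19]))  # [(1, 1), (2, 2)]
```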
[6] An example of an XML file that records the alignment information:

<linkGrp crdate="empty" domains="b1 b1" evaluate="all"
  source="D:\Program Files\Acide\awork\vern-ch-en-01\vern-ch-en-01_f_id.xml"
  targFunc="null null" targOrder="Y" targType="seg"
  target="D:\Program Files\Acide\awork\vern-ch-en-01\vern-ch-en-01_s_id.xml"
  type="alignment">
  <xptr from="ID (n52)" id="x1"></xptr>
  <xptr from="ID (n53)" id="x2"></xptr>
  <xptr from="ID (n8)" id="x3"></xptr>
  …
  <link id="l1" targets="n19 n20" type="linking"></link>
  <link id="l2" targets="n22 n23" type="linking"></link>
  …
  <link targets="n51 x1"></link>
  <link targets="n52 x2"></link>
  <link targets="n8 x3"></link>
  …
</linkGrp>

In <xptr from="ID (n52)" id="x1"></xptr>, the segments (n1, n2, …) of the target text are pointed to by x1, x2, …, to distinguish them from the segments of the source text. For segments in the source text that have an N:1 or N:N relation with target segments, <link id="l1" targets="n19 n20" type="linking"></link> groups the source segments into a block, i.e. n19 and n20 in the source text are together relabeled as l1. Finally, <link targets="n51 x1"></link> links the segments in the source and target texts: in this example, n52 in the target text is relabeled as x1 and linked to n51 in the source text. The excerpt above is what the _fs.xml file, one of the three files produced by XAlign during the alignment process, looks like in a text editor.

XAlign is accompanied by a multilingual Concordancer. You can view the content of the corresponding segments in the Concordancer, which can change the pairing that XAlign made between the two texts, in case there were mistakes in the previous process. The XML file shown above is what the Concordancer deals with.

Alignment using Loria tools (XAlign and Concordancier)

Figure 3

The input texts must be XML files.
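The pointers and links in such a file can also be decoded programmatically. The sketch below works on a trimmed stand-in for the _fs.xml excerpt above, using Python's standard ElementTree; read_alignment is our own hypothetical helper, not part of the Loria tools.

```python
import xml.etree.ElementTree as ET

# trimmed stand-in for the _fs.xml excerpt shown above
XML = """<linkGrp type="alignment">
  <xptr from="ID (n52)" id="x1"></xptr>
  <link id="l1" targets="n19 n20" type="linking"></link>
  <link targets="n51 x1"></link>
</linkGrp>"""

def read_alignment(xml_text):
    root = ET.fromstring(xml_text)
    # map each xptr id (x1, x2, ...) back to its target-text segment id
    xptr = {x.get("id"): x.get("from").split("(")[1].rstrip(")")
            for x in root.iter("xptr")}
    # N:1 / N:N blocks: l1 -> [n19, n20]
    blocks = {l.get("id"): l.get("targets").split()
              for l in root.iter("link") if l.get("id")}
    pairs = []
    for l in root.iter("link"):
        if l.get("id"):
            continue  # block definitions were handled above
        src, tgt = l.get("targets").split()
        pairs.append((blocks.get(src, [src]), [xptr.get(tgt, tgt)]))
    return pairs

print(read_alignment(XML))  # [(['n51'], ['n52'])]
```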
The inputs are the 27 chapters of the novel Around the World in 80 Days, whose English and Chinese versions were both segmented in the previous phase. The MultiAlign properties file (in most cases Loria2\XAlign\properties\multialign.properties) specifies the tags and the way they will be treated by the Loria tools. After specifying the paths to the input texts (e.g. C:\Chinese.xml, C:\English.xml), the common prefix of the output files (e.g. aligned) and the path to the output directory (e.g. D:\result), click the 'Align' button and the output window will show the messages from the Loria tools. If everything runs without an error, ACIDE creates three files, D:\result\aligned_f_id.xml, D:\result\aligned_s_id.xml and D:\result\aligned_fs.xml, and opens D:\result\aligned_fs.xml in the Concordancier, in which the two versions of the text are roughly aligned, as shown in Figure 4. The segments, numbered and identified by n1, n2, etc. in both texts, are matched according to their semantic equivalence.

Figure 4. Concordancier

However, the bitext is not always well matched by XAlign, as statistical models have some shortcomings. Using sentence lengths as indications of correspondence in the bitext space may work well among Western languages due to their substantial similarities, but for a bitext pairing an East Asian language with a Western one, which have little in common, this method is not as effective. Therefore the Concordancier comes in handy for rematching the segments when there is discordance: we can click on the numbered segments to 'unlink' and 'link' them manually. Clicking on the Source ID column sorts the translation units, and clicking on a particular ID shows the specified translation unit. When alignment is finally done, we can make use of the aligned bitext and apply information extraction to these resources; for that purpose the corpus-processing system Unitex is of great use.
4. Unitex

Unitex is a collection of programs developed for the analysis of natural-language texts using linguistic resources and tools. With this tool you can handle electronic resources such as electronic dictionaries and grammars and apply them, working at the levels of morphology, the lexicon and syntax. The main functions are: [10]

• building, checking and applying electronic dictionaries;
• pattern matching with regular expressions and recursive transition networks;
• applying lexicon-grammar tables;
• handling ambiguity via the text automaton.

Preprocessing Texts

After loading a text, Unitex offers to preprocess it. You can choose among the following operations: normalization of separators, splitting into sentences, normalization of non-ambiguous forms, tokenization and application of dictionaries. Take the normalization of non-ambiguous forms for example: the "'re" in the sentence "You're a strange kid." is replaced by " are" (note the space in front of "are") to make it "You are a strange kid." You can replace such forms according to your own needs. Applying dictionaries to a text builds a subset of the dictionaries that covers only the word forms present in the text. The dictionaries look like this:

Figure 5. Word lists from an English text.

The DELA dictionaries

The electronic dictionaries distributed with Unitex use the DELA syntax (Dictionnaires Electroniques du LADL, LADL electronic dictionaries). This syntax describes the simple and compound lexical entries of a language together with their grammatical, semantic and inflectional information. The terms DELAF and DELAS distinguish the inflected and non-inflected dictionaries, whether they contain simple words, compound words or both. But to apply dictionaries, you need to obtain them first. [10] Dictionaries are essential for the functionality of corpus-processing systems like Unitex.
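The normalization of non-ambiguous forms described in the preprocessing step above amounts to a simple string-rewriting pass. The sketch below uses a toy replacement table of our own; in Unitex the rules come from a normalization grammar, not from code like this.

```python
# Toy subset of a normalization table for non-ambiguous forms;
# Unitex reads such rules from a replacement grammar, not from Python.
RULES = {
    "'re": " are",
    "'ve": " have",
    "'ll": " will",
}

def normalize(text):
    """Rewrite each unambiguous contracted form to its full form."""
    for form, repl in RULES.items():
        text = text.replace(form, repl)
    return text

print(normalize("You're a strange kid."))  # You are a strange kid.
```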
Dictionaries for languages like English are already available and ship with the Unitex system, but for Chinese there are still gaps. To support the exploration of Chinese text extraction, we built some simple Chinese dictionaries. So far we have made lists including Countries, Currencies, Numbers, Measurement_units, etc. Some examples of the entries:

厘米,N+Measurement+Unit+Length/centimeter (cm)
星期一,N+Day/Monday
英尺,N+Measurement+Unit+Length/foot (ft)
阿鲁巴岛,N+Country+Island/Aruba Island

A dictionary of cities was also made to assist in the exploitation of the novel Around the World in 80 Days:

伦敦,N+City/London
巴黎,N+City/Paris
都灵,N+City/Turin
苏伊士,N+City/Suez
布尔迪西,N+City/Brindisi
孟买,N+City/Bombay,Mumbai
加尔各答,N+City/Calcutta
香港,N+City/Hong Kong
上海,N+City/Shanghai
横滨,N+City/Yokohama
旧金山,N+City/San Francisco
奥马哈,N+City/Omaha
丹佛,N+City/Denver
匹兹堡,N+City/Pittsburgh
芝加哥,N+City/Chicago
纽约,N+City/New York
加的夫,N+City/Cardiff
利物浦,N+City/Liverpool

But as mentioned in the section on the characteristics of Chinese, word is a vague concept. Combinations of Chinese characters can yield numerous words and phrases, whose number, as time goes on, grows ever larger. And there is no commonly acknowledged lexicon to apply. Making Chinese dictionaries is going to be hard work.

Searching using regular expressions

We can use regular expressions in Unitex in order to search for simple patterns. A regular expression can be: [10]

• a token (book) or a lexical mask (<smoke.V>);
• the concatenation of two regular expressions (he smokes);
• the union of two regular expressions (Pierre+Paul);
• the Kleene star of a regular expression (bye*).

In our project, only graphs are used in pattern searching.
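Entries in this format can be decoded with a small helper; the sketch below is our own hypothetical parse_dela function, and it relies on the DELA convention that '/' introduces a comment field, which our dictionaries use to hold the English gloss.

```python
def parse_dela(line):
    """Parse one DELA-style entry, e.g.
    '伦敦,N+City/London' -> (form, codes, gloss)."""
    form, rest = line.split(",", 1)
    if "/" in rest:
        codes, gloss = rest.split("/", 1)
    else:
        codes, gloss = rest, ""
    return form, codes.split("+"), gloss

entry = parse_dela("厘米,N+Measurement+Unit+Length/centimeter (cm)")
print(entry)  # ('厘米', ['N', 'Measurement', 'Unit', 'Length'], 'centimeter (cm)')
```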
Searching using graphs

Unitex can handle several types of graphs corresponding to the following uses: automatic inflection of dictionaries, preprocessing of texts, normalization of text automata, dictionary graphs, search for patterns, disambiguation and automatic graph generation. [10] We will elaborate on the use of graphs in the next section.

Unitex can be viewed as a tool into which one can put linguistic resources and use them. Its technical characteristics are its portability, modularity, the possibility of dealing with languages that use special writing systems (e.g. many Asian languages), and its openness. Its linguistic characteristics are the ones that have motivated the elaboration of these resources: precision, completeness, and the taking into account of frozen expressions, most notably those which concern the enumeration of compound words. [10]

5. Application: Information Extraction from Chinese-English Bitext

The graph is a powerful device: it makes abstract expressions and patterns vivid and comprehensible, and building and editing graphs is simple in Unitex. By applying a graph to the text, we can extract the information we want, as long as we accurately describe the pattern in the graph.

Graph Building for the Chinese Text

Taking the expression of time in Chinese as an example, here is an illustration of information extraction from the Chinese bitext using Unitex. In Chinese there are several ways to express time. Every time on the full hour is expressed in the same form: hour number + 点/时 (o'clock) + optional 正. For instance, 12:00 can be written 十二点 or 十二点正; using Arabic numerals, it would be 12 点(正). The same holds for every other full hour, from 0 to 24. To express that pattern, a graph like the following is appropriate:

Figure 6. Graph for the time on the hour.

Figure 7.
Subgraph chour

The labels in grey, :hour and :chour, are subgraphs that find the patterns of Arabic numbers from 0 to 24 and of the same numbers written in Chinese characters, respectively. Using this graph, we can capture all text that has a number from 0 to 24, Arabic or Chinese, followed by the Chinese character 点 or 时 and an optional 正. However, the pattern number + 点 does not only denote a time of day; it can also appear in the scoring of games such as poker, because 点, while carrying the meaning of "o'clock", also means "point(s)". Apart from this, in Chinese "一点" can mean "one o'clock" or "a little", and the latter is very commonly used. The system cannot differentiate between the two. In short, this pattern search can sometimes return more than we are looking for.

The full hour is just one special case of 'time'; here are the other ways in which the time is likely to be told:

12:05
十二点(零)五分, 12 点(0)5 分: twelve O five, with the 'O' optional.
十二点过五分: five past twelve.

12:10
十二点十分, 12 点 10 分: twelve ten. (As in English, this is the most common way of telling time; any time can be expressed this way.)
十二点过十分: ten past twelve.

12:15
十二点(过)十五分, 12 点(过)15 分: twelve fifteen, fifteen past twelve.
十二点一刻, 12 点 1 刻, 12 点一刻: a quarter past twelve.

12:30
十二点(过)三十分, 12 点(过)30 分: twelve thirty, thirty minutes past twelve.
十二点半, 12 点半: half past twelve.

While 过 means "past" and is generally used for the first 30 minutes within the hour, for minutes past 30 we use 差, which means "lack":

12:45
十二点四十五分, 12 点 45 分: twelve forty-five.
十二点三刻, 12 点 3 刻: three quarters past twelve (very rarely used).
一点差一刻, 1 点差 1 刻: a quarter to one.
一点差十五分, 1 点差 15 分: fifteen minutes to one.

This covers nearly all the ways of telling time in modern Chinese. The ancient Chinese time expressions are a totally different system, rarely used or seen outside classical literature; we bypass them here to avoid the extra complexity.
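Outside Unitex, the patterns enumerated above can be approximated by an ordinary regular expression. The following Python sketch is only a rough equivalent of the graph, not the graph itself, and deliberately uses the loose [0,99]-style number ranges discussed below rather than exact hour and minute ranges.

```python
import re

CN_NUM = "〇零一二三四五六七八九十两"
NUM = rf"(?:\d{{1,2}}|[{CN_NUM}]+)"
# hour + optional 差/过 + minutes (分), quarters (刻), 半 or 正,
# mirroring the patterns enumerated above (a simplified sketch)
TIME_RE = re.compile(
    rf"{NUM}\s*[点时]\s*(?:差\s*{NUM}\s*[分刻]|(?:过\s*)?{NUM}\s*[分刻]|半|正)?"
)

for t in ["十二点过五分", "12点45分", "十二点半", "一点差一刻"]:
    print(t, "->", TIME_RE.match(t).group())
```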
The following is what the graph containing all the time-expression patterns looks like.

Figure 8. Graph for the modern Chinese time expression

During graph building, you can define the elements precisely, to avoid capturing outliers, or you can make a rough definition, knowing that the chances are small of capturing anything other than what you want. In defining the hours and minutes, a rough definition could be [0,99] + 点 (hour) + [0,99] + 分 (minute). You can still capture all the times matching this pattern in the text, even though there is a possibility of catching something like 34 点 67 分, which makes no sense under normal circumstances and thus hardly ever occurs. And defining a range between 0 and 99 is much easier than defining ranges between 0 and 24 and between 0 and 59 separately. Nevertheless, the creed of science is to be as precise as possible, within the complexity we can handle. Once the graph is drawn, searching for the pattern becomes easy.

Pattern Searching using Graphs

Load the text that we want to exploit.

Figure 9. The first chapter, in Chinese, of the novel Around the World in 80 Days

Go to Text -> Locate Pattern, and the following window appears:

Figure 9. Locate Pattern

Set the path to the location of the graph on the computer and click "SEARCH". Unitex will quickly show the result of the pattern matching. Then we build a concordance and find the text matching the pattern, as shown in Figure 10.

Figure 10. Concordance of the pattern

The extracted texts are technically all correct. But in context, we can tell that the first two are not the 'time' we were searching for: as explained earlier, "一点" can mean "one o'clock" or "a little", and in this context both of them mean "a little".

Graph Building for the English Text

I planned to make a corresponding English time graph, but it turned out to be a bad example: there are too many outliers falling within the time pattern, and there is no easy way to eliminate them.
I might need to choose another example. Building a graph for searching the English text is more or less the same. Take the for example. Then compare the results, Chinese and English, with each other.

6. About NooJ

NooJ Technology

NooJ evolved from INTEX, a former linguistic development environment written in C++. NooJ is based on the .NET platform, written in C# in the .NET/Visual Studio computing environment, using all the benefits of the "Component Programming" methodology as well as free automatic memory management. [11] Both NooJ and INTEX are based on Maurice Gross' concepts of electronic dictionaries, lexicon-grammars and finite-state transducers. A finite-state transducer (FST) is a graph that represents a set of text sequences and associates each recognized sequence with some analysis result. The text sequences are described in the input part of the FST; the corresponding results are described in the output part. Typically, a syntactic FST represents word sequences and produces linguistic information (such as phrasal structure), while a morphological FST represents sequences of letters that spell a word form and produces lexical information (such as a part of speech or a set of morphological, syntactic and semantic codes). [12]

Linguistic Resources

There are two types of linguistic resources used in NooJ:

Dictionaries

Dictionaries (.dic files) usually associate words or expressions with a set of information, such as a category (e.g. "Verb"), one or more inflectional and/or derivational paradigms, and one or more semantic properties. [12] All the dictionaries built for Unitex can easily be transformed to be applied in NooJ.

Grammars

Grammars are used to represent a large gamut of linguistic phenomena, from the orthographical and morphological levels up to the syntagmatic and transformational syntactic levels. [12] NooJ morphological and syntactic grammars are structured libraries of graphs.
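The input/output behaviour of an FST can be illustrated with a toy transducer. The states, transitions and <NP> output below are invented for illustration; real NooJ and INTEX FSTs are compiled from graphs, not hand-written tables.

```python
# A minimal finite-state transducer: each recognized input sequence
# (here, word forms) is mapped to an analysis on the output side.
TRANSITIONS = {
    ("q0", "the"): ("q1", "<NP>the"),
    ("q1", "reform"): ("q2", " reform"),
    ("q2", "club"): ("qf", " club</NP>"),
}
FINAL = {"qf"}

def transduce(tokens):
    """Run the transducer; return its output, or None if the input
    sequence is not recognized."""
    state, out = "q0", []
    for tok in tokens:
        nxt = TRANSITIONS.get((state, tok.lower()))
        if nxt is None:
            return None  # sequence not recognized
        state, piece = nxt
        out.append(piece)
    return "".join(out) if state in FINAL else None

print(transduce(["the", "Reform", "Club"]))  # <NP>the reform club</NP>
```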
On the NooJ website there are a dozen modules available for download, including some Romance, Germanic, Slavic, Semitic and Asian languages, as well as Hungarian. The available Chinese module is based on Traditional Chinese, which is mainly used in Taiwan, Hong Kong and some other Asian regions. The module includes:

A set of dictionaries for the literary text Dream of the Red Chamber:

• DRC-DIC.nod: covers the general vocabulary of the text
• DRC-GEO.nod: a list of all the location names in the text
• DRC-PROPERNAMES.nod: a list of all the characters' names in the text

A set of dictionaries for modern Chinese:

• DIC.nod: general vocabulary
• GEO.nod: a list of location names
• PROPERNAMES.nod: a list of famous characters in Chinese history and literature
• LASTNAMES.nod: a list of Chinese last names
• PROVERBS.nod: a list of Chinese proverbs, idioms, etc.
• BOOKS.nod: a series of book titles

However, Simplified Chinese text can work well in this module too, as long as there are simplified Chinese dictionaries and morphological grammar graphs to support the processing of the texts. There is no big difference between Traditional and Simplified Chinese except that some of the corresponding characters are written differently; for example the simplified character 这 (meaning "this") corresponds to the traditional character 這.

7. Working with NooJ

After compiling the dictionaries and building the morphological grammar graphs, which are more or less the same as in Unitex, we can use them to explore the text. Load a text.

Figure. The text loaded in NooJ.

After the text is loaded, we can right-click and choose to run a "linguistic analysis" on the text, then choose "Locate Pattern". There are four ways to specify the pattern. Set the regular expression to <N+Country> to find the country names in the text.

Figure. Locate a Pattern

And the result is:

Figure. The result of locating a regular-expression pattern.

Grammar graphs can also be used as another way of locating patterns.
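The effect of applying a lexical mask such as <N+City> with our dictionary can be simulated outside NooJ. The sketch below performs a longest-match scan, needed because Chinese text has no word separators, over a toy three-entry dictionary; it is an illustration of the idea, not how NooJ or Unitex actually apply their compiled dictionaries.

```python
# Toy stand-in for applying the lexical mask <N+City> with our
# Chinese city dictionary (entries as in the DELA lists above).
CITY_DIC = {"伦敦": "London", "香港": "Hong Kong", "上海": "Shanghai"}

def locate_cities(text):
    """Longest-match scan, since Chinese text has no word separators."""
    hits, i = [], 0
    while i < len(text):
        for n in range(4, 0, -1):  # try the longest candidate first
            w = text[i:i + n]
            if w in CITY_DIC:
                hits.append((w, CITY_DIC[w]))
                i += n
                break
        else:
            i += 1
    return hits

print(locate_cities("他从伦敦出发,经过香港到上海。"))
```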
A graph built to express the time expressions in Chinese:

…………..

References

1. Oren Etzioni, Michele Banko, Stephen Soderland, and Daniel S. Weld. Open Information Extraction from the Web. Communications of the ACM, December 2008, vol. 51, no. 12, 68-74.
2. Nancy A. Chinchor. Overview of MUC-7/MET-2.
3. Eric Laporte, Dusko Vitas, Cvetana Krstev. Preparation and Exploitation of Bilingual Texts.
4. Bonhomme, P., Romary, L. Parallel alignment of structured documents. In: Parallel Text Processing, Jean Véronis (Ed.) (2000) 233-253.
5. Project-Team L&D. INRIA Activity Report, 2004.
6. Bonhomme, P., Romary, L. (1995). The Lingua Parallel Concordancing Project: Managing Multilingual Texts for Educational Purposes.
7. Cate Coburn. Spotlight on Chinese: About the Chinese Language. Center for Applied Linguistics. http://www.cal.org/resources/discoverlanguages/chinese/index.html
8. Shiren Ye, Tat-Seng Chua, Jimin Liu. An Agent-based Approach to Chinese Named Entity Recognition. Proceedings of the 19th International Conference on Computational Linguistics (COLING), Volume 1, pp. 1-7, 2002.
9. Jian Zhang, Jianfeng Gao, Ming Zhou. Extraction of Chinese Compound Words: An Experimental Study on a Very Large Corpus. 2000.
10. Sébastien Paumier, et al. Unitex 2.1 User Manual.
11. http://www.nooj4nlp.net
12. Max Silberztein. NooJ Manual. 2002.