Information extraction from Chinese-English bitext
1. Introduction
Information extraction (IE) is a type of information retrieval (IR) technology that
automatically maps natural-language text into structured relational data, i.e.
categorized and contextually and semantically well-defined data. At the core of an IE
system is an extractor, which processes text; it overlooks irrelevant words and phrases
and attempts to home in on entities and the relationships between them. [1] While IR
retrieves relevant documents from collections, IE retrieves relevant information from
documents.
The significance of information extraction has become increasingly apparent, as the
amount of information available in unstructured form is growing exponentially. The
Internet is a case in point. Through information extraction, knowledge can be made
more accessible by means of transformation into relational data, or by marking-up
with XML tags. Existing IE techniques range from direct knowledge-based encoding
(a human enters regular expressions or rules) to supervised learning (a human
provides labeled training examples) to self-supervised learning (the system
automatically finds and labels its own examples). [1]
The term Named Entity (NE) was first introduced in the Message Understanding
Conferences (MUC); it is now a widely used term in Information Extraction (IE),
Question Answering (QA) and other Natural Language Processing (NLP) applications.
At the level of entity extraction, named entities were defined as proper names
and quantities of interest: person, organization, and location names were marked, as
well as dates, times, percentages, and monetary amounts. [2] Named entity
recognition (NER) (also known as entity identification and entity extraction) is a
subtask of information extraction that seeks to locate and classify atomic elements in
text into predefined categories of named entities.
A bitext is a merged document composed of two versions of a given text, usually in
two different languages. An aligned bitext is produced by an alignment tool or aligner,
which automatically aligns or matches the different versions of the same text,
generally sentence by sentence. [3]
Based on different approaches (statistical, linguistic, etc.) to multilingual corpus
processing, there are many tools that can be used. Unitex is a multi-platform system
built on high-quality language resources such as electronic lexicons and grammars;
it is one of the few systems that include both corpus-processing and
resource-management functionality. NooJ is another text-processing tool, based on
large-coverage dictionaries as well as morphological and syntactic grammars
described using graphs.
This thesis is about information extraction from Chinese-English bitext using the
Unitex system, with a brief comparison with NooJ.
2. The Characteristics of Chinese
Chinese comprises Pinyin (phonetics) and Hanzi (Chinese characters). Pinyin is
the romanization system for Chinese; it literally means "spelling sound". Chinese
characters are transcribed into the Roman alphabet to help provide a visual
representation of Chinese sounds. Nowadays pinyin is also a common typing
method for entering Chinese characters on computers and cell phones.
There are tens of thousands of Chinese characters. However, it is estimated that basic
Chinese literacy can be achieved with knowledge of 2,000 to 3,500 characters. [7] The
Chinese characters are logographic symbols. Each individual character represents an
idea or thing. The combinations of characters express different meanings, which are
usually but not necessarily the combinations of each character’s meaning.
Chinese syntax is, in a way, similar to English. Sentences are often formed by stating
a subject which is followed by a predicate. The predicate can be an intransitive verb, a
transitive verb followed by a direct object, a linking verb followed by a predicate
nominative, etc. The most common sentence structure has SVO (subject + verb +
object) word order.
Some of its characteristics that are relevant to our project are:
• Chinese does not have tenses. Time is indicated by adverbs of time
('tomorrow', 'just now') or by particles.
• Chinese does not use grammatical gender.
• There is no grammatical distinction between singular and plural; the distinction is
conveyed by sentence structure.
• All words have only one grammatical form: there are no changes in the form of a
word through inflection of verbs according to tense, mood and aspect.
• Chinese sentences are written as character strings with no spaces between words.
"Word" is a vague concept in Chinese, a word being defined as one or more
characters representing a linguistic token. [8] Words in Chinese are not explicitly
marked in sentences, and there is no commonly accepted Chinese lexicon. [9]
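Because words are not delimited, any processing of Chinese text must first segment it against some lexicon. As an illustration only (the toy lexicon and the function below are ours, not part of any tool discussed here), a common baseline is greedy forward maximum matching:

```python
def forward_max_match(sentence, lexicon, max_len=4):
    """Greedy forward maximum matching: at each position, take the
    longest dictionary word; fall back to a single character."""
    tokens, i = [], 0
    while i < len(sentence):
        for j in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + j]
            if j == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += j
                break
    return tokens

# Toy lexicon; a real system would need a far larger resource.
lexicon = {"伦敦", "改良", "俱乐部", "会员"}
print(forward_max_match("伦敦改良俱乐部会员", lexicon))
# ['伦敦', '改良', '俱乐部', '会员']
```

The sketch also shows why the lexicon matters so much: any string absent from it falls apart into single characters.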
3. Bitext - parallelized text
In our project, we took the Chinese and English texts of Jules Verne's famous novel
Around the World in Eighty Days for exploration.
In general, the bitext construction proceeds in two main steps:
• Segmentation of the text into sentences.
• Alignment of the sentences.
1. Segmentation of text into sentences.
The common methods of alignment of a bitext usually assume that before alignment
both texts have been marked up, which means that the elements of its logical layout
were explicitly and unambiguously annotated. [3] Extensible Markup Language (XML)
is used to tag the logical layout. The marked-up XML document can be viewed as a
tree structure that has leaf nodes and labeled internal nodes.
In this structure, each node is labeled with its element name, and leaf nodes are
either elementary character chunks containing no tags or empty elements. The
body of a typical TEI (Text Encoding Initiative) document may be represented as
shown in Figure 1. [4]
Figure 1. Tree representation of a structured document (a tree of nested Div
elements whose P children contain S sentence leaves)
<div>, <p> and <s> are the most common tags in the segmentation of the text, but
<body> and <head> tags are also used in some circumstances. Each chapter of the
novel Around the World in Eighty Days was tagged with a heading and a main text
divided into paragraphs and segments, as illustrated in Figure 2.
<body>
<div>
<head>第一章 斐利亚·福克和路路通建立主仆关系</head>
<p><seg>1872 年,白林敦花园坊赛微乐街七号(西锐登在 1814 年就死在这听住宅里),住着一位斐利亚·福克先生,这位福克先生似乎从来不做什么显以引人注目的事,可是他仍然是伦敦改良俱乐部里最特别、最引人注意的一个会员。</seg></p>
… …
<p><seg>福克先生就只是改良俱乐部的会员,瞧,和盘托出,仅此而已。</seg><seg>如果有人以为象福克这样古怪的人,居然也能参加象改良俱乐部这样光荣的团体,因而感到惊讶的话,人们就会告诉他:福克是经巴林氏兄弟的介绍才被接纳入会的。</seg><seg>他在巴林兄弟银行存了一笔款子,因而获得了信誉,因为他的账面上永远有存款,他开的支票照例总是"凭票即付"。</seg></p>
… …
<p><seg>现在赛微乐街的寓所里只剩下路路通一个人了。</seg></p>
</div>
</body>

<body>
<div>
<head>Chapter I IN WHICH PHILEAS FOGG AND PASSEPARTOUT ACCEPT EACH OTHER, THE ONE AS MASTER, THE OTHER AS MAN</head>
<p><seg>Mr. Phileas Fogg lived, in 1872, at No. 7, Saville Row, Burlington Gardens, the house in which Sheridan died in 1814. He was one of the most noticeable members of the Reform Club, though he seemed always to avoid attracting attention;</seg></p>
… …
<p><seg>Phileas Fogg was a member of the Reform, and that was all.</seg><seg>The way in which he got admission to this exclusive club was simple enough. He was recommended by the Barings, with whom he had an open credit.</seg><seg>His cheques were regularly paid at sight from his account current, which was always flush.</seg></p>
… …
<p><seg>Passepartout remained alone in the house in Saville Row.</seg></p>
</div>
</body>

Figure 2. The segmentation of the bitext of a chapter
from the novel Around the World in Eighty Days
The segmentation methods are applied to each of the two texts separately. The units
are usually sentences, but they can also be larger, such as paragraphs, or smaller,
such as words. Interestingly, one of the familiar circularities of computational
linguistics, namely the fact that sentences have to be marked before processing,
though that processing itself will determine what the sentences are, is present in the
alignment problem as well. [3] Once sentences are tagged, segment alignment can
be applied.
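As a rough sketch of this segmentation step (a deliberate simplification of what a real tool does; the function names are ours), sentences can be cut at terminal punctuation and wrapped in TEI-style <seg> elements:

```python
import re

def split_sentences(text):
    """Naive sentence splitter on Chinese/Western terminal punctuation,
    keeping each punctuation mark attached to its sentence."""
    parts = re.split(r'(?<=[。!?.!?])', text)
    return [p.strip() for p in parts if p.strip()]

def to_segs(text):
    """Wrap each sentence in a TEI-style <seg> element."""
    return "".join(f"<seg>{s}</seg>" for s in split_sentences(text))

print(to_segs("他去了伦敦。然后回来了!"))
# <seg>他去了伦敦。</seg><seg>然后回来了!</seg>
```

A real segmenter must also handle abbreviations, quotations and nested punctuation, which is exactly where the circularity mentioned above bites.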
2. The alignment of the sentences.
Tagged texts can now be processed to align the segments using alignment systems,
for instance XAlign (developed within LORIA), which is based on statistical methods.
The goal of the alignment is to establish 1:1 relations at the segment level.
In our project, we used the ACIDE system (Aligned Corpora Integrated Development
Environment). It integrates the Loria alignment tools (XAlign and Concordancier)
with tools for creating TMX and HTML formats of XML aligned texts.
XAlign
XAlign is a common tool for multilingual text alignment, i.e. the mapping from a text
to its translation in another language at a certain granularity level (paragraph,
sentence or expression). Such alignment is one of the essential components of
research in multilingual information extraction, and it also answers the more
industrial concern of localization. [5]
It is based on a statistical model and uses the hierarchical structure of documents.
The texts are encoded in an XML format reflecting the hierarchy of divisions
(recursively), paragraphs and sentences. The statistical model assumes that blocks
are approximately proportional in length to their equivalents (lengths being
expressed in numbers of characters): a shorter sentence in the source text S tends to
be translated into a shorter sentence in the target text T. This method originates in
the Gale-Church approach. [3]
Among western languages, one sentence in S usually corresponds exactly to one
sentence in T, but 1:N, N:N and N:1 relations are also allowed. Let c = l2/l1 be the
expected number of characters in T2 corresponding to one character in T1, where l1
and l2 are the lengths of T1 and T2, and let s² be the variance of this ratio. One
character in T1 is thus expected to be translated by c characters in T2, and a
sentence P2 in T2 corresponding to the translation of a sentence P1 of length l1 in
T1 is expected to have length l1·c, with variance l1²·s². [6]
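This length model can be turned into a cost for candidate sentence pairs: the further the observed target length deviates from its expected value, the worse the pairing. A minimal sketch, assuming the model above (the function name and the numeric values are ours):

```python
import math

def length_cost(l1, l2, c, s2):
    """Cost of aligning a source sentence of l1 characters with a target
    sentence of l2 characters: the target length is expected to be l1*c
    with variance l1**2 * s2.  Lower cost means a more plausible pair."""
    delta = (l2 - l1 * c) / math.sqrt(l1 ** 2 * s2)
    return delta ** 2  # squared standardized deviation

# If the target text is ~1.4x the source in characters (c = 1.4),
# a 100-char sentence is expected to map to ~140 target chars:
print(length_cost(100, 140, 1.4, 0.06))   # 0.0 (perfectly expected)
print(length_cost(100, 300, 1.4, 0.06) > length_cost(100, 140, 1.4, 0.06))
# True: a 300-char target is a much less plausible match
```

The alignment algorithm then searches for the sequence of pairings that minimizes the total cost.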
An alignment algorithm developed on the basis of Dynamic Time Warping (DTW) [4]
is used to find the best alignment pairs in our multilingual texts at the division,
paragraph and sentence level. DTW is a method that allows a computer to find an
optimal match between two given sequences under certain restrictions: the
sequences are "warped" non-linearly in the time dimension to determine a measure
of their similarity independent of certain non-linear variations in that dimension.
[Wikipedia]
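A minimal version of the DTW recurrence, applied here to sequences of sentence lengths (a simplification of what XAlign actually optimizes; the code and the sample numbers are ours), might look like:

```python
def dtw(a, b, dist):
    """Dynamic time warping: minimal cumulative distance between two
    sequences, allowing non-linear stretching along their indices."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            # Each cell extends the cheapest of: skip in a, skip in b,
            # or advance in both (the diagonal, i.e. a 1:1 pairing).
            D[i][j] = d + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# Sentence-length sequences of two texts; the target has one extra
# sentence, which DTW absorbs by a non-diagonal step.
src = [10, 50, 12]
tgt = [11, 48, 48, 13]
print(dtw(src, tgt, lambda x, y: abs(x - y)))   # 6.0
```

Combined with the length cost above as the distance function, this yields the best alignment path through the bitext.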
When the alignment is done, the alignment information of the two texts is recorded
using three types of tags:
• xptr: defines a pointer to a destination in an external document.
• link: defines a link between elements or groups of elements.
• linkGrp: defines a set of links. [6]
An example of an XML file that records the alignment information is the following:
<linkGrp crdate="empty" domains="b1 b1" evaluate="all" source="D:\Program
Files\Acide\awork\vern-ch-en-01\vern-ch-en-01_f_id.xml" targFunc="null null"
targOrder="Y" targType="seg" target="D:\Program
Files\Acide\awork\vern-ch-en-01\vern-ch-en-01_s_id.xml" type="alignment">
<xptr from="ID (n52)" id="x1"></xptr>
<xptr from="ID (n53)" id="x2"></xptr>
<xptr from="ID (n8)" id="x3"></xptr>
…
<link id="l1" targets="n19 n20" type="linking"></link>
<link id="l2" targets="n22 n23" type="linking"></link>
…
<link targets="n51 x1"></link>
<link targets="n52 x2"></link>
<link targets="n8 x3"></link>
…
</linkGrp>
In <xptr from="ID (n52)" id="x1"></xptr>, the segments (n1, n2, ...) of the target
text are given the identifiers x1, x2, ... to distinguish them from the segments of
the source text. For source-text segments that have an N:1 or N:N relation with the
target segments, <link id="l1" targets="n19 n20" type="linking"></link> groups the
source segments into a block, i.e. n19 and n20 in the source text are together
relabeled as l1. Finally, <link targets="n51 x1"></link> links the segments of the
source text and the target text. In this example, n52 in the target text is relabeled
as x1 and linked to n51 in the source text.
The excerpt above is what a _fs.xml file, one of the three files produced by XAlign
during the alignment process, looks like in a text editor.
XAlign is accompanied by a multilingual Concordancer. You can view the content of
the corresponding segments in the Concordancer, and change the pairing XAlign
made between the two texts in different languages, in case there were mistakes in
the previous process. The XML file shown above is what the Concordancer deals
with.
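For illustration, the alignment links in such a _fs.xml file can be read back with a standard XML parser. The snippet below uses a reduced inline document with the same tag and attribute names as the excerpt above; the parsing logic is ours, not part of the Loria tools:

```python
import xml.etree.ElementTree as ET

# A reduced linkGrp, mirroring the excerpt shown earlier.
XML = """<linkGrp type="alignment">
  <xptr from="ID (n52)" id="x1"/>
  <link id="l1" targets="n19 n20" type="linking"/>
  <link targets="n51 x1"/>
</linkGrp>"""

root = ET.fromstring(XML)

# xptr elements: target-text segments relabelled as x1, x2, ...
xptrs = {x.get("id"): x.get("from") for x in root.iter("xptr")}
# link elements without an id carry the final source<->target pairing
pairs = [tuple(l.get("targets").split())
         for l in root.iter("link") if l.get("id") is None]

print(xptrs)   # {'x1': 'ID (n52)'}
print(pairs)   # [('n51', 'x1')]
```

Such a reader makes it easy to post-process or verify the alignment outside the Concordancer.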
Alignment using Loria tools (XAlign and Concordancier)
Figure 3
The input texts must be XML files: here, the 27 chapters of the novel Around the
World in Eighty Days, whose English and Chinese versions were both segmented in
the previous phase.
The MultiAlign properties file (in most cases
Loria2\XAlign\properties\multialign.properties) specifies the tags and the way they
will be treated by the Loria tools.
After specifying the paths to the input texts (e.g. C:\Chinese.xml, C:\English.xml),
the common prefix of the output files (e.g. aligned) and the path to the output
directory (e.g. D:\result), click the button 'Align' and the output window will show
the messages from the Loria tools. If everything runs without error, Acide creates
three files: D:\result\aligned_f_id.xml, D:\result\aligned_s_id.xml and
D:\result\aligned_fs.xml, and it opens D:\result\aligned_fs.xml in the
Concordancier, in which the two versions of the text are roughly aligned, as shown
in Figure 4. The segments, numbered and identified as n1, n2, etc. in both texts, are
matched according to their semantic equivalence.
Figure 4. Concordancier
However, the bitext is not always well matched by XAlign, as statistical models have
some shortcomings. Using sentence lengths as indications of correspondence in the
bitext space may work well among western languages, due to their substantial
similarities, but for a bitext pairing an East Asian language with a western language,
which have little in common, this method is not as effective. The Concordancier
therefore comes in handy for rematching the segments when there is discordance:
we can click on the numbered segments to 'unlink' and 'link' them manually.
Clicking on the Source ID column sorts the translation units, and clicking on a
particular ID shows the specified translation unit.
When the alignment is finally done, we can make use of the aligned bitext and apply
information extraction to these resources; for that purpose, the corpus-processing
system Unitex is of great use.
4. Unitex
Unitex is a collection of programs developed for the analysis of natural-language
texts using linguistic resources and tools. With this tool, you can handle electronic
resources such as electronic dictionaries and grammars and apply them, working at
the levels of morphology, the lexicon and syntax. The main functions are: [10]
• building, checking and applying electronic dictionaries
• pattern matching with regular expressions and recursive transition networks
• applying lexicon-grammar tables
• handling ambiguity via the text automaton
Preprocessing Texts
After loading the text, Unitex offers to preprocess it. You can choose among the
following operations: normalization of separators, splitting into sentences,
normalization of non-ambiguous forms, tokenization and application of dictionaries.
Take the normalization of non-ambiguous forms for example: the "'re" in the
sentence "You're a strange kid." is replaced by " are" (note the space in front of
"are") to make it "You are a strange kid.". You can replace such forms according to
your own needs.
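In Unitex this normalization is driven by a replacement table. An equivalent sketch in plain code (the table entries other than "'re" -> " are" are our own assumptions, not taken from the Unitex distribution) could be:

```python
# Each pair: non-ambiguous surface form -> normalized form.  Note the
# leading space on the replacements, as in the "'re" -> " are" example.
NORMALIZATION = {
    "'re": " are",
    "'ll": " will",
    "won't": "will not",
}

def normalize(text):
    """Rewrite non-ambiguous contracted forms before tokenization."""
    for src, dst in NORMALIZATION.items():
        text = text.replace(src, dst)
    return text

print(normalize("You're a strange kid."))
# You are a strange kid.
```

Only forms with a single possible expansion belong in such a table; ambiguous ones (like "'s") must be left to later analysis.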
Applying dictionaries to a text will build a subset of dictionaries that only covers
forms of words that are present in the text. The dictionaries look like this:
Figure 5. Word lists from an English text.
The DELA dictionaries
The electronic dictionaries distributed with Unitex use the DELA syntax
(Dictionnaires Electroniques du LADL, LADL electronic dictionaries). This syntax
describes the simple and compound lexical entries of a language with their
8
grammatical, semantic and inflectional information. The terms DELAF and DELAS
are used to distinguish the inflected and non-inflected dictionaries, no matter they
contain simple word, compound words or both. But to apply dictionaries, you need to
obtain dictionaries first. [10]
Dictionaries are essential to the functionality of corpus-processing systems like
Unitex. Dictionaries for languages like English are already available and shipped
with the Unitex system, but for Chinese such dictionaries are still lacking.
To support the exploration of Chinese text extraction, we built some simple Chinese
dictionaries. So far we have made lists covering countries, currencies, numbers,
measurement units, etc. Some examples from the dictionaries:
厘米,N+Measurement+Unit+Length/centimeter (cm)
星期一,N+Day/Monday
英尺,N+Measurement+Unit+Length/foot (ft)
阿鲁巴岛,N+Country+Island/Aruba Island
A dictionary of cities was also made to assist the exploitation of the novel Around
the World in Eighty Days.
伦敦,N+City/London
巴黎,N+City/Paris
都灵,N+City/Turin
苏伊士,N+City/Suez
布尔迪西,N+City/Brindisi
孟买,N+City/Bombay,Mumbai
加尔各答,N+City/Calcutta
香港,N+City/Hong Kong
上海,N+City/Shanghai
横滨,N+City/Yokohama
旧金山,N+City/San Francisco
奥马哈,N+City/Omaha
丹佛,N+City/Denver
匹兹堡,N+City/Pittsburgh
芝加哥,N+City/Chicago
纽约,N+City/New York
加的夫,N+City/Cardiff
利物浦,N+City/Liverpool
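For illustration, the simplified DELA-style entries listed above can be split into their parts: the form, the grammatical/semantic codes, and the gloss. The parser below is ours and only handles the simplified format used here, not the full DELA syntax:

```python
def parse_dela(line):
    """Parse the simplified DELA-style entries used above:
    form,Category+Code+.../gloss  (the '/gloss' part is optional)."""
    form, rest = line.split(",", 1)
    info, _, gloss = rest.partition("/")
    codes = info.split("+")
    return {"form": form, "codes": codes, "gloss": gloss or None}

entry = parse_dela("伦敦,N+City/London")
print(entry)
# {'form': '伦敦', 'codes': ['N', 'City'], 'gloss': 'London'}
```

Keeping the entries machine-parsable in this way is what later allows lexical masks such as <N+City> to match against the text.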
But as mentioned in the section on the characteristics of Chinese, 'word' is a vague
concept. Combinations of Chinese characters yield numerous words and phrases,
whose number keeps growing as time goes on, and there is no commonly
acknowledged lexicon to apply. Building Chinese dictionaries is going to be hard
work.
Searching using regular expressions
We can use regular expressions in Unitex in order to search for simple patterns.
A regular expression can be:[10]
• a token (book) or a lexical mask (<smoke.V>);
• the concatenation of two regular expressions (he smokes);
• the union of two regular expressions (Pierre+Paul);
• the Kleene star of a regular expression (bye*).
In our project, only graphs are used for pattern searching.
Searching using graphs
Unitex can handle several types of graphs that correspond to the following uses:
automatic inflection of dictionaries, preprocessing of texts, normalization of text
automata, dictionary graphs, search for patterns, disambiguation and automatic graph
generation. [10]
We will elaborate on the use of graphs in the next section.
Unitex can be viewed as a tool in which one can put linguistic resources and use them.
Its technical characteristics are its portability, modularity, the possibility of dealing
with languages that use special writing systems (e.g. many Asian languages), and its
openness. Its linguistic characteristics are the ones that have motivated the elaboration
of these resources: precision, completeness, and the taking into account of frozen
expressions, most notably those which concern the enumeration of compound
words.[10]
5. Application: Information Extraction from Chinese-English Bitext
Graphs are a powerful mechanism: they make abstract expressions and patterns
vivid and comprehensible, and building and editing graphs in Unitex is simple and
easy. By applying a graph to the text, we can extract the information we want, as
long as we accurately describe the pattern in the graph.
Graph Building for the Chinese Text
Taking the expression of time in Chinese as an example, here is an illustration of
information extraction from the Chinese bitext using Unitex.
In Chinese there are several ways to express time. Times on the hour are all
expressed in the same form:
hour + 点/时 (o'clock) + 正 (optional).
For instance, 12:00 can be expressed as 十二点 or 十二点正; using Arabic numerals, it
would be 12 点(正). The same holds for any other time on the hour, from 0 to 24. To
express that pattern, a graph like this is appropriate:
Figure 6. Graph for the time on the hour.
Figure 7. Subgraph chour
The labels in grey, :hour and :chour, are subgraphs that find the patterns of Arabic
numerals from 0 to 24 and of the same numbers written in Chinese characters,
respectively. Using this graph, we can capture all the text that has a number from 0
to 24, Arabic or Chinese, followed by the Chinese character 点 or 时 and an optional
正. However, the pattern number + 点 does not always denote a time: it can also
appear in the scoring of games such as poker, because 点, while carrying the
meaning of "o'clock", also means "point(s)". Apart from this, in Chinese 一点 can
mean "one o'clock" or "a little", and the latter is very commonly used. The system
cannot differentiate between these two. In a word, this pattern search can
sometimes return more than we are looking for.
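The on-the-hour graph can be mirrored, for illustration, by a regular expression; the patterns below are our own simplified rendering of the :hour and :chour subgraphs, and the example sentence shows how the "points" reading of 点 is also captured:

```python
import re

# 0-24 in Arabic digits, or the same numbers in Chinese characters,
# followed by 点 or 时 and an optional 正.
HOUR = r"(?:2[0-4]|1?[0-9])"
CHOUR = r"(?:二十[一二三四]?|十[一二三四五六七八九]?|[〇零一二三四五六七八九])"
ON_THE_HOUR = re.compile(rf"(?:{HOUR}|{CHOUR})\s*[点时]正?")

text = "火车 12 点正到达,他十一点离开,比赛得了三点。"
print(ON_THE_HOUR.findall(text))
# ['12 点正', '十一点', '三点']
```

The third match, 三点 ("three points" in a game score), is exactly the kind of overgeneration described above: the pattern is correct, but the context is not a time.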
Every hour on the hour is just one special case of ‘time’, in addition, here are other
ways in which time is likely to be told.
12:05
十二点(零)五分, 12 点(0)5 分, twelve O five, and this ‘O’ is optional.
十二点过五分, five past twelve.
12:10
十二点十分, 12 点 10 分, twelve ten. (As in English, this is the most common way of
telling time; any time can be expressed this way.)
十二点过十分, ten past twelve.
12:15
十二点(过)十五分, 12 点(过)15 分 Twelve fifteen. Fifteen past twelve.
十二点一刻, 12 点 1 刻, 12 点一刻, a quarter past twelve.
12:30
十二点(过)三十分, 12 点(过)30 分. Twelve thirty. Thirty minutes past twelve.
十二点半, 12 点半, half past twelve.
While 过 means "past" and is often used for the first 30 minutes within an hour, for
the minutes past 30 we use 差, which means "lack".
12:45
十二点四十五分, 12 点 45 分, twelve forty-five.
十二点三刻, 12 点 3 刻, three quarters past twelve, (very rarely used).
一点差一刻, 1 点差 1 刻, a quarter to one.
一点差十五分, 1 点差 15 分, fifteen minutes to one.
This almost wraps up all the ways of telling time in modern Chinese. Ancient
Chinese time expressions are a totally different system, rarely used or seen except
in ancient literature; unfortunately, we have to bypass them to avoid the extra
complexity.
The following is what the graph containing all the time expression patterns looks
like.
Figure 8. Graph for the modern Chinese Time Expression
During graph building, you can define the elements precisely, to try to avoid
capturing outliers, or you can make a rough definition, knowing that the chances of
catching anything other than what you want are small. In defining the hours and
minutes of a time, a rough definition could be [0,99] + 点 (hour) + [0,99] + 分
(minute). You can still capture all the times in this pattern in the text, even if with
the possibility of catching something like 34 点 67 分, which makes no sense under
normal circumstances and thus hardly ever occurs. And defining a range between 0
and 99 is much easier than separately defining ranges from 0 to 24 and from 0 to
59. Nevertheless, the creed of science is to be as precise as possible within the
complexity we can handle.
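This trade-off can be seen in regular-expression form as well; both patterns below are ours, written only to illustrate the rough-versus-precise choice:

```python
import re

# Rough: any 1-2 digit hour and minute; simple, but admits 34点67分.
rough = re.compile(r"[0-9]{1,2}点[0-9]{1,2}分")
# Precise: hour restricted to 0-24, minute to 0-59.
precise = re.compile(r"(?:2[0-4]|1?[0-9])点(?:[1-5]?[0-9])分")

for t in ["12点45分", "34点67分"]:
    print(t, bool(rough.fullmatch(t)), bool(precise.fullmatch(t)))
# 12点45分 True True
# 34点67分 True False
```

The rough pattern is shorter and easier to maintain; the precise one rejects the nonsense case at the cost of a more complicated definition.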
Once the graph is drawn, searching for the pattern becomes easy.
Pattern Searching using Graphs
Load the text that we want to exploit.
Figure 9. The first chapter, in Chinese, of the novel
Around the World in Eighty Days
Go to Text -> Locate Pattern, and the following window appears:
Figure 10. Locate Pattern
Set the path to the location of the graph on the computer, and click "SEARCH".
Unitex will quickly show the result of the pattern matching. We then build the
concordance and find the text matching the pattern, as shown in Figure 11.
Figure 11. Concordance of the pattern
The extracted strings are all technically correct. But from the context we can tell
that the first two are not the 'time' we were searching for: as explained earlier, 一点
can mean "one o'clock" or "a little", and in this context both occurrences mean "a
little".
Graph Building for the English text
A corresponding English time graph was planned, but it turned out to be a poor
example: too many outliers fall into the time pattern, and there is no easy way to
eliminate them, so another example may need to be chosen.
Building a graph for searching the English text is more or less the same. Take the …
for example. Then the Chinese and English results can be compared with each
other.
6. About NooJ
NooJ Technology
NooJ evolved from INTEX, an earlier linguistic development environment written in
C++. NooJ is based on the .NET platform, written in C# in the Visual Studio
environment, using all the benefits of the "component programming" methodology
as well as free automatic memory management. [11]
Both NooJ and INTEX are based on Maurice Gross's concepts of electronic
dictionaries, lexicon-grammars and finite-state transducers.
A finite-state transducer (FST) is a graph that represents a set of text sequences
and associates each recognized sequence with some analysis result. The text
sequences are described in the input part of the FST; the corresponding results are
described in the output part. Typically, a syntactic FST represents word sequences
and produces linguistic information (such as their phrasal structure), while a
morphological FST represents sequences of letters that spell a word form and
produces lexical information (such as a part of speech and a set of morphological,
syntactic and semantic codes). [12]
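The input/output idea can be sketched with a toy transducer; the states, transitions and outputs below are our own invented example, far simpler than NooJ's actual FSTs:

```python
# A toy finite-state transducer: transitions labelled with an input
# token, each emitting part of the output when it is taken.
# Recognizes the word sequence "twelve o'clock" and outputs "12:00".
TRANSITIONS = {
    ("q0", "twelve"): ("q1", "12"),
    ("q1", "o'clock"): ("qf", ":00"),
}
FINAL = {"qf"}

def run_fst(tokens):
    """Run the transducer; return the concatenated output if the
    token sequence is accepted, else None."""
    state, out = "q0", []
    for tok in tokens:
        key = (state, tok)
        if key not in TRANSITIONS:
            return None
        state, emit = TRANSITIONS[key]
        out.append(emit)
    return "".join(out) if state in FINAL else None

print(run_fst(["twelve", "o'clock"]))   # 12:00
print(run_fst(["twelve", "points"]))    # None
```

The input part is the transition labels; the output part is what each transition emits, which is exactly the mechanism behind NooJ's annotated matches.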
Linguistic Resources
There are two types of linguistic resources used in NooJ:
• Dictionaries
Dictionaries (.dic files) usually associate words or expressions with a set of
information, such as a category (e.g. "Verb"), one or more inflectional and/or
derivational paradigms, and one or more semantic properties. [12] All the
dictionaries built for Unitex can easily be transformed for use in NooJ.
• Grammars
Grammars are used to represent a large gamut of linguistic phenomena, from the
orthographical and the morphological levels, up to the syntagmatic and
transformational syntactic levels.[12] NooJ morphological and syntactic grammars are
structured libraries of graphs.
On the NooJ website there are a dozen modules available for download, covering
some Romance, Germanic, Slavic, Semitic and Asian languages, as well as
Hungarian. The available Chinese module is based on Traditional Chinese, which is
mainly used in Taiwan, Hong Kong and some other Asian regions. The module
includes:
• A set of dictionaries for the literary text Dream of the Red Chamber:
DRC-DIC.nod: covers the general vocabulary of the text
DRC-GEO.nod: list of all the location names of the text
DRC-PROPERNAMES.nod: list of all the characters’ names of the text
• A set of dictionaries for modern Chinese:
DIC.nod: general vocabulary
GEO.nod: a list of location names
PROPERNAMES.nod: a list of famous characters in Chinese history and literature
LASTNAMES.nod: a list of Chinese last names
PROVERBS.nod: a list of Chinese proverbs, idioms, etc.
BOOKS.nod: a series of books’ titles
However, Simplified Chinese text can work well in this module too, as long as there
are simplified Chinese dictionaries and morphological grammar graphs to support
the processing of the texts. There is no big difference between Traditional and
Simplified Chinese, except that some of their corresponding characters are written
differently; for example, the simplified character 这 (meaning "this") corresponds to
the traditional character 這.
7. Working with NooJ
After compiling the dictionaries and building the morphological grammar graphs
which are more or less the same as in Unitex, we can use them to explore the text.
Load a text.
Figure. The text loaded in NooJ.
After the text is loaded, we can right-click and choose to run a "linguistic analysis"
on the text, and then choose "locate pattern". There are four ways to locate a
pattern. Set the regular expression to <N+Country> to find the country names in
the text.
Figure Locate a Pattern
And the result is:
Figure The result of a regular expression pattern locating.
Grammar graphs are another way of locating patterns.
A graph built to express the time expressions in Chinese:
…………..
References
1. Oren Etzioni, Michele Banko, Stephen Soderland, and Daniel S. Weld. Open
Information Extraction from the Web. Communications of the ACM, December
2008, vol. 51, no. 12, pp. 68-74.
2. Nancy A. Chinchor. Overview of MUC-7/MET-2.
3. Eric Laporte, Dusko Vitas, Cvetana Krstev. Preparation and Exploitation of
Bilingual Texts.
4. Bonhomme, P., Romary, L. Parallel alignment of structured documents. Parallel
Text Processing, Jean Véronis (Ed.) (2000) 233-253.
5. Project-Team L&D. Activity Report INRIA. 2004.
6. Bonhomme, P., Romary, L. (1995): The Lingua Parallel Concordancing Project:
Managing Multilingual Texts for Educational Purpose.
7. Cate Coburn. Spotlight on Chinese, About the Chinese Language. Center for
Applied Linguistics. http://www.cal.org/resources/discoverlanguages/chinese/index.html
8. Shiren Ye, Tat-Seng Chua, Jimin Liu. An Agent-based Approach to Chinese
Named Entity Recognition. Proceedings of the 19th International Conference on
Computational Linguistics (COLING 2002), Volume 1, pp. 1-7. 2002.
9. Jian Zhang, Jianfeng Gao, Ming Zhou. Extraction of Chinese Compound Words:
An Experimental Study on a Very Large Corpus. 2000.
10. Sébastien Paumier, et al. Unitex 2.1 User Manual.
11. http://www.nooj4nlp.net
12. Max Silberztein. NooJ Manual. 2002.