A New Aspect of Sentence Boundary Detection Method for Turkish
Özlem AKTAŞ
Dokuz Eylül University
Computer Engineering Department
ozlem@cs.deu.edu.tr
ABSTRACT
“Natural Language Processing” (NLP) is a research area
that serves many different purposes and continues to grow
in popularity. Speech synthesis, speech recognition,
machine translation, and spelling correction are some of
the applications of NLP. To determine the morphological
properties of a language, a corpus that represents the
language must be built, and statistical and morphological
analyses must be carried out on it. The first step in
building such a corpus is sentence boundary detection.
This process is complicated and hard to solve, but it is
the most important part of corpus generation. In this
work, a new method is developed to solve the sentence
boundary detection problem. The abbreviation list and the
rules generated for sentence boundary detection are stored
in XML files; these files produced successful results in
sentence boundary detection. The new method will help
researchers by separating sentences correctly and
efficiently in terms of time and other costs.
ÖZET
“Doğal Dil İşleme” (DDİ) çok farklı amaçlarda kullanılan
bir araştırma alanıdır ve günümüzde hızla
yaygınlaşmaktadır. Konuşma analizi, konuşma tanımlama,
imla doğrulama DDİ uygulama alanlarından sadece
birkaçıdır. Bir dilin biçimbilimsel özelliklerinin
belirlenebilmesi için o dili anlatan ve üzerinde
istatistiksel ve biçimbilimsel analizlerin kolayca
yapılabildiği bir derlem oluşturulması gereklidir. Böyle
bir derlem oluşturmanın ilk aşaması cümle sonu belirleme
işlemidir. Bu işlem oldukça karışık ve çözülmesi zor bir
işlemdir, ancak derlem oluşturmanın en önemli aşamasıdır.
Bu çalışmada cümle sonu belirleme problemini çözmek için
yeni bir yöntem geliştirilmiştir. Cümle sonu belirleme
işlemi için kullanılan kısaltma ve kural listeleri XML
yapısında kaydedilmiştir; bu dosyalar cümle sonu belirleme
işleminde başarı oranının artmasını sağlamıştır. Bu yeni
yöntem cümlelerin doğru ve verimli biçimde
ayrıştırılmalarını sağlayarak, araştırmacılara yardım
edecektir.
Keywords: Natural language processing, Turkish corpus,
morphological analysis, sentence boundary detection.
1. INTRODUCTION
“Natural language” is the language that humans use
naturally. Since the 1940s, researchers have worked on
determining the morphological properties of natural
languages. Shannon investigated English and published his
first research on the irregularity and predictability of
English in 1948 [1]. Zipf proposed a law that applies to
a wide range of statistical distributions [2]. Because
computer technology had not yet developed in the 1940s
and 1950s, not enough data could be collected and
processed. As computer technology developed rapidly, more
data was collected, and new technologies were developed
using Shannon’s and Zipf’s research.
Determining the structure of a natural language helps in
data encryption, speech recognition [3], optical
character recognition [4], spelling correction [5], etc.
Predicting the next word from the previously written word
is also a very important process, especially for
communicating with people with disabilities. In doing so,
it must be kept in mind that the number of predictable
words may be excessive. Therefore, only the words with
the highest probability of being added to the text are
suggested to the user, based on the previously written
word [6].
“Natural Language Processing” (NLP) is a research area
that serves many different purposes and continues to grow
in popularity. In this area, computers are used to
process natural language, both in academic research and
for commercial purposes. NLP can be defined as the
construction of a computing system that processes and
understands natural language. The word “understand” in
this definition can be clarified as follows: the
observable behaviour of the system must make us assume
that it is doing internally the same, or very similar,
things that we do when we understand language [7].
In NLP, two kinds of analysis are used to generate and
use a corpus: morphological and statistical analysis [8].
Morphological analysis is the investigation of the
morphological status of words, such as determining
sentence boundaries, identifying word types (verb, noun,
adjective, etc.), and analyzing the parts of words (root,
suffix or prefix). Statistical analysis can be carried
out at two levels: letters and words. Letter analysis
covers consonant and vowel placement, letter n-gram
frequencies, and relationships between letters such as
their positions relative to each other. Word analysis
covers the number of letters in a word, the order of the
letters in a word, word n-gram frequencies, and word
order in a sentence. Several definitions of a corpus have
been proposed:
• A corpus is a collection of linguistic data, either
  written texts or a transcription of recorded speech,
  which can be used as a starting-point of linguistic
  description or as a means of verifying hypotheses about
  a language [9].
• A collection of naturally occurring language text,
  chosen to characterize a state or variety of a
  language [10].
• A special database that is created from texts, is used
  in the Natural Language Processing area, and allows
  specialized processes such as finding and separating
  words quickly.
For many natural language processing tasks, identifying
sentence boundaries is one of the most important
prerequisites. Many available natural language processing
tools do not detect sentence boundaries reliably.
The first step in generating a corpus is “finding
sentences”. Although Turkish sentences generally end with
known punctuation marks such as “.”, “…”, “!” and “?”,
finding the end of a sentence is complex because these
marks are ambiguous. For example:
• Uluslar, bu ekonomik buhran sonucunda 2. Dünya
  Savaşı’nı yaşamıştır. (Nations experienced World War II
  as a result of this economic depression.)
• Bu sezon kaybedilen maç sayısı 2. Dünya Kupası’na
  katılma şansı azalıyor. (The number of matches lost
  this season is 2. The chance of qualifying for the
  World Cup is decreasing.)
In the first sentence, the “.” character marks an ordinal
number, but in the second sentence it indicates the end
of a sentence. In both cases the word after “.” is the
same and begins with an uppercase letter, so the end of
the sentence is ambiguous.
Using a list of end-of-sentence punctuation marks (e.g.
“.”, “!”) is sufficient to find the end of most
sentences. However, a period can also be used in an
abbreviation, as a decimal point, in e-mail addresses,
etc. Some examples are shown below:
• She comes here by 5 p.m. on Saturday evening.
• www.cs.deu.edu.tr is our school’s web site.
• My e-mail address is ozlem@cs.deu.edu.tr.
These are some of the ambiguities that appear in English
when finding sentence boundaries. Like all other
languages, Turkish has such ambiguities, which make
determining sentence boundaries harder. In this study, a
method for finding sentence boundaries in Turkish is
developed and explained.
2. SENTENCE BOUNDARY DETECTION
METHOD FOR TURKISH
The newly developed sentence boundary detection algorithm
works according to the following schema:
[Schema: the input texts, the abbreviation list and the
rule list are fed to the algorithm, which produces the
output XML file]
First, the rule list is created and stored in XML format
to find the end of a sentence.
Table 1 The rule list for sentence boundary detection
End of Sentence   Rule
True              L.U
True              L.#
True              ?.'
True              ?."
True              ?.(
True              ?.)
True              ?.-
True              ?./
True              ?./
True              U.L
False             L.L
False             ?.,
False             #.L
False             #.'
False             #."
False             #.(
False             #.)
False             #.-
False             #.,
False             #.#
False             #.U
Each rule in the XML file is a triple (e.g. “L.L”), as
shown in Table 1. The dot in the middle of the triple
stands for the end-of-sentence punctuation mark. The left
character describes the first character of the word
before the punctuation mark, and the right character
describes the first character of the word after the
punctuation mark. The meanings of the characters are
shown in the following table.
Table 2 The meanings of the characters in the sentence
boundary rule list
Character   Meaning
.           EOS punctuations (. … ! ?)
L           Lowercase
U           Uppercase
#           Number
?           Any
-           - character
,           ,
(           (
)           )
/           /
‘           ‘
“           “
These rules are intended to make finding the end of a
sentence easier. While the rules were being created,
however, some difficulties arose from the particular
properties of the Turkish language, and solutions to
these difficulties were sought.
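The rule triples can be illustrated with a minimal Python sketch. It is not the original .NET implementation: the names `RULES`, `char_class` and `is_sentence_end` are hypothetical, the rules are held in a dictionary rather than read from the XML rule file (whose layout is not shown here), and it assumes that an exact rule takes precedence over a “?” (any) rule, which the text does not state.

```python
from typing import Optional

# Triples from Table 1, keyed by (class before the mark, class after it).
RULES = {
    ("L", "U"): True, ("L", "#"): True, ("U", "L"): True,
    ("?", "'"): True, ("?", '"'): True, ("?", "("): True,
    ("?", ")"): True, ("?", "-"): True, ("?", "/"): True,
    ("L", "L"): False, ("?", ","): False, ("#", "L"): False,
    ("#", "'"): False, ("#", '"'): False, ("#", "("): False,
    ("#", ")"): False, ("#", "-"): False, ("#", ","): False,
    ("#", "#"): False, ("#", "U"): False,
}


def char_class(ch: str) -> str:
    """Map a character to its Table 2 symbol; other characters map to themselves."""
    if ch.isdigit():
        return "#"
    if ch.isupper():
        return "U"
    if ch.islower():
        return "L"
    return ch


def is_sentence_end(prev_word: str, next_word: str) -> Optional[bool]:
    """Return the rule result, or None when no rule matches."""
    left, right = char_class(prev_word[0]), char_class(next_word[0])
    # Assumption: try the exact triple first, then the "?" (any) wildcard.
    for key in ((left, right), ("?", right)):
        if key in RULES:
            return RULES[key]
    return None
```

Under these assumptions, `is_sentence_end("2", "Dünya")` matches `#.U` and returns `False`, while a pair such as `("IV", "Murat")` matches no rule and returns `None`, which is the kind of case the abbreviation list is meant to cover.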
Table 3 Example of abbreviation list in XML file
<abbrevations>
<abbr>A</abbr>
<abbr>AA</abbr>
<abbr>AAFSE</abbr>
<abbr>AAM</abbr>
<abbr>AB</abbr>
<abbr>ABD</abbr>
<abbr>ABS</abbr>
<abbr>ADSL</abbr>
<abbr>AET</abbr>
……
<abbr>HAVAŞ</abbr>
<abbr>HDD</abbr>
<abbr>hek</abbr>
……
<abbr>zf</abbr>
<abbr>zm</abbr>
<abbr>ZMO</abbr>
<abbr>zool</abbr>
</abbrevations>
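A file of this shape could be loaded as in the following Python sketch, which uses the standard `xml.etree.ElementTree` module instead of the .NET classes used in this work. The helper names `load_abbreviations` and `is_abbreviation` are hypothetical, the sample holds only a few entries from Table 3, and case-sensitive matching is an assumption based on the list containing both lowercase and uppercase entries.

```python
import xml.etree.ElementTree as ET

# A few entries from Table 3; the tag names follow the file shown above.
SAMPLE = """<abbrevations>
<abbr>A</abbr>
<abbr>ABD</abbr>
<abbr>hek</abbr>
<abbr>zool</abbr>
</abbrevations>"""


def load_abbreviations(xml_text: str) -> set:
    """Collect the text of every <abbr> element into a set."""
    root = ET.fromstring(xml_text)
    return {node.text.strip() for node in root.iter("abbr") if node.text}


def is_abbreviation(word: str, abbreviations: set) -> bool:
    """Check a word, with or without its trailing period (case-sensitive)."""
    return word.rstrip(".") in abbreviations


ABBREVIATIONS = load_abbreviations(SAMPLE)
```

Keeping the list in a set makes each lookup a constant-time operation, which matters when every word followed by a period has to be checked.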
By using the abbreviation and rule lists, the texts were
split into sentences, and the output was again written in
XML format, as shown in the following table.
Table 4 Example of sentences in XML file
File Name         Par. No   Sen. No   Word No   Word
ID_396984_M.txt   0         0         0         Sigara
ID_396984_M.txt   0         0         1         kullanımının
ID_396984_M.txt   0         0         2         azalması
ID_396984_M.txt   0         0         3         konusunda
ID_396984_M.txt   0         0         …         …
ID_396984_M.txt   0         0         23        başvurular
ID_396984_M.txt   0         0         24        oldu
ID_396984_M.txt   1         0         0         Prof. Dr.
ID_396984_M.txt   1         0         1         Tuncer
ID_396984_M.txt   1         0         …         …
ID_396984_M.txt   1         0         25        bildirdi
ID_396984_M.txt   1         1         0         Tuncer
ID_396984_M.txt   1         1         1         aksi
ID_396984_M.txt   1         1         2         taktirde
ID_396984_M.txt   1         1         …         …
ID_396984_M.txt   1         1         17        ifade
ID_396984_M.txt   1         1         18        etti
…                 …         …         …         …
Some examples of ambiguities in sentence boundary
detection for Turkish sentences are shown below:
• Cumhuriyetimizin 75. yılı coşkuyla kutlandı. (The 75th
  anniversary of our Republic was celebrated with
  enthusiasm.)
• Tahta çıkan IV. Murat emirler yağdırdı. (Murat IV, who
  ascended the throne, issued a stream of orders.)
• Olimpiyatlar için uzun zamandır çalışan Ahmet koşuda 2.
  Uzun atlamada ise ancak 4. olabildi. (Ahmet, who had
  long trained for the Olympics, came 2nd in the race. In
  the long jump he could only come 4th.)
• A. Mehmet YILDIZ size uğradı. (A. Mehmet YILDIZ dropped
  by to see you.)
• Alfabenin ilk harfi A. Mehmet’e bunu öğretmeniz
  gerekiyor. (The first letter of the alphabet is A. You
  need to teach this to Mehmet.)
In the first sentence, the “.” punctuation mark does not
end the sentence.
For abbreviations that cause ambiguity in sentences, an
XML file was created, and the abbreviation list was
stored in this file using the <abbr> tag, as shown in
Table 3.
Roman numerals were added to this file to remove
ambiguities such as “IV. Murat”. The abbreviation and
rule lists were written into two files in a standard
format, separate from the main program, so that users can
change these files easily and independently of the
program.
The newly developed algorithm works as shown in the
following figure.
Parse the abbreviation and rule XML files by using the XML
Document class of Visual Studio .NET. Insert the
abbreviation list into an array called “Abbrevations” and
the rules into a class called “Rule”, in which each rule is
stored in triple format together with its end-of-sentence
result (e.g. L.U = true, …).
• Get the abbreviation list file name and parse the
  Abbrevations nodes.
• Get the rule list file name and parse the rules.
• Get the name of the file to parse and open the file.
• Read a paragraph according to the “Carriage Return”
  character (Paragraph = read one line).
• Check for end of file: if Paragraph = NULL, the end of
  the file is found; write the result into an XML file.
• Otherwise, read one character. If Char = NULL, the end
  of the paragraph is found; write the result into an XML
  file and read another paragraph.
• If Char is a letter or number, add the character to the
  word.
• Otherwise, check whether Word = Abbreviation and add the
  word into the word list.
• Check whether any rule in the rule list is matched; at
  an End of Sentence, add the sentence into the sentence
  list.
Figure 1 Flowchart of Algorithm
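One loose reading of this flow can be sketched in Python as follows. The sketch is an illustration under stated assumptions, not the original program: it works on whitespace-separated tokens instead of reading character by character, it uses only a few triples from Table 1, it treats a pair that matches no rule as not ending the sentence, and the names `split_sentences`, `char_class` and `RULES` are hypothetical.

```python
EOS_MARKS = ".…!?"

# A few triples from Table 1, keyed by (class before the mark, class after it).
RULES = {("L", "U"): True, ("L", "#"): True, ("L", "L"): False,
         ("#", "L"): False, ("#", "U"): False, ("#", "#"): False}


def char_class(ch):
    """Map a character to its Table 2 symbol."""
    if ch.isdigit():
        return "#"
    if ch.isupper():
        return "U"
    if ch.islower():
        return "L"
    return ch


def split_sentences(text, abbreviations):
    """Return one list of sentences per non-empty line (paragraph) of text."""
    result = []
    for paragraph in text.splitlines():
        tokens = paragraph.split()
        if not tokens:
            continue
        sentences, current = [], []
        for i, token in enumerate(tokens):
            current.append(token)
            if token[-1] not in EOS_MARKS:
                continue  # no candidate boundary after this token
            word = token.rstrip(EOS_MARKS)
            if i + 1 == len(tokens):
                boundary = True  # end of paragraph closes the sentence
            elif not word or word in abbreviations:
                boundary = False  # abbreviation: the period is part of the word
            else:
                key = (char_class(word[0]), char_class(tokens[i + 1][0]))
                boundary = RULES.get(key, False)  # assumption: no rule, no boundary
            if boundary:
                sentences.append(" ".join(current))
                current = []
        if current:
            sentences.append(" ".join(current))
        result.append(sentences)
    return result
```

With the abbreviation set `{"IV", "Prof", "Dr"}`, the text “Tahta çıkan IV. Murat emirler yağdırdı. Cumhuriyetimizin 75. yılı coşkuyla kutlandı.” is split after “yağdırdı.” only: “IV.” is skipped as an abbreviation and “75. yılı” matches the `#.L` rule.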
3. RESULTS
The following passage is taken from a news text:
“Uluslararası Para Fon'u (IMF) heyeti,
çalışmalarını Hazine Müsteşarlığı'nda
gruplar halinde sürdürdü. Edinilen
bilgiye göre, kısmen Türkiye masası
şefi Rıza Moghadam ve Hazine
Müsteşarı İbrahim Çanakçı'nın da
katıldığı toplantılara, Maliye Bakanlığı,
Merkez Bankası, BDDK, Özelleştirme
İdaresi Başkanlığı, Kamu Bankaları gibi
kuruluşların yetkilileri de katıldı.
Bugünkü görüşmelerde, ödemeler
dengesi, savunma sanayii, uluslararası
rezervler, yerel yönetimler, yatırım
programı, kamu mali yönetimi ve
kontrol kanunu konuları tartışıldı.”
After the parsing algorithm is run, the result can be
produced in two formats. In the first format, only the
paragraphs and the sentences in them are stored in the
XML file, as shown below.
<?xml version="1.0" encoding="UTF-8"
standalone="yes" ?>
<File OriginalName="MD_ID_396980_M.txt">
<Paragraph Index="0">
<Sentence Index="0">Uluslararası Para
Fon’u (IMF) heyeti, çalışmalarını Hazine
Müsteşarlığı’nda gruplar halinde
sürdürdü.</Sentence>
<Sentence Index="1">Edinilen bilgiye
göre, kısmen Türkiye masası şefi Rıza
Moghadam ve Hazine Müsteşarı İbrahim
Çanakçı’nın da katıldığı toplantılara, Maliye
Bakanlığı, Merkez Bankası, BDDK, Özelleştirme
İdaresi Başkanlığı, Kamu Bankaları gibi
kuruluşların yetkilileri de katıldı.</Sentence>
</Paragraph>
<Paragraph Index="1">
<Sentence Index="0">Bugünkü
görüşmelerde, ödemeler dengesi, savunma
sanayii, uluslararası rezervler, yerel yönetimler,
yatırım programı, kamu mali yönetimi ve kontrol
kanunu konuları tartışıldı.</Sentence>
</Paragraph>
</File>
The other format is that paragraphs, sentences and words
are stored in XML file as showed in the following.
<?xml version="1.0" encoding="UTF-8"
standalone="yes" ?>
<File OriginalName="MD_ID_396980_M.txt">
<Paragraph Index="0">
<Sentence Index="0">
<Word Index="0">Uluslararası</Word>
<Word Index="1">Para</Word>
<Word Index="2">Fon’u</Word>
<Word Index="4">IMF</Word>
<Word Index="5">heyeti</Word>
<Word Index="6">çalışmalarını</Word>
<Word Index="7">Hazine</Word>
<Word Index="8">Müsteşarlığı’nda</Word>
<Word Index="9">gruplar</Word>
<Word Index="10">halinde</Word>
<Word Index="11">sürdürdü</Word>
</Sentence>
<Sentence Index="1">
<Word Index="0">Edinilen</Word>
<Word Index="1">bilgiye</Word>
<Word Index="2">göre</Word>
<Word Index="3">kısmen</Word>
<Word Index="4">Türkiye</Word>
<Word Index="5">masası</Word>
<Word Index="6">şefi</Word>
<Word Index="7">Rıza</Word>
<Word Index="8">Moghadam</Word>
<Word Index="9">ve</Word>
<Word Index="10">Hazine</Word>
<Word Index="11">Müsteşarı</Word>
<Word Index="12">İbrahim</Word>
<Word Index="13">Çanakçı’nın</Word>
<Word Index="14">da</Word>
<Word Index="15">katıldığı</Word>
<Word Index="16">toplantılara</Word>
<Word Index="17">Maliye</Word>
<Word Index="18">Bakanlığı</Word>
<Word Index="19">Merkez</Word>
<Word Index="20">Bankası</Word>
<Word Index="21">BDDK</Word>
<Word Index="22">Özelleştirme</Word>
<Word Index="23">İdaresi</Word>
<Word Index="24">Başkanlığı</Word>
<Word Index="25">Kamu</Word>
<Word Index="26">Bankaları</Word>
<Word Index="27">gibi</Word>
<Word Index="28">kuruluşların</Word>
<Word Index="29">yetkilileri</Word>
<Word Index="30">de</Word>
<Word Index="31">katıldı</Word>
</Sentence>
</Paragraph>
…
</File>
The results are produced in two formats so that
researchers can use them easily and flexibly for
different purposes.
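Both output formats can be produced from the same data, as in the following Python sketch based on the standard `xml.etree.ElementTree` module. The function name `build_output` and the `with_words` flag are hypothetical, and stripping surrounding punctuation from each word is an assumption inferred from the word-level example above, where words appear without commas, periods or parentheses.

```python
import xml.etree.ElementTree as ET

PUNCTUATION = ".,!?…()"


def build_output(file_name, paragraphs, with_words=False):
    """Build a <File> tree from a list of paragraphs, each a list of sentences."""
    root = ET.Element("File", OriginalName=file_name)
    for p_index, sentences in enumerate(paragraphs):
        paragraph = ET.SubElement(root, "Paragraph", Index=str(p_index))
        for s_index, sentence in enumerate(sentences):
            node = ET.SubElement(paragraph, "Sentence", Index=str(s_index))
            if not with_words:
                node.text = sentence  # first format: sentence text only
                continue
            # Second format: one <Word> element per word.
            words = [w.strip(PUNCTUATION) for w in sentence.split()]
            for w_index, word in enumerate(w for w in words if w):
                ET.SubElement(node, "Word", Index=str(w_index)).text = word
    return root


def to_xml(root):
    """Serialize the tree to a string."""
    return ET.tostring(root, encoding="unicode")
```

Calling `build_output(name, paragraphs)` yields the paragraph and sentence format, and `build_output(name, paragraphs, with_words=True)` yields the word-level format; `to_xml` turns either tree into text that can be written to a file.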
4. CONCLUSION AND FUTURE WORKS
The newly introduced method finds the ends of Turkish
sentences correctly and efficiently with the
pre-determined rule list, provided that the sentences are
written formally and contain no spelling errors.
This method is intended to be a reference for researchers
who study sentence boundary detection methods. Some
ambiguities, such as abbreviations and Roman numerals,
are resolved by this work. The ambiguities that cannot be
resolved by this method may be resolved in future work by
using machine learning and statistical analyses for
Turkish. Word types, roots and suffixes of words can be
added to this structure easily because of its
readability, flexibility and understandability.
REFERENCES
[1] Shannon, C.E. (1948). A Mathematical Theory of
Communication. The Bell System Technical Journal,
27, pp. 379-423, 623-656.
[2] Choi, S.W. (2000). Some Statistical Properties
and Zipf’s Law in Korean Text Corpus.
Journal of Quantitative Linguistics, 7:1, pp. 19-30.
[3] Nadas, A. (1984). Estimation of probabilities in the
language model of the IBM speech recognition
system. IEEE Transactions on Acoustics, Speech,
and Signal Processing, 32:4, pp. 859-861.
[4] Kukich, K. (1992). Techniques for automatically
correcting words in text. ACM Computing Surveys,
24:4, pp. 377-439.
[5] Church, K. & Gale, W. (1991). Probability Scoring
for Spelling Correction. Statistics and Computing,
pp. 93-103.
[6] Jurafsky, D. & Martin, J.H. (2000). Speech and
Language Processing. Prentice Hall, pp. 193-199.
[7] Güngördü, Z. (1993). A lexical-functional grammar
for Turkish. MSc Thesis. Computer Engineering
Department, Bilkent University, Ankara.
[8] Shannon, C.E. (1951). Prediction and Entropy of
Printed English. The Bell System Technical Journal,
30:1, pp. 50-64.
[9] Crystal, D. (1991). A Dictionary of Linguistics and
Phonetics. Blackwell, 3rd Edition.
[10] Sinclair, J. (1991). Corpus, Concordance,
Collocation. OUP.