A New Aspect of Sentence Boundary Detection Method for Turkish

Özlem AKTAŞ
Dokuz Eylül University, Computer Engineering Department
ozlem@cs.deu.edu.tr

ABSTRACT

“Natural Language Processing” (NLP) is a research area that serves many different purposes and is steadily becoming more popular. Speech synthesis, speech recognition, machine translation and spelling correction are some of the applications of NLP. To determine the morphological characteristics of a language, it is necessary to build a corpus that represents the language and to perform statistical and morphological analyses on it. The first step in building such a corpus is sentence boundary detection. This process is complicated and hard to solve, but it is the most important part of building the corpus. In this work, a new method is developed to solve the sentence boundary problem. The abbreviation list and the rules generated for sentence boundary detection are stored in XML files; these files provided successful results in sentence boundary detection. This new method will help researchers by separating sentences correctly and efficiently in terms of time and other costs.

ÖZET

“Doğal Dil İşleme” (DDİ) çok farklı amaçlarda kullanılan bir araştırma alanıdır ve günümüzde hızla yaygınlaşmaktadır. Konuşma analizi, konuşma tanımlama, imla doğrulama DDİ uygulama alanlarından sadece birkaçıdır. Bir dilin biçimbilimsel özelliklerinin belirlenebilmesi için o dili anlatan ve üzerinde istatistiksel ve biçimbilimsel analizlerin kolayca yapılabildiği bir derlem oluşturulması gereklidir. Böyle bir derlem oluşturmanın ilk aşaması cümle sonu belirleme işlemidir. Bu işlem oldukça karışık ve çözülmesi zor bir işlemdir, ancak derlem oluşturmanın en önemli aşamasıdır. Bu çalışmada cümle sonu belirleme problemini çözmek için yeni bir yöntem geliştirilmiştir. Cümle sonu belirleme işlemi için kullanılan kısaltma ve kural listeleri XML yapısında kaydedilmiştir; bu dosyalar cümle sonu belirleme işleminde başarı oranının artmasını sağlamıştır. Bu yeni yöntem cümlelerin doğru ve verimli biçimde ayrıştırılmalarını sağlayarak, araştırmacılara yardım edecektir.

Keywords: Natural language processing, Turkish corpus, morphological analysis, sentence boundary detection.

1. INTRODUCTION

“Natural Language” is the language naturally used by humans. Since the 1940s, researchers have worked on determining the morphological characteristics of natural languages. Shannon investigated English and published his first research on the irregularity and predictability of English in 1948 [1]. Zipf suggested a theorem that can be applied to all statistical distributions [2]. Because computer technology was not yet developed between 1940 and 1950, not enough data had been collected and processed. As computer technology developed rapidly, more data were collected and new technologies were developed using the research of Shannon and Zipf.

Determining the structure of a natural language helps in data encryption, speech recognition [3], optical character recognition [4], spelling correction [5], etc. Predicting the word to be written next from the previously written word is also very important, especially in communication for people with disabilities. In doing so, it must not be forgotten that the number of predictable words may be excessive. Therefore, only the words with the highest probability of being added to the text are suggested to the user according to the previously written word [6].

“Natural Language Processing” (NLP) is a research area that serves many different purposes and is steadily becoming more popular. In this area, computers are used to process natural language; it is used in academic research and for commercial purposes.
NLP can be defined as the construction of a computing system that processes and understands natural language. The word “understand” in this definition can be clarified as follows: the observable behaviour of the system must make us assume that it is doing internally the same, or very similar, things that we do when we understand language [7].

In NLP, two kinds of analysis are used to generate and use a corpus: morphological and statistical analysis [8]. Morphological analysis is the investigation of the morphological status of words, such as determining sentence boundaries, identifying word types (verb, noun, adjective, etc.) and analyzing the parts of words (root, suffix or prefix). Statistical analysis can be done in two ways: on letters and on words. Analyses such as consonant and vowel placement, letter n-gram frequencies and the positions of letters relative to each other are applied to letters and are called Letter Analysis. Analyses such as the number of letters in a word, the order of the letters in a word, word n-gram frequencies and word order in a sentence are applied to words and are called Word Analysis.

There are several definitions of a corpus:

• A corpus is a collection of linguistic data, either written texts or a transcription of recorded speech, which can be used as a starting-point of linguistic description or as a means of verifying hypotheses about a language. [9]
• A collection of naturally occurring language text, chosen to characterize a state or variety of a language. [10]
• A special database that is created from texts, is used in the Natural Language Processing area and allows specialized processes such as finding and separating words quickly.

For many natural language processing tasks, identifying sentence boundaries is one of the most important prerequisites. Many available natural language processing tools do not perform a reliable detection of sentence boundaries. Using a list of end-of-sentence punctuation marks (e.g. “.”, “!”) is sufficient to find the end of a sentence in many cases. However, a period can also be used in an abbreviation, as a decimal point, in an e-mail address, etc. Some examples are shown below:

• She comes here by 5 p.m. on Saturday evening.
• www.cs.deu.edu.tr is our school’s web site.
• My e-mail address is ozlem@cs.deu.edu.tr.

These are some of the ambiguities that appear in English in the sentence boundary detection process. Like all other languages, Turkish has such ambiguities, and this makes determining sentence boundaries harder. In this study, a method for finding sentence boundaries in Turkish is developed and explained.

2. SENTENCE BOUNDARY DETECTION METHOD FOR TURKISH

The first step in generating a corpus is “finding sentences”. Although Turkish sentences generally end with known punctuation marks such as “.”, “…”, “!” and “?”, finding the end of a sentence is complex because of ambiguities. For example:

• Uluslar, bu ekonomik buhran sonucunda 2. Dünya Savaşı’nı yaşamıştır. (As a result of this economic depression, the nations experienced World War II.)
• Bu sezon kaybedilen maç sayısı 2. Dünya Kupası’na katılma şansı azalıyor. (The number of matches lost this season is 2. The chance of qualifying for the World Cup is decreasing.)

In the first sentence, the “.” character marks an ordinal number, but in the second sentence it indicates the end of a sentence. In both cases the “.” is followed by the same word, which begins with an uppercase letter. So there is an ambiguity in the process of finding the end of a sentence.

Some further examples of ambiguities in the sentence boundary detection process for Turkish are shown below:

• Cumhuriyetimizin 75. yılı coşkuyla kutlandı. (The 75th year of our Republic was celebrated with enthusiasm.)
• Tahta çıkan IV. Murat emirler yağdırdı. (Murat IV, who ascended the throne, issued a stream of orders.)
• Olimpiyatlar için uzun zamandır çalışan Ahmet koşuda 2. Uzun atlamada ise ancak 4. olabildi. (Ahmet, who had long been training for the Olympics, came 2nd in the run. In the long jump, however, he could only come 4th.)
• A. Mehmet YILDIZ size uğradı. (A. Mehmet YILDIZ stopped by to see you.)
• Alfabenin ilk harfi A. Mehmet’e bunu öğretmeniz gerekiyor. (The first letter of the alphabet is A. You need to teach this to Mehmet.)

In the first sentence, the “.” punctuation mark does not indicate the end of the sentence.

The newly developed sentence boundary detection algorithm works according to the following schema:

[Schema with the boxes: Input Texts, Abbreviation List, Rule List, Algorithm, Output XML file]

The rule list is created first and stored in XML format in order to find the end of a sentence.

Table 1 The rule list for sentence boundary detection

End of Sentence    Rules
True               L.U
True               L.#
True               ?.'
True               ?."
True               ?.(
True               ?.)
True               ?.-
True               ?./
True               ?./
True               U.L
False              L.L
False              ?.,
False              #.L
False              #.'
False              #."
False              #.(
False              #.)
False              #.-
False              #.,
False              #.#
False              #.U

Each rule in the XML file is written as a group of three characters (e.g. “L.L”), as shown in Table 1. The dot in the middle of the group stands for the end-of-sentence punctuation mark. The left character describes the first character of the word before the punctuation mark, and the right character describes the first character of the word after it. The meanings of the characters are shown in Table 2.

Table 2 The meanings of the characters in the sentence boundary rule list

Character    Meaning
.            EOS punctuations (. … ! ?)
L            Lowercase
U            Uppercase
#            Number
?            Any character
-            -
,            ,
(            (
)            )
/            /
'            '
"            "

These rules are intended to make finding the end of a sentence easier. While the rules were being created, however, some difficulties arose from the characteristics of the Turkish language, and an attempt was made to solve them.

For abbreviations that cause ambiguity in sentences, an XML file was created, and the abbreviation list was placed in this file using the <abbr> tag, as shown in Table 3. Roman numerals were added to this file to remove ambiguities such as “IV. Murat”.

Table 3 Example of abbreviation list in XML file

<abbrevations>
  <abbr>A</abbr>
  <abbr>AA</abbr>
  <abbr>AAFSE</abbr>
  <abbr>AAM</abbr>
  <abbr>AB</abbr>
  <abbr>ABD</abbr>
  <abbr>ABS</abbr>
  <abbr>ADSL</abbr>
  <abbr>AET</abbr>
  ……
  <abbr>HAVAŞ</abbr>
  <abbr>HDD</abbr>
  <abbr>hek</abbr>
  ……
  <abbr>zf</abbr>
  <abbr>zm</abbr>
  <abbr>ZMO</abbr>
  <abbr>zool</abbr>
</abbrevations>

The abbreviation and rule lists were written into two files in a standard format, separate from the main program, so that users can change these files easily and independently of the program.

Using these abbreviation and rule lists, the texts were split into sentences and the output was again written in XML format, as shown in Table 4.

Table 4 Example of sentences in XML file

File Name          Par. No    Sen. No    Word No    Word
ID_396984_M.txt    0          0          0          Sigara
ID_396984_M.txt    0          0          1          kullanımının
ID_396984_M.txt    0          0          2          azalması
ID_396984_M.txt    0          0          3          konusunda
ID_396984_M.txt    0          0          …          …
ID_396984_M.txt    0          0          23         başvurular
ID_396984_M.txt    0          0          24         oldu
ID_396984_M.txt    1          0          0          Prof. Dr.
ID_396984_M.txt    1          0          1          Tuncer
ID_396984_M.txt    1          0          …          …
ID_396984_M.txt    1          0          25         bildirdi
ID_396984_M.txt    1          1          0          Tuncer
ID_396984_M.txt    1          1          1          aksi
ID_396984_M.txt    1          1          2          taktirde
ID_396984_M.txt    1          1          …          …
ID_396984_M.txt    1          1          17         ifade
ID_396984_M.txt    1          1          18         etti
…                  …          …          …          …

The newly developed algorithm works as shown in Figure 1. The abbreviation and rule XML files are parsed using the XML Document class of Visual Studio .NET. The abbreviation list is inserted into an array called “Abbrevations” and the rules into a class called “Rule”, in which each rule is stored in triple format together with its end-of-sentence result (e.g. L.U = true, …).

The flowchart begins by getting the abbreviation list file name and parsing the Abbrevations nodes, getting the rule list file name and parsing the rules, and then getting the name of the file to parse and opening it. A paragraph is read according to the “Carriage Return” character. If the paragraph is NULL, the end of the file has been found and the result is written into an XML file. Otherwise the characters of the paragraph are read one by one; if the character is NULL, the end of the paragraph has been found, the result is written into an XML file and another paragraph is read.
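The rule lookup and abbreviation check described in this section can be illustrated with a minimal sketch. It is written in Python rather than in the .NET environment used in this work, and it is not the implementation of the study. The names `RULES`, `load_abbreviations`, `char_class`, `is_boundary` and `split_sentences` are hypothetical. Three further points are assumptions of the sketch: a paragraph is tokenized on whitespace instead of being read character by character, an exact rule is preferred over a “?” (any character) rule, and a pair of characters that matches no rule in Table 1 is treated as not ending a sentence.

```python
import xml.etree.ElementTree as ET

EOS_MARKS = ".…!?"        # end-of-sentence punctuation marks (Table 2)
LITERALS = "-,()/'\""      # characters that stand for themselves in the rules

# Rule triples copied from Table 1: True means "end of sentence".
RULES = {
    "L.U": True, "L.#": True, "U.L": True,
    "?.'": True, '?."': True, "?.(": True, "?.)": True, "?.-": True, "?./": True,
    "L.L": False, "?.,": False,
    "#.L": False, "#.U": False, "#.#": False, "#.'": False, '#."': False,
    "#.(": False, "#.)": False, "#.-": False, "#.,": False,
}


def load_abbreviations(xml_text):
    """Collect the text of every <abbr> element (file layout as in Table 3)."""
    root = ET.fromstring(xml_text)
    return {node.text.strip() for node in root.iter("abbr") if node.text}


def char_class(ch):
    """Map a character to its rule symbol: #, a literal, U, L or ? (any)."""
    if ch.isdigit():
        return "#"
    if ch in LITERALS:
        return ch
    if ch.isupper():
        return "U"
    if ch.islower():
        return "L"
    return "?"


def is_boundary(prev_word, next_word, abbreviations, default=False):
    """Decide whether the mark between prev_word and next_word ends a sentence."""
    if prev_word in abbreviations:      # the "Word = Abbreviation" check
        return False
    left = char_class(prev_word[0]) if prev_word else "?"
    right = char_class(next_word[0])
    # Assumption: an exact rule takes precedence over a "?" rule.
    for key in (f"{left}.{right}", f"?.{right}"):
        if key in RULES:
            return RULES[key]
    return default                      # assumption: no rule matched


def split_sentences(paragraph, abbreviations):
    """Split one paragraph into sentences with the rules and abbreviations."""
    tokens = paragraph.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if token[-1] not in EOS_MARKS:
            continue
        is_last = i == len(tokens) - 1
        if is_last or is_boundary(token.rstrip(EOS_MARKS), tokens[i + 1], abbreviations):
            sentences.append(" ".join(current))
            current = []
    if current:                         # the end of a paragraph closes a sentence
        sentences.append(" ".join(current))
    return sentences


abbreviations = load_abbreviations(
    "<abbrevations><abbr>A</abbr><abbr>AB</abbr><abbr>IV</abbr></abbrevations>"
)
text = "Tahta çıkan IV. Murat emirler yağdırdı. Cumhuriyetimizin 75. yılı coşkuyla kutlandı."
for sentence in split_sentences(text, abbreviations):
    print(sentence)
```

With the three abbreviations above, the sketch keeps “IV. Murat” together because “IV” is in the abbreviation list, keeps “75. yılı” together because of the rule #.L = False, and splits after “yağdırdı.” because of the rule L.U = True, so two sentences are printed. A case such as “sayısı 2. Dünya Kupası’na” is not split, since #.U = False; this is one of the ambiguities that a rule list alone cannot resolve.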
[Figure 1: flowchart. Its character-level part shows the decision nodes “Char = letter or number”, “Word = Abbreviation” and “Control if any rule in the rule list is matched”, and the actions “Add character to the word”, “Add word into word list”, “End of Sentence” and “Add sentence into sentence list”.]

Figure 1 Flowchart of Algorithm

3. RESULTS

The following text is taken from a news article:

“Uluslararası Para Fon'u (IMF) heyeti, çalışmalarını Hazine Müsteşarlığı'nda gruplar halinde sürdürdü. Edinilen bilgiye göre, kısmen Türkiye masası şefi Rıza Moghadam ve Hazine Müsteşarı İbrahim Çanakçı'nın da katıldığı toplantılara, Maliye Bakanlığı, Merkez Bankası, BDDK, Özelleştirme İdaresi Başkanlığı, Kamu Bankaları gibi kuruluşların yetkilileri de katıldı.
Bugünkü görüşmelerde, ödemeler dengesi, savunma sanayii, uluslararası rezervler, yerel yönetimler, yatırım programı, kamu mali yönetimi ve kontrol kanunu konuları tartışıldı.”

After the parsing algorithm is run, the result can be obtained in two formats. In the first format, only the paragraphs and the sentences in them are stored in the XML file, as shown below.

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<File OriginalName="MD_ID_396980_M.txt">
  <Paragraph Index="0">
    <Sentence Index="0">Uluslararası Para Fon’u (IMF) heyeti, çalışmalarını Hazine Müsteşarlığı’nda gruplar halinde sürdürdü.</Sentence>
    <Sentence Index="1">Edinilen bilgiye göre, kısmen Türkiye masası şefi Rıza Moghadam ve Hazine Müsteşarı İbrahim Çanakçı’nın da katıldığı toplantılara, Maliye Bakanlığı, Merkez Bankası, BDDK, Özelleştirme İdaresi Başkanlığı, Kamu Bankaları gibi kuruluşların yetkilileri de katıldı.</Sentence>
  </Paragraph>
  <Paragraph Index="1">
    <Sentence Index="0">Bugünkü görüşmelerde, ödemeler dengesi, savunma sanayii, uluslararası rezervler, yerel yönetimler, yatırım programı, kamu mali yönetimi ve kontrol kanunu konuları tartışıldı.</Sentence>
  </Paragraph>
</File>

In the other format, the paragraphs, sentences and words are all stored in the XML file, as shown below.
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<File OriginalName="MD_ID_396980_M.txt">
  <Paragraph Index="0">
    <Sentence Index="0">
      <Word Index="0">Uluslararası</Word>
      <Word Index="1">Para</Word>
      <Word Index="2">Fon’u</Word>
      <Word Index="4">IMF</Word>
      <Word Index="5">heyeti</Word>
      <Word Index="6">çalışmalarını</Word>
      <Word Index="7">Hazine</Word>
      <Word Index="8">Müsteşarlığı’nda</Word>
      <Word Index="9">gruplar</Word>
      <Word Index="10">halinde</Word>
      <Word Index="11">sürdürdü</Word>
    </Sentence>
    <Sentence Index="1">
      <Word Index="0">Edinilen</Word>
      <Word Index="1">bilgiye</Word>
      <Word Index="2">göre</Word>
      <Word Index="3">kısmen</Word>
      <Word Index="4">Türkiye</Word>
      <Word Index="5">masası</Word>
      <Word Index="6">şefi</Word>
      <Word Index="7">Rıza</Word>
      <Word Index="8">Moghadam</Word>
      <Word Index="9">ve</Word>
      <Word Index="10">Hazine</Word>
      <Word Index="11">Müsteşarı</Word>
      <Word Index="12">İbrahim</Word>
      <Word Index="13">Çanakçı’nın</Word>
      <Word Index="14">da</Word>
      <Word Index="15">katıldığı</Word>
      <Word Index="16">toplantılara</Word>
      <Word Index="17">Maliye</Word>
      <Word Index="18">Bakanlığı</Word>
      <Word Index="19">Merkez</Word>
      <Word Index="20">Bankası</Word>
      <Word Index="21">BDDK</Word>
      <Word Index="22">Özelleştirme</Word>
      <Word Index="23">İdaresi</Word>
      <Word Index="24">Başkanlığı</Word>
      <Word Index="25">Kamu</Word>
      <Word Index="26">Bankaları</Word>
      <Word Index="27">gibi</Word>
      <Word Index="28">kuruluşların</Word>
      <Word Index="29">yetkilileri</Word>
      <Word Index="30">de</Word>
      <Word Index="31">katıldı</Word>
    </Sentence>
  </Paragraph>
  …
</File>

These results are produced in two formats so that researchers can use them easily and flexibly for different purposes.

4. CONCLUSION AND FUTURE WORKS

The newly introduced method finds the ends of Turkish sentences correctly and efficiently with the pre-determined rule list when the sentences are written in a formal way and contain no spelling mistakes.
This method is intended to serve as a reference for researchers who study sentence boundary detection methods. Some ambiguities, such as those caused by abbreviations and Roman numerals, are resolved by this work. The ambiguities that cannot be resolved by this method may be addressed in future work by using machine learning and statistical analyses for Turkish. Word types and the roots and suffixes of words can easily be added to this structure because of its readability, flexibility and understandability.

REFERENCES

[1] Shannon, C.E. (1948). A Mathematical Theory of Communication. The Bell System Technical Journal, 27, pp. 379-423, 623-656.
[2] Choi, S.W. (2000). Some Statistical Properties and Zipf’s Law in Korean Text Corpus. Journal of Quantitative Linguistics, 7:1, pp. 19-30.
[3] Nadas, A. (1984). Estimation of probabilities in the language model of the IBM speech recognition system. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32:4, pp. 859-861.
[4] Kukich, K. (1992). Techniques for automatically correcting words in text. ACM Computing Surveys, 24:4, pp. 377-439.
[5] Church, K. & Gale, W. (1991). Probability Scoring for Spelling Correction. Statistics and Computing, pp. 93-103.
[6] Jurafsky, D. & Martin, J.H. (2000). Speech and Language Processing. Prentice Hall, pp. 193-199.
[7] Güngördü, Z. (1993). A lexical-functional grammar for Turkish. MSc Thesis, Computer Engineering Department, Bilkent University, Ankara.
[8] Shannon, C.E. (1951). Prediction and Entropy of Printed English. The Bell System Technical Journal, 30:1, pp. 50-64.
[9] Crystal, D. (1991). A Dictionary of Linguistics and Phonetics, 3rd Edition. Blackwell.
[10] Sinclair, J. (1991). Corpus, Concordance, Collocation. OUP.