A DEVELOPMENT OF ALGORITHMS FOR THAI LANGUAGE DATA PROCESSING

YUEN POOVARAWAN
Associate Professor
Dept. of Computer Engineering
Kasetsart University, Thailand

CHAIYONG WONGCHAISUWAT
Assistant Professor
Dept. of Computer Engineering
Kasetsart University, Thailand

PORNPOTE SURIYAWONG
Researcher
Dept. of Computer Engineering
Kasetsart University, Thailand

SUMMARY

This paper presents algorithms for processing Thai characters that are intended to be applicable and suitable for Thai data processing. The algorithms cover Thai keyboard input, leveling for Thai CRT display, leveling for Thai printing, Thai word sorting, and Thai word segmentation.

INTRODUCTION

This paper results from the study and development of Thai data processing on microcomputers. The study was carried out on the CP/M and MS-DOS families of operating systems by modifying both the hardware and the software of various brands of microcomputer. Appropriate algorithms for processing keyboard input, and for displaying and printing the data, were developed on the standard functions at the virtual I/O level.

The Thai writing system differs significantly from those of other languages. It has 87 characters, which appear at four different levels: there are 5, 7, 72, and 3 characters at the first, second, third, and fourth levels respectively. The following sample shows the physical structure of the four levels in the Thai writing system. The characters in each group are as shown below. The Thai standard code (TISI 620/1986) for these characters is shown in Fig. 1.

Fig. 1. TISI 620/1986

The multi-level structure is flattened into a linear structure when text is processed on a computer (for example, in a Thai word processor), for reasons of efficient use of storage and ease of data preparation. A system is needed to convert between the linear structure and the original four-level structure.
Therefore, Thai input text consists of linear strings of Thai characters concatenated without any clearly defined boundaries.

BASIC THAI ALGORITHMS

The following basic algorithms are needed for Thai data processing:

1. The Thai keyboard input algorithm.
2. The leveling for Thai display algorithm.
3. The leveling for Thai printing algorithm.
4. The sorting algorithm of Thai words.
5. The basic algorithm for segmenting Thai words.

Thai keyboard input algorithm

The Thai computer keyboard layout was established by the Thai Industrial Standards Institute in 1988. This standard is called TISI 820/1988 and is shown in Fig. 2.

Fig. 2. Thai computer keyboard layout (TISI 820/1988)

Information from the keyboard is received as ASCII codes, including control codes and character codes. The program converts ASCII into Thai code by table lookup; the entries of the table correspond to the layout of the Thai keyboard. The keyboard input algorithm is shown as follows:

The leveling for Thai display algorithm

When these characters appear on the screen, they are positioned in block-shaped cells. There is an unneeded space between the first and second levels, so these two levels need to be combined. Only three lines then remain in the processing. The definition of these three lines is as follows:

In the case of displaying, the algorithm checks the displayed characters against the rules. When data is input, only single characters are recorded; it is therefore necessary to maintain a “flag” state to check whether a combination of characters is legitimate. In the flag state, weight is used to determine the code combination. The algorithm is based on the principle that the sequence of a code combination corresponds to its weight, and its speed depends on the search technique used for code combinations. This algorithm is shown as follows:

The leveling for printing algorithm

The algorithm for Thai printing differs from the algorithm for screen display.
Printed output requires three physical lines for one line of Thai text. This format differs from the way data is recorded in a document file, where the Thai characters are arranged in strings. Before printing, the string is broken up into substrings for the upper line, the normal line, and the lower line. The principle of combining the two upper levels is still used in this algorithm. Accordingly, Input_Str, the string to be printed, is split into Upper_Str, Normal_Str, and Lower_Str for the upper, normal, and lower lines respectively.

The sorting algorithm for Thai words

The Thai language has an accepted, specific word order. Because Thai words include vowels and tone marks, alphabetization is more complicated. The alphabetization of Thai words is based on the word order established by the Thai Royal Academy. An illustration is given as follows.

To sort these words: เส เก แหง่ แห่ง เจ๊า เจ้า ข้อน ข่อนๆ ติงๆ ติ่ง

The sorted order is: เก ข่อนๆ ข้อน เจ้า เจ๊า ติงๆ ติ่ง เส แหง่ แห่ง

An algorithm is needed to divide characters into three groups as follows:

The first group includes tone marks, consisting of ๆ ็ ่ ้ ๊ ๋ ์
The second group consists of vowels like ะ า ำ ิ ี ึ ื ุ ู เ แ โ ใ ไ
The third group contains all consonants plus ฤ, that is, ก to ฮ

The weight of each group is based on these rules.

1. The tone marks (the first group of characters) carry the lowest weight compared with vowels and consonants. The weights of the three types of character are ordered as 1, 2, 3 respectively.
2. The weight of vowels is less than that of consonants. All vowels satisfy the following:
   w2(1) > max[the weight of possible combinations in the first group]
   w2(i+1) > w2(i) + max[the weight of possible combinations in the first group]   (i = 1, 2, 3, ...)
3. Consonants also satisfy the following:
   w3(1) > max[the weight of possible combinations in the first and second groups]
   w3(i+1) > w3(i) + max[the weight of possible combinations in the first and second groups]   (i = 1, 2, 3, ...)

This weighting principle is applied in the sorting algorithm for Thai alphabetization.

The basic algorithm for segmenting Thai words

Besides the peculiar four-level structure of its characters, Thai text runs words together continuously, because spaces are placed only between phrases and clauses. In other words, there are no word boundaries in Thai text, and a segmentation algorithm is needed to separate the words. The algorithm used is based on longest-matching and backtracking techniques. If no word is found in the dictionary after the scan, the program scans forward character by character until the text matches the next word in the dictionary. During the scanning, additional criteria are used, based on these rules.

These algorithms are written in the C language and were tested on microcomputers with an 80286 CPU (10 MHz clock speed). The first three algorithms are used together as a small editor, whose output can be displayed on the screen and sent to the printer. The result is satisfactory: Thai character data can be input through the keyboard, displayed on the monitor screen, and printed correctly.

The fourth algorithm, for alphabetization, was tested using the quicksort algorithm. The results of many experiments are shown in Fig. 3.

Fig. 3. The result of Thai sort algorithm

For the Thai segmentation algorithm, a dictionary of 5,400 frequently used words is used. In the experiments, the success rate of Thai segmentation is about 98-100%, depending on the type of document. If all words in the document are in the dictionary, segmentation succeeds nearly 100% of the time. If there are many technical terms, as in scientific and technological documents, there are more mistakes.
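
The longest-matching and backtracking approach to segmentation described above can be illustrated with a short Python sketch. It is not the authors' C implementation, and it leaves out the additional criteria mentioned in the text. The function name `segment`, the tiny sample dictionary, and the policy for unknown words (leave as few characters as possible unmatched, and group consecutive unmatched characters into one piece) are assumptions made for illustration. Instead of explicit recursion, the sketch fills a table from the end of the string; for text that is fully covered by the dictionary this gives the same answer as trying the longest word first and backing off to a shorter word whenever the remainder cannot be matched.

```python
def segment(text, dictionary):
    """Split `text` into words, preferring the longest dictionary match.

    `dictionary` is a set of known words. Characters that cannot be covered
    by dictionary words are grouped into unknown pieces (assumed policy).
    """
    n = len(text)
    max_len = max((len(w) for w in dictionary), default=0)

    # cost[p]: fewest characters of text[p:] left unmatched by dictionary words.
    # step[p]: (end, known) for the piece chosen at position p.
    cost = [0] * (n + 1)
    step = [None] * (n + 1)

    for pos in range(n - 1, -1, -1):
        best_cost, best_step = None, None
        # Try the longest candidate first. A shorter word replaces it only if
        # it leaves strictly fewer unmatched characters in the remainder,
        # which is the effect of backtracking after a failed longest match.
        for end in range(min(n, pos + max_len), pos, -1):
            if text[pos:end] in dictionary:
                if best_cost is None or cost[end] < best_cost:
                    best_cost, best_step = cost[end], (end, True)
        # Fallback: move forward one character inside an unknown word.
        skip_cost = 1 + cost[pos + 1]
        if best_cost is None or skip_cost < best_cost:
            best_cost, best_step = skip_cost, (pos + 1, False)
        cost[pos], step[pos] = best_cost, best_step

    # Walk the table from the start, merging consecutive unknown characters.
    words, unknown, pos = [], "", 0
    while pos < n:
        end, known = step[pos]
        if known:
            if unknown:
                words.append(unknown)
                unknown = ""
            words.append(text[pos:end])
        else:
            unknown += text[pos:end]
        pos = end
    if unknown:
        words.append(unknown)
    return words


# Hypothetical sample dictionary, far smaller than the one used in the paper.
sample_dictionary = {"ตา", "ตาก", "กลม"}

print(segment("ตากลม", sample_dictionary))    # ['ตา', 'กลม']
print(segment("ตาXYกลม", sample_dictionary))  # ['ตา', 'XY', 'กลม']
```

In the first call the longest match at the start is ตาก, but the remainder ลม is not in the sample dictionary, so the shorter word ตา is chosen and กลม then matches. In the second call the characters XY match nothing and come out as a single unknown piece, which mirrors the extra mistakes reported for documents with many out-of-dictionary terms.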
CONCLUSION

This paper presents algorithms that are needed for Thai data processing. All software applications that handle Thai language data require these algorithms. They can be applied as a filter program in any operating system, as well as in any Thai data processing system, including Thai word processor programs.

REFERENCES

1. Thai Royal Academy (1982), Thai Dictionary, Thai-Watanapanich Press, Bangkok, Thailand.
2. Thai Industrial Standards Institute (1986), TISI 620-1986 Thai Character Code for Computer, Bangkok, Thailand.
3. Thai Industrial Standards Institute (1988), TISI 820-1988 Thai Computer Keyboard Layout, Bangkok, Thailand.
4. Y. Poovarawan et al. (1984), A Development of Thai Algorithms, Proceedings of the 22nd National Conference, Kasetsart University, Thailand. (in Thai)