A DEVELOPMENT OF ALGORITHMS FOR THAI LANGUAGE DATA
PROCESSING
YUEN POOVARAWAN
Associate Professor
Dept. of Computer Engineering
Kasetsart University, Thailand
CHAIYONG WONGCHAISUWAT
Assistant Professor
Dept. of Computer Engineering
Kasetsart University, Thailand
PORNPOTE SURIYAWONG
Researcher
Dept. of Computer Engineering
Kasetsart University, Thailand
SUMMARY
This paper presents algorithms for processing Thai characters that
are intended to be practical and suitable for Thai data processing.
The algorithms cover Thai keyboard input, leveling for Thai CRT
display, leveling for Thai printing, Thai word sorting, and Thai word
segmentation.
INTRODUCTION
This paper results from the study and development of Thai data
processing on microcomputers. The study was carried out on the CP/M
and MS-DOS families of operating systems by modifying both the
hardware and the software of various brands of microcomputer.
Appropriate algorithms for processing keyboard input and for
displaying and printing the data were developed on the standard
functions at the virtual I/O level.
The Thai language is quite unique. It differs significantly
from other languages. There are 87 alphabetical characters appearing
at four different levels. There are 5, 7, 72, and 3 characters at
the first, second, third, and fourth level respectively.
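The level counts above can be checked with a short sketch. The assignment of particular characters to levels 1, 2 and 4 below is an assumption chosen to match the counts 5, 7 and 3 (tone marks and thanthakhat on top, upper vowels and signs below them, lower vowels and phinthu at the bottom); the paper's own grouping figure is not available here. The function names are hypothetical.

```python
# Assumed level assignment (not taken from the paper's figure).
TOP = set("\u0e48\u0e49\u0e4a\u0e4b\u0e4c")                # level 1: 5 marks
ABOVE = set("\u0e31\u0e34\u0e35\u0e36\u0e37\u0e47\u0e4d")  # level 2: 7 marks
BELOW = set("\u0e38\u0e39\u0e3a")                          # level 4: 3 marks


def thai_repertoire():
    """Return the 87 Thai characters U+0E01..U+0E3A and U+0E3F..U+0E5B."""
    code_points = list(range(0x0E01, 0x0E3B)) + list(range(0x0E3F, 0x0E5C))
    return [chr(cp) for cp in code_points]


def level_of(ch):
    """Return the display level (1 = topmost, 3 = base line, 4 = below)."""
    if ch in TOP:
        return 1
    if ch in ABOVE:
        return 2
    if ch in BELOW:
        return 4
    return 3


def count_by_level():
    """Count how many characters of the repertoire fall on each level."""
    counts = {1: 0, 2: 0, 3: 0, 4: 0}
    for ch in thai_repertoire():
        counts[level_of(ch)] += 1
    return counts
```

Under this assumed assignment, `count_by_level()` gives 5, 7, 72 and 3 characters for levels 1 to 4, which sums to the 87 characters stated in the text.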
The following sample shows the physical structure of the four
levels in Thai Writing system.
The characters in each group are as shown below.
The Thai Standard Code (TISI 620/1986) for these Thai characters
is shown in Fig. 1.
Fig. 1. TISI 620/1986
The multi-level structure is dissolved into a linear structure
when text is processed on a computer (i.e. on Thai word processor)
for reasons of efficient usage of storage and ease of data
preparation. A system is needed to convert between the linear
structure and the original four level structure. Therefore, the Thai
input text will consist of Thai characters in linear strings
concatenated without any clearly defined boundaries.
BASIC THAI ALGORITHMS
Basic algorithms are needed for Thai data processing.
1. The Thai keyboard input algorithm.
2. The leveling for Thai display algorithm.
3. The leveling for Thai printing algorithm.
4. The sorting algorithm of Thai words.
5. The basic algorithm for segmenting Thai words.
Thai keyboard Input algorithm
The Thai computer keyboard layout was established by the Thai
Industrial Standards Institute in 1988. This standard is called TISI
820/1988 and is shown in Fig. 2.
Fig. 2. Thai computer keyboard layout (TISI 820/1988)
The information from the keyboard is recorded in ASCII code,
including control codes and character codes. The program converts
ASCII into Thai code by the table look-up method. The values in the
table correspond to the layout of the Thai keyboard. The
algorithm of keyboard input is shown as follows:
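The original listing is not reproduced in this text. As a stand-in, here is a minimal sketch of the table look-up idea, not the authors' code. The table holds only a handful of entries, which are assumptions to be checked against the layout in Fig. 2. The names `KEY_TABLE`, `translate_key`, `translate_keys` and the `thai_mode` switch are hypothetical.

```python
# Partial ASCII-to-Thai look-up table. The entries are illustrative
# assumptions; a real table would cover every key of the layout.
KEY_TABLE = {
    "d": "\u0e01",  # ko kai
    "k": "\u0e32",  # sara aa
    "l": "\u0e2a",  # so suea
    ";": "\u0e27",  # wo waen
    "y": "\u0e31",  # mai han-akat
    "f": "\u0e14",  # do dek
    "u": "\u0e35",  # sara ii
}


def translate_key(ch, thai_mode=True):
    """Map one ASCII character to a Thai character by table look-up.

    Control codes, and every key when Thai mode is off, pass through
    unchanged. Keys missing from the table also pass through.
    """
    if not thai_mode or ord(ch) < 0x20 or ord(ch) == 0x7F:
        return ch
    return KEY_TABLE.get(ch, ch)


def translate_keys(keys, thai_mode=True):
    """Translate a whole sequence of key codes."""
    return "".join(translate_key(ch, thai_mode) for ch in keys)
```

With these assumed entries, `translate_keys("dk")` yields the two-character string ko kai followed by sara aa, while a trailing newline or any other control code is left as it is.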
The leveling for Thai Displaying algorithm
When these characters appear on the screen, their positions are
in block shapes. There is an unneeded space between the first and
the second level, so the first level and the second level need to be
combined.
There are only three lines left in the processing. The
definition of these three lines is as follows:
In case of displaying, the algorithm checks the rules of
the displayed characters. When data are input, only single characters
are recorded; it is therefore necessary to create a “flag” state
to check whether a combination of characters is legitimate.
In the flag state, a weight is used to determine the code
combination. The algorithm is based on the principle that the
sequence of code combination corresponds to its weight. Its speed
depends on the search technique used for the code combination. This
algorithm is shown as follows:
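The original listing is not reproduced in this text. The sketch below illustrates only the legitimacy check, under assumed rules that are not taken from the paper: an upper or lower vowel must follow a consonant, and a top-level mark must follow a consonant or an upper or lower vowel. The weight-based search for the combined upper-line code is not modeled, and all names are hypothetical.

```python
# Assumed level sets (not taken from the paper's figure).
TOP = set("\u0e48\u0e49\u0e4a\u0e4b\u0e4c")                # level 1
ABOVE = set("\u0e31\u0e34\u0e35\u0e36\u0e37\u0e47\u0e4d")  # level 2
BELOW = set("\u0e38\u0e39\u0e3a")                          # level 4


def level_of(ch):
    """Return the display level of a character (3 = base line)."""
    if ch in TOP:
        return 1
    if ch in ABOVE:
        return 2
    if ch in BELOW:
        return 4
    return 3


def is_consonant(ch):
    """True for U+0E01..U+0E2E, the Thai consonant range."""
    return "\u0e01" <= ch <= "\u0e2e"


def accepts(prev, ch):
    """Decide whether ch may be displayed after prev (None at line start)."""
    level = level_of(ch)
    if level == 3:
        return True               # base-line characters are always accepted
    if prev is None:
        return False              # a mark needs something to sit on
    if level in (2, 4):
        return is_consonant(prev)
    # level 1: allowed on a consonant or on top of an upper/lower vowel
    return is_consonant(prev) or level_of(prev) in (2, 4)


def filter_keys(keys):
    """Keep only the characters that form legitimate combinations."""
    accepted = []
    for ch in keys:
        prev = accepted[-1] if accepted else None
        if accepts(prev, ch):
            accepted.append(ch)
    return "".join(accepted)
```

Here the "flag" of the text corresponds to the previously accepted character: a second tone mark typed after a first one is rejected, because the previous character is then itself a level-1 mark.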
The leveling for printing algorithm
The algorithm for Thai printing differs from the algorithm for
screen display. The printed output requires three lines for one line
of Thai text. This format is different from the way data are recorded
in the document file system, where the Thai character data are
arranged in strings. Before printing, the string is broken up into
substrings for printing the upper line, the normal line and the
lower line. The principle of combining the two upper lines is still
used in this algorithm.
According to this principle, Input_Str which is the string to
be printed will be split into Upper_Str, Normal_Str and Lower_Str as
the upper, normal and lower line respectively.
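The split described above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation. The level sets are assumed. Each result is a list with one cell per print column rather than a printer-ready string, and the two upper levels are merged by concatenating their marks in one cell. `split_for_print` is a hypothetical name.

```python
# Assumed level sets (not taken from the paper's figure).
UPPER = set("\u0e48\u0e49\u0e4a\u0e4b\u0e4c"               # level 1
            "\u0e31\u0e34\u0e35\u0e36\u0e37\u0e47\u0e4d")  # level 2
LOWER = set("\u0e38\u0e39\u0e3a")                          # level 4


def split_for_print(input_str):
    """Split a linear Thai string into upper, normal and lower lines.

    Returns three lists of equal length, one cell per print column.
    An empty cell means nothing is printed at that position.
    """
    upper_str, normal_str, lower_str = [], [], []
    for ch in input_str:
        if ch in UPPER or ch in LOWER:
            if not normal_str:
                # A mark with no base character gets a blank column.
                upper_str.append("")
                normal_str.append(" ")
                lower_str.append("")
            if ch in UPPER:
                upper_str[-1] += ch   # levels 1 and 2 share the upper line
            else:
                lower_str[-1] += ch
        else:
            # A base-line character opens a new column.
            upper_str.append("")
            normal_str.append(ch)
            lower_str.append("")
    return upper_str, normal_str, lower_str
```

A printer driver would then emit the three lists as three passes over the same physical line, replacing each multi-mark upper cell by its combined code.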
The sorting algorithm for Thai words
There is an accepted specific word order in the Thai language.
Since Thai words include vowels and tone marks, alphabetization is
more complicated. The alphabetization of Thai words is based on the
word order established by the Thai Royal Academy. An illustration is
given as follows:
To sort these words:
เส เก แหง่ แห่ง เจ๊า เจ้า ข้อน ข่อนๆ ติงๆ ติ่ง
The sorted order is:
เก ข่อนๆ ข้อน เจ้า เจ๊า ติงๆ ติ่ง เส แหง่ แห่ง
An algorithm is needed to divide characters into three groups
as follows:
The first group includes tone marks consisting of
ๆ ็ ่ ้ ๊ ๋ ์
The second group consists of vowels like
ะ ั า ำ ิ ี ึ ื ุ ู เ แ โ ใ ไ
The third group contains all consonants plus ฤ, that is, ก .. ฮ
The weight of each group is based on these rules.
1. The tone marks, i.e. the first group of characters, carry the
   lowest weight compared with vowels and consonants. The weights
   of these three types of character are arranged as 1, 2, 3
   respectively.
2. The weight of vowels is less than that of the consonants.
   The vowel weights w2 satisfy:
   w2(1) > max[the weight of possible combination in the first group]
   w2(i+1) > w2(i) + max[the weight of possible combination in the
   first group]   (i = 1, 2, 3, ...)
3. The consonant weights w3 satisfy:
   w3(1) > max[the weight of possible combination in the first and
   second group]
   w3(i+1) > w3(i) + max[the weight of possible combination in the
   first and second group]   (i = 1, 2, 3, ...)
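One way to build weights that satisfy these inequalities is sketched below. It assumes that a character cluster carries at most one mark from the first group and at most one vowel from the second group, so the maximum combination weight of a group is simply its largest single weight. The group sizes are parameters, and the function names are hypothetical.

```python
def build_weights(n_marks, n_vowels, n_consonants):
    """Return weight lists (w1, w2, w3) satisfying rules 1 to 3.

    Assumes at most one mark and one vowel per cluster, so the largest
    combination weight of a group equals its largest single weight.
    """
    w1 = list(range(1, n_marks + 1))            # group 1: 1, 2, 3, ...
    max1 = w1[-1]
    # Successive vowel weights differ by more than max1.
    w2 = [(max1 + 1) * (i + 1) for i in range(n_vowels)]
    max12 = w2[-1] + max1                       # heaviest vowel plus mark
    # Successive consonant weights differ by more than max12.
    w3 = [(max12 + 1) * (i + 1) for i in range(n_consonants)]
    return w1, w2, w3


def cluster_weight(w_consonant, w_vowel=0, w_mark=0):
    """Weight of one cluster; 0 stands for 'no vowel' or 'no mark'."""
    return w_consonant + w_vowel + w_mark
```

With gaps of this size a cluster weight acts like a mixed-radix number: two clusters compare first by consonant, then by vowel, and only then by mark, which is the priority the rules describe.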
This weighting principle is applied in the sorting algorithm
for Thai alphabetization.
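As an illustration, the sketch below builds a sort key in the same spirit; it is not the authors' algorithm. It makes three assumptions: a leading vowel is compared after the consonant that follows it, consonants and vowels are compared in Unicode code-point order rather than by the paper's weight table, and first-group marks only break ties, attached to the character they follow. The name `sort_key` is hypothetical.

```python
LEADING = "\u0e40\u0e41\u0e42\u0e43\u0e44"           # vowels written before the consonant
GROUP1 = "\u0e46\u0e47\u0e48\u0e49\u0e4a\u0e4b\u0e4c"  # assumed first group, U+0E46..U+0E4C
W1 = {ch: i + 1 for i, ch in enumerate(GROUP1)}      # mark weights 1..7
RADIX = len(GROUP1) + 1


def sort_key(word):
    """Return (primary, marks) for comparing Thai words.

    primary: consonants and vowels with each leading vowel moved after
             its consonant, compared by code point (an assumption).
    marks:   one weight per primary character; 0 means no mark.
    """
    chars = list(word)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in LEADING:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2
        else:
            i += 1
    primary, marks = [], []
    for ch in chars:
        if ch in W1:
            if marks:
                # Fold several marks on one character into one number.
                marks[-1] = marks[-1] * RADIX + W1[ch]
        else:
            primary.append(ch)
            marks.append(0)
    return "".join(primary), tuple(marks)
```

Applied to the ten words of the sorted list above, `sorted(words, key=sort_key)` should leave their order unchanged. Any comparison sort, including quick sort, can use the same key.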
The basic algorithm for segmenting Thai words.
Besides the peculiar structure of 4 levels of characters, Thai
words are continuously concatenated because spacing is placed between
phrases and clauses only. In other words there are no word
boundaries in Thai texts.
A segmentation algorithm is needed to separate words. The
algorithm used is based on longest matching and backtracking
techniques.
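A minimal sketch of longest matching with backtracking is given below, assuming the dictionary is a plain set of words. It illustrates the general technique rather than the authors' program: it returns `None` when no complete segmentation exists and does not handle unknown words. The name `segment` is hypothetical.

```python
def segment(text, dictionary):
    """Split text into dictionary words, trying the longest match first.

    If the longest candidate leads to a dead end, the search backtracks
    to the next shorter candidate. Returns a list of words, or None when
    the text cannot be fully covered by dictionary words.
    """
    max_len = max((len(word) for word in dictionary), default=0)
    dead = set()   # start positions already known to lead nowhere

    def solve(start):
        if start == len(text):
            return []
        if start in dead:
            return None
        longest_end = min(len(text), start + max_len)
        for end in range(longest_end, start, -1):   # longest first
            candidate = text[start:end]
            if candidate in dictionary:
                rest = solve(end)
                if rest is not None:
                    return [candidate] + rest
        dead.add(start)                              # forces backtracking
        return None

    return solve(0)
```

For example, with a dictionary holding only ตา, ตาก and กลม, the text ตากลม is first tried as ตาก followed by ลม. Since ลม is not in that dictionary, the search backs up and returns ตา and กลม.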
If no word is found in the dictionary after the scan, the
program scans forward character by character until a word in the
dictionary is matched. During the scan, additional criteria are
used, based on these rules.
These algorithms are written in the C language and were tested
on microcomputers with an 80286 CPU (10 MHz clock speed). The
first three algorithms were combined into a small editor, whose
output can be displayed on the screen and sent to the printer. The
result is satisfactory: data in Thai characters can be input through
the keyboard, displayed on the monitor screen and printed out
correctly.
The fourth algorithm, for alphabetization, was tested using the
quick sort algorithm. The results of many experiments are shown in
Fig. 3.
Fig. 3. The result of Thai sort algorithm
For the Thai segmentation algorithm, a dictionary of 5400
frequently used words was used. In the experiment, the success rate
of Thai segmentation was about 98-100%, depending on the type of
document. If all words in the document are in the dictionary, the
segmentation is nearly 100% successful. If there are many technical
terms, as in scientific and technological documents, there will be
more mistakes.
CONCLUSION
This paper presents algorithms which are needed for Thai data
processing. All software applications that handle Thai language
data require these algorithms. They can be applied as a filter
program in all operating systems as well as in any Thai data
processing system, including Thai word processor programs.
REFERENCES
1. Thai Royal Academy (1982), Thai Dictionary, Thai-Watanapanich
   Press, Bangkok, Thailand.
2. Thai Industrial Standards Institute (1986), TISI 620-1986 Thai
   Character Code for Computer, Bangkok, Thailand.
3. Thai Industrial Standards Institute (1988), TISI 820-1988 Thai
   Computer Keyboard Layout, Bangkok, Thailand.
4. Y. Poovarawan et al. (1984), A Development of Thai Algorithms,
   Proceedings of the 22nd National Conference, Kasetsart University,
   Thailand. (in Thai)