
Toward Thai-English Speech Translation
Chai Wutiwiwatchai
National Electronics and Computer Technology Center,
112, Pahonyothin Rd., Klong-luang, Pathumthani 12120 Thailand
Initiated in October 2006, the speech-to-speech translation (S2S) project of the Human Language Technology
laboratory at NECTEC has begun by building a simple prototype from several of our existing pieces of
technology. A well-known HMM and n-gram based statistical ASR is connected to a rule-based MT via the 1-best
recognition result. A corpus-based unit-selection TTS engine generates speech from the translation output. An
alpha system, trained to translate English to Thai, can recognize a small set of approximately 450 lexical words
and 200 sentence patterns in a tourist domain. Ongoing work is to extend the rule-based MT with a translation
memory technique so that translation performance in limited domains can be significantly improved.
1. Introduction
In Thailand, two languages are mainly used: Thai,
the official language, and English. However, only a
small percentage of Thai people are skilled at
communicating in English. Translation between
English and Thai has hence become an important
issue. Not only translation of written text but also
translation of speech is of interest, since it can be
applied in practical speech communication.
Speech-to-speech translation (SST) has been
extensively researched for many years. Most work
has addressed major languages, such as translation
among European languages, American English,
Mandarin Chinese, and Japanese. There has been no
such research initiative for the Thai language. At the
National Electronics and Computer Technology
Center (NECTEC), Thailand, several basic modules
required for building an SST engine, including Thai
automatic speech recognition (ASR), English-Thai
machine translation (MT), and Thai text-to-speech
synthesis (TTS), have been developed. The research
is thus ready to be extended to cover English-Thai
SST.
The aim of the 3-year SST project initiated by
NECTEC is to build an English-Thai SST service
over the Internet for the travel domain, i.e. to be
used by foreigners travelling in Thailand. In the first
year, a baseline system combining the existing
basic modules, applied to the travel domain, is
being developed. In the remaining two years, the
system will be improved by introducing enhanced
versions of the basic modules as well as more
efficient ways to combine them. This article serves
as the first project report, covering the status of the
baseline system, its problems, and the future plan to
enhance it.
2. English-Thai SST
2.1. Automatic speech recognition (ASR)
To translate English speech to Thai speech, the
first component is an English ASR module. Our
current English ASR prototype adopts the
well-known SPHINX toolkit developed by
Carnegie Mellon University [1]. An American
English acoustic model is provided with the
toolkit. An n-gram language model was trained on a
small set of travel-domain sentences. The training
text contains 210 sentence patterns spanning
480 lexical words, all prepared by hand.
Figure 1 shows some examples of sentence patterns.
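As an illustration of how a small set of hand-written sentence patterns with word classes might be turned into n-gram training data, the following minimal sketch expands each pattern into concrete sentences and estimates bigram probabilities. The patterns, the class names (CITY, FOOD), and the add-one smoothing are illustrative assumptions; they are not the project's actual data, and this is not the SPHINX training procedure.

```python
from collections import Counter
from itertools import product

# Hypothetical word classes and patterns (not taken from the project's data).
CLASSES = {
    "CITY": ["bangkok", "phuket"],
    "FOOD": ["rice", "noodles"],
}
PATTERNS = [
    "how do i get to CITY",
    "i would like some FOOD",
]


def expand(pattern):
    """Replace every class token with each of its member words."""
    slots = [CLASSES.get(token, [token]) for token in pattern.split()]
    return [list(words) for words in product(*slots)]


def train_bigram(sentences):
    """Count histories and bigrams over sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        unigrams.update(padded[:-1])                  # history counts
        bigrams.update(zip(padded[:-1], padded[1:]))  # (history, word) counts
    vocab = {w for s in sentences for w in s} | {"</s>"}
    return unigrams, bigrams, vocab


def bigram_prob(prev, word, unigrams, bigrams, vocab):
    """Add-one smoothed estimate of P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))


sentences = [s for p in PATTERNS for s in expand(p)]
unigrams, bigrams, vocab = train_bigram(sentences)
print(len(sentences), "training sentences")
print(bigram_prob("to", "bangkok", unigrams, bigrams, vocab))
```

Expanding the classes before counting lets every member of a class share the contexts written for that class, which is one simple way to stretch a few hundred hand-made patterns over a larger vocabulary.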
In the return direction, a Thai ASR is required.
Instead of using the SPHINX toolkit, we are building
our own Thai ASR toolkit, named “ISPEECH” [2],
which accepts acoustic models in the format of the
Hidden Markov Model Toolkit (HTK) developed at
Cambridge University [3]. A version of ISPEECH
that supports n-gram language models is currently
under development.
Fig. 1. Examples of sentence patterns for language
modeling (uppercase words are word classes;
brackets denote repetition)
2.2. Machine translation (MT)
English-Thai machine translation has been
researched at NECTEC for over 7 years. A
general-domain MT engine called “PARSIT” [4]
has been available as a public service for 4 years.
The MT system, adopted from NEC, Japan, is based
on handcrafted syntactic/semantic rules. At present,
only the English-to-Thai translation service is
active. The opposite direction, Thai to English, is
much more difficult since analyzing Thai text
requires several additional steps. Since Thai writing
has no explicit markers for either sentence or word
boundaries, automatic word/sentence boundary
detection is needed as a preprocessing step.
Moreover, Thai writing is syntactically rather
flexible, i.e. it resembles a spoken language.
Considerable effort is currently being put into a
rule-based MT in which Thai sentences are analyzed
using Combinatory Categorial Grammar (CCG) [5].
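To make the word boundary problem concrete, the sketch below segments unspaced Thai text by dictionary-based longest matching, a classical baseline technique. It is offered only as an illustration under stated assumptions: the three-word-plus-compound lexicon is a toy, the function name `segment` is hypothetical, and nothing here is claimed to be the method used in PARSIT or in the CCG-based analyzer.

```python
# Toy lexicon (an assumption for illustration): "I", "eat", "rice", "fried rice".
LEXICON = {"ฉัน", "กิน", "ข้าว", "ข้าวผัด"}


def segment(text, lexicon):
    """Greedy longest matching: at each position take the longest lexicon word."""
    max_len = max(len(word) for word in lexicon)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                words.append(text[i:j])
                i = j
                break
        else:
            # No lexicon word starts here: emit one character as an unknown token.
            words.append(text[i])
            i += 1
    return words


print(segment("ฉันกินข้าวผัด", LEXICON))
```

Because the search is greedy, "ข้าวผัด" (fried rice) is preferred over the shorter "ข้าว" (rice). The same greediness can pick a wrong boundary when a long match blocks the correct reading of the following text, which is one reason segmentation accuracy remains an issue.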
2.3. Text-to-speech synthesis (TTS)
A Thai text-to-speech synthesis engine
developed by NECTEC is named “VAJA” [6]. The
latest version uses a corpus-based unit-selection
technique. Several features, including phoneme
context, syllabic tone, and phoneme duration,
constrain the selection. To generate English speech,
we can use an existing English TTS engine such
as the one provided with the Microsoft Speech API.
2.4. Integration
All three basic modules described in the previous
subsections were integrated simply by using the
1-best result of ASR as the input to MT and
synthesizing the MT output with TTS. The
prototype system, which runs on a PC, uses a
push-to-talk interface so that errors made by ASR
can be reduced. Figure 2 shows a screen capture of
the SST system. The prototype version only receives
English and returns Thai utterances. The two main
displays show the speech recognition and translation
results.
Fig. 2. A screen capture of the prototype SST
3. Problems and Solutions
Observation of the prototype system reveals
several problems that need to be solved. The first
problem is the coverage of lexical words
recognizable by ASR. We need to enlarge the
language model training set, which at the same time
increases the number of lexical words. The acoustic
model can be made more accurate if a speech corpus
in the particular domain is available. We hence need
to collect a speech corpus in the travel domain.
Regarding the MT engine, the most critical
problem is that it was originally designed for written
rather than spoken Thai. Its translations are therefore
unnatural for travel conversation. Since our MT
engine is based on rules, it is very difficult to
adapt to a spoken language. To translate in a
specific domain, an efficient approach is to apply
other MT techniques such as example-based MT
(EBMT), statistical MT (SMT), or a
translation memory (TM). The TM aims to reuse
existing translations when the input has been seen
before and its translation has been corrected in the
past. Instead of requiring exactly the same phrase,
the EBMT technique provides a more flexible way to
extract parts of phrases or sentences and
produce a translation result by analogy. Since
either the TM or the EBMT engine can be trained
given a parallel text, it is convenient to construct an
MT engine in a particular domain or a translation
Although our TTS has been commercialized,
several problems still cause unnatural synthetic
speech. First, synthetic speech is sometimes not
smooth due to mismatches between concatenated
units. An appropriate signal smoothing technique
and/or an enhanced unit selection algorithm is
required. Unit selection can be improved by
incorporating other prosodic features such as F0 and
by using not only a target cost but also a
concatenation cost [7]. Second, the intelligibility of
the system when pronouncing arbitrary text is one of
the most important issues. Based on our analysis of
the text processing stage, word segmentation and
part-of-speech (POS) tagging turn out to be the first
priorities for improvement.
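The interplay of target cost and concatenation cost can be sketched as a dynamic-programming (Viterbi) search over candidate units. The example is a simplified illustration, not the VAJA implementation: each unit is reduced to a hypothetical pair (F0 at start, F0 at end), each target to a desired mean F0, and both cost functions are assumptions chosen for readability.

```python
def select_units(targets, candidates, target_cost, concat_cost):
    """Viterbi search for the unit sequence with the lowest total cost.

    candidates[i] is the list of database units available for targets[i].
    """
    # best[i][k] = (cumulative cost, index of best predecessor) for candidate k.
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + concat_cost(prev, u), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(targets[i], u), back))
        best.append(row)

    # Trace back from the cheapest final candidate.
    k = min(range(len(best[-1])), key=lambda j: best[-1][j][0])
    total = best[-1][k][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][k])
        k = best[i][k][1]
    return path[::-1], total


# Toy cost functions (assumptions): a unit is (f0_start, f0_end) in Hz.
def target_cost(target_f0, unit):
    return abs((unit[0] + unit[1]) / 2 - target_f0)


def concat_cost(prev_unit, unit):
    return abs(prev_unit[1] - unit[0])  # F0 jump at the join


targets = [100, 120]
candidates = [[(100, 100), (98, 118)], [(120, 120), (140, 100)]]
path, total = select_units(targets, candidates, target_cost, concat_cost)
print(path, total)
```

In this toy case the unit (100, 100) matches the first target perfectly, yet the search prefers (98, 118) because it joins the next unit with a much smaller F0 jump; selecting on target cost alone would produce the less smooth sequence.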
To integrate the basic modules, ASR, MT, and
TTS, rather than just concatenating them, more
strategic approaches can be considered to improve
the overall performance. Rescoring the N-best
outputs of ASR with the syntactic/semantic parser of
MT can alleviate the problem of recognition errors.
Knowing the conversation domain can help in
selecting domain-specific ASR acoustic and language
models and EBMT or TM resources, and even in
updating the TTS corpus.
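The N-best rescoring idea mentioned above can be illustrated with a small sketch. Everything concrete in it is an assumption made for the example: the hypotheses and their ASR log scores are invented, and `parser_score` is a hypothetical stand-in that merely checks membership in a set of parsable sentences, whereas a real system would query the MT engine's syntactic/semantic parser.

```python
# Hypothetical stand-in for the MT parser: sentences it can analyze.
PARSABLE = {"where is the train station"}


def parser_score(hypothesis):
    """Return 0.0 for a parsable hypothesis and a fixed penalty otherwise."""
    return 0.0 if hypothesis in PARSABLE else -5.0


def rescore(nbest, score_fn, weight=1.0):
    """Pick the hypothesis maximizing ASR log score plus weighted parser score.

    nbest is a list of (hypothesis, asr_log_score) pairs, best ASR score first.
    """
    return max(nbest, key=lambda h: h[1] + weight * score_fn(h[0]))[0]


# Invented 2-best list: the ASR slightly prefers the ungrammatical hypothesis.
nbest = [
    ("where is the train stay shun", -10.0),
    ("where is the train station", -10.5),
]
print("1-best:  ", nbest[0][0])
print("rescored:", rescore(nbest, parser_score))
```

With a plain 1-best connection the first hypothesis would be passed to MT; after rescoring, the parsable second hypothesis wins because its small acoustic disadvantage is outweighed by the parser penalty on the first.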
Finally, we aim to make the system accessible
from a portable device such as a PDA. A simple way
is to use the portable device only as a terminal for
speech recording and playback. All the other
processes, including ASR, MT, and TTS, run on a
server that connects to the terminal device by any
means of wireless communication. Though difficult,
it would be highly beneficial to integrate all modules
in the portable device, which would be much more
convenient to use anywhere. This issue is now
widely researched.
4. Ongoing Work
Current activities of the NECTEC SST project
cover several issues. In the ASR module, we are
developing our own ASR toolkit that supports
n-gram language models, which will be used by
the Thai-to-English SST system. To enhance the
acoustic and language models, a Thai speech corpus
as well as a Thai-English parallel corpus in the
travel domain are under construction (footnote 1).
Each monolingual part of the parallel text will be
used to train a specific ASR language model.
For the MT module, we can use the parallel text
to train a TM, EBMT, or SMT engine. We expect to
combine the trained model with our existing
rule-based model, which will hopefully be more
effective than either model alone. Recently, we
have developed a TM engine. It will be incorporated
into the SST engine at this early stage.
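One simple way a TM could be combined with the existing rule-based engine is to consult the memory first and fall back to the rules only on a miss. The sketch below assumes exact matching after light normalization; the memory entries, the function names, and the placeholder `rule_based_mt` are hypothetical and do not describe the actual NECTEC TM engine or PARSIT.

```python
import re

# Hypothetical memory of previously corrected English-to-Thai translations.
MEMORY = {
    "where is the toilet": "ห้องน้ำอยู่ที่ไหน",
    "thank you": "ขอบคุณ",
}


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", "", text.lower()).split())


def rule_based_mt(text):
    """Placeholder for the rule-based engine (an assumption, not PARSIT)."""
    return f"<rule-based translation of: {text}>"


def translate(source, memory, fallback):
    """Return (translation, origin), preferring a stored TM entry."""
    key = normalize(source)
    if key in memory:
        return memory[key], "tm"
    return fallback(source), "rule-based"


print(translate("Where is the toilet?", MEMORY, rule_based_mt))
print(translate("I lost my passport.", MEMORY, rule_based_mt))
```

Frequent travel phrases are then answered with their stored, human-corrected translations, while unseen input still receives a rule-based result. Matching only parts of a sentence, as EBMT does, would require an alignment step that this sketch does not attempt.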
For TTS, several issues have been
researched and integrated into the system. Ongoing
work includes incorporating a Thai intonation
model into unit selection, improving the accuracy of
Thai text segmentation, and studying hidden
Markov model (HMM) based speech synthesis,
which will hopefully provide a good framework for
deploying TTS on portable devices.
5. Conclusion
Although all three basic components, ASR, MT,
and TTS, have been extensively researched and
developed for over five years, integrating them to
form a domain-specific SST engine is not trivial.
The major problem is the generality of our
rule-based MT, which produces unnatural translation
results in a particular domain. Adapting the MT
engine requires a trainable model. The critical issue
is hence to develop such trainable MT, including
TM in the first stage and then EBMT or SMT. The
ASR engine itself is completely data-driven. The
main task after the mentioned issue becomes the
collection of the domain-specific speech and
parallel-text corpora used to train the ASR and MT
engines. Under the A-STAR consortium, such
corpora will cover not only two languages but also
several member languages, for example
Japanese, Korean, Indonesian, and Chinese. This
has potential for developing multilingual SST.
The author would like to thank ATR for
providing the speech and text corpora used in the
SST development.
1 The task is performed by ATR, Japan, under the
Asian Speech Translation Advanced Research (A-STAR)
consortium.
