Toward Thai-English Speech Translation

Chai Wutiwiwatchai
National Electronics and Computer Technology Center,
112 Pahonyothin Rd., Klong-luang, Pathumthani 12120, Thailand

Initiated in October 2006, the speech-to-speech translation (S2S) project under the Human Language Technology Laboratory at NECTEC has moved on to building a simple prototype from several existing pieces of in-house technology. A well-known HMM- and n-gram-based statistical ASR engine is connected to a rule-based MT engine via the 1-best recognition result, and a corpus-based unit-selection TTS engine generates voice from the translation output. An alpha system, trained to translate English to Thai, can recognize a small set of approximately 450 lexical words and 200 sentence patterns in a tourist domain. Ongoing work extends the rule-based MT with a translation memory technique so that translation performance in limited domains can be significantly improved.

1. Introduction

Two languages are mainly used in Thailand: Thai, the official language, and English. However, only a small percentage of Thai people are skilled in communicating in English, so translation between English and Thai is an important issue. Not only translation of written text but also translation of speech is of interest, since the latter can be applied directly to practical spoken communication.

Speech translation, or speech-to-speech translation (SST), has been researched extensively for many years. Most work has concerned major languages, such as translation among European languages, American English, Mandarin Chinese, and Japanese; no such research has been initiated for the Thai language. At the National Electronics and Computer Technology Center (NECTEC), Thailand, several basic modules required for building an SST engine have been developed, including Thai automatic speech recognition (ASR), English-Thai machine translation (MT), and Thai text-to-speech synthesis (TTS). The research is thus ready to be extended to cover English-Thai SST.
The aim of the three-year SST project initiated by NECTEC is to build an English-Thai SST service over the Internet for the travel domain, i.e. for use by foreigners traveling in Thailand. In the first year, a baseline system combining the existing basic modules, adapted to the travel domain, is being developed. In the remaining two years, this system will be improved by introducing enhanced versions of the basic modules as well as more efficient ways of combining them. This article serves as the first project report, covering the status of the baseline system, its problems, and the plan for future enhancement.

2. English-Thai SST

2.1. Automatic speech recognition (ASR)

To translate English speech into Thai speech, the first component needed is an English ASR module. Our current prototype adopts the well-known SPHINX toolkit developed by Carnegie Mellon University [1]. An American English acoustic model is provided with the toolkit. An n-gram language model was trained on a small set of sentences in the travel domain. The training text contains 210 sentence patterns spanning 480 lexical words, all prepared by hand. Figure 1 shows some example sentence patterns.

For the return direction, a Thai ASR module is required. Instead of using the SPHINX toolkit, we are building our own Thai ASR toolkit, named "ISPEECH" [2], which accepts acoustic models in the format of the Hidden Markov Model Toolkit (HTK) developed by Cambridge University [3]. Support for n-gram language models in ISPEECH is currently under development.

Fig. 1. Examples of sentence patterns for language modeling (uppercase tokens are word classes; brackets denote repetition).

2.2. Machine translation (MT)

English-Thai machine translation has been researched at NECTEC for over seven years. A general-domain MT engine called "PARSIT" [4] has been operated as a public service for four years. The system, adopted from NEC, Japan, is based on handcrafted syntactic/semantic rules.
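To illustrate how class-based sentence patterns like those in Figure 1 feed an n-gram language model, the sketch below expands patterns into concrete training sentences and collects bigram counts. The pattern syntax, class inventory (PLACE, FOOD), and example sentences are invented for illustration; they are not the actual NECTEC pattern format or data.

```python
from itertools import product
from collections import Counter

# Hypothetical word classes (the uppercase tokens in the patterns);
# the real system's class inventory is not published here.
classes = {
    "PLACE": ["the airport", "the hotel", "Chatuchak market"],
    "FOOD": ["pad thai", "green curry"],
}

# Hypothetical travel-domain sentence patterns.
patterns = [
    "how do i get to PLACE",
    "i would like to order FOOD",
    "is PLACE far from PLACE",
]

def expand(pattern):
    """Expand one class-based pattern into all concrete sentences."""
    tokens = pattern.split()
    options = [classes.get(tok, [tok]) for tok in tokens]
    for combo in product(*options):
        yield " ".join(combo)

training_sentences = [s for p in patterns for s in expand(p)]

# Bigram counts over the expanded text: the raw material of an n-gram LM
# (a real LM would then add smoothing and probability estimation).
bigrams = Counter()
for sent in training_sentences:
    words = ["<s>"] + sent.split() + ["</s>"]
    bigrams.update(zip(words, words[1:]))
```

With the toy classes above, the three patterns expand to 14 sentences (3 + 2 + 9), showing how a small hand-written pattern set can cover many surface sentences, which is why 210 patterns can span 480 lexical words.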
At present, only the translation service from English to Thai is active. The opposite direction, from Thai to English, is much more difficult, since analyzing Thai text requires several additional steps. Because Thai writing has no explicit markers for word or sentence breaks, automatic word/sentence boundary detection is needed as a preprocessing step. Moreover, Thai writing is syntactically rather flexible, much like a spoken language. A large effort is currently devoted to a rule-based MT system in which Thai sentences are analyzed using combinatory categorial grammar (CCG) [5].

2.3. Text-to-speech synthesis (TTS)

The Thai text-to-speech synthesis engine developed by NECTEC is named "VAJA" [6]. The latest version exploits a corpus-based unit-selection technique, in which several features, including phoneme context, syllabic tone, and phoneme duration, constrain the selection. To generate English speech, an existing English TTS engine can be used, such as the one provided with the Microsoft Speech API.

2.4. Integration

The three basic modules described in the previous subsections were integrated simply by feeding the 1-best ASR result into the MT engine and generating the sound of the MT output with the TTS engine. The prototype system, which runs on a PC, uses a push-to-talk interface so that errors made by the ASR can be alleviated. Figure 2 shows a screen capture of the SST system. This prototype version only accepts English and returns Thai utterances. The two main displays show the speech recognition and translation results.

Fig. 2. A screen capture of the prototype SST system.

3. Problems and Solutions

From observation of the prototype system, several problems need to be solved. The first is the coverage of lexical words recognizable by the ASR. We need to enlarge the language-model training set, which at the same time increases the number of lexical words. The acoustic model could also be made more accurate if a speech corpus in the target domain were available.
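Since Thai script marks no word boundaries, the preprocessing mentioned above must segment text before any MT analysis. Below is a minimal sketch of one classic approach, greedy longest matching against a dictionary; the dictionary and example are toy illustrations, and NECTEC's actual segmenter is not specified in this report.

```python
def longest_match_segment(text, dictionary, max_len=20):
    """Greedy longest-matching word segmentation.

    Scans left to right; at each position takes the longest
    dictionary word that matches, falling back to a single
    character for out-of-vocabulary material.
    """
    words = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: one character
        for j in range(min(len(text), i + max_len), i + 1, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        words.append(match)
        i += len(match)
    return words

# Toy Thai dictionary (illustrative only): "I", "go", "hotel",
# plus the substrings that make "hotel" ambiguous.
dictionary = {"ฉัน", "ไป", "โรง", "แรม", "โรงแรม"}
print(longest_match_segment("ฉันไปโรงแรม", dictionary))
# → ['ฉัน', 'ไป', 'โรงแรม']
```

Greedy longest matching resolves the "โรง"/"แรม"/"โรงแรม" ambiguity in favor of the longest word; real segmenters add statistical disambiguation, which is one reason the report lists segmentation accuracy among the priorities.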
We therefore need to collect a speech corpus in the travel domain.

Regarding the MT engine, the most critical problem is that it was originally built for written rather than spoken Thai, so its translations sound unnatural for travel conversation. Since the engine is rule-based, it is very difficult to adapt to spoken language. For translation in a specific domain, an efficient approach is to adopt other MT techniques such as example-based MT (EBMT), statistical MT (SMT), or, more simply, a translation memory (TM). A TM reuses an existing translation when the same input has been seen, and its translation corrected, in the past. Rather than requiring an exactly matching phrase, the EBMT technique offers a more flexible way of extracting parts of phrases or sentences and producing a translation by analogy. Since both TM and EBMT engines can be trained from a parallel text, it is convenient to construct an MT engine for a particular domain or translation style this way.

Although our TTS engine has been commercialized, several problems still cause unnatural synthetic speech. First, the output is sometimes not smooth because of mismatches between concatenated units. An appropriate signal-smoothing technique and/or an improved unit-selection algorithm is required. Unit selection can be improved by incorporating additional prosodic features such as F0 and by considering not only the target cost but also a concatenation cost [7]. Intelligibility in pronouncing arbitrary text is another of the most important issues; from our analysis of the text-processing pipeline, word segmentation and part-of-speech (POS) tagging turn out to be the first priorities for improvement.

To integrate the basic modules, ASR, MT, and TTS, more strategic approaches than simple concatenation can be considered to improve overall performance. Rescoring the N-best ASR outputs with the syntactic/semantic parser of the MT engine can alleviate recognition errors.
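Unit selection with both a target cost and a concatenation cost, as in [7], amounts to a shortest-path search over candidate units. The sketch below shows the dynamic-programming (Viterbi) form of that search; the unit representation and cost functions are made up for illustration, standing in for the richer acoustic measures a real synthesizer would use.

```python
def select_units(targets, candidates, target_cost, concat_cost):
    """Viterbi unit selection: choose one candidate unit per target,
    minimizing summed target costs plus concatenation costs between
    neighboring units."""
    # best[u] = (lowest total cost of a path ending in unit u, that path)
    best = {u: (target_cost(targets[0], u), [u]) for u in candidates[0]}
    for t, cands in zip(targets[1:], candidates[1:]):
        new_best = {}
        for u in cands:
            prev = min(best, key=lambda p: best[p][0] + concat_cost(p, u))
            cost = best[prev][0] + concat_cost(prev, u) + target_cost(t, u)
            new_best[u] = (cost, best[prev][1] + [u])
        best = new_best
    return min(best.values(), key=lambda cp: cp[0])

# Illustrative only: units are (phone, pitch-in-Hz) tuples; the target
# cost is pitch deviation from the request, the concatenation cost is
# the pitch jump at the join.
targets = [("a", 120), ("b", 125)]
candidates = [
    [("a", 118), ("a", 130)],
    [("b", 126), ("b", 140)],
]
tcost = lambda t, u: abs(t[1] - u[1])
ccost = lambda u, v: abs(u[1] - v[1])
cost, path = select_units(targets, candidates, tcost, ccost)
```

Adding the concatenation term is what penalizes the audible "jumps" at unit joins that the plain target-cost selection described earlier cannot see.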
Knowing the conversation domain can help reduce overall mistakes, by training domain-specific ASR acoustic and language models and domain-specific EBMT or TM models, and even by updating the TTS corpus.

Finally, we aim for the system to be accessible from a portable device such as a PDA. A simple approach is to use the portable device only as a terminal for recording speech and playing it back, with all other processing (ASR, MT, and TTS) running on a server connected to the terminal by some wireless communication means. Though difficult, it would be highly beneficial to integrate all modules on the portable device itself, which would be far more convenient to use anywhere; this issue is now widely researched.

4. Ongoing Work

Current activities of the NECTEC SST project comprise several issues. For the ASR module, we are developing our own ASR toolkit supporting n-gram language models, which will be used by the Thai-to-English SST system. To enhance the acoustic and language models, a Thai speech corpus as well as a Thai-English parallel corpus in the travel domain is being constructed. Each monolingual side of the parallel text will be used to train a domain-specific ASR language model. For the MT module, the parallel text can be used to train a TM, EBMT, or SMT model, which we expect to combine with our existing rule-based model; the combination will hopefully be more effective than either model alone. Recently we have developed a TM engine, which will be incorporated into the SST engine at this early stage. On the TTS side, several issues have been researched and integrated into the system. Ongoing work includes incorporating a Thai intonation model into unit selection, improving the accuracy of Thai text segmentation, and studying hidden Markov model (HMM) based speech synthesis, which will hopefully provide a good framework for porting TTS to portable devices.
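The TM engine mentioned above reuses past translations; its core operation is fuzzy retrieval of the closest stored source sentence. A minimal sketch using word-level edit-based similarity is shown below; the memory entries and the 0.7 threshold are invented for illustration and do not reflect the actual NECTEC engine.

```python
from difflib import SequenceMatcher

# Toy translation memory (English source -> Thai target); entries are
# invented examples, not actual corpus data.
memory = {
    "how much is this": "อันนี้ราคาเท่าไหร่",
    "where is the hotel": "โรงแรมอยู่ที่ไหน",
    "i would like some water": "ขอน้ำหน่อย",
}

def tm_lookup(source, memory, threshold=0.7):
    """Return (translation, similarity) for the stored source sentence
    closest to the input, or (None, similarity) below the threshold."""
    def sim(a, b):
        # Word-level similarity in [0, 1].
        return SequenceMatcher(None, a.split(), b.split()).ratio()
    best_src = max(memory, key=lambda s: sim(source, s))
    score = sim(source, best_src)
    return (memory[best_src], score) if score >= threshold else (None, score)

translation, score = tm_lookup("where is the airport", memory)
# Fuzzy hit: matches "where is the hotel" with similarity 0.75.
```

Note that the fuzzy hit here retrieves the translation of "where is the hotel" for an "airport" query; patching such mismatched fragments rather than returning the whole stored translation is exactly the flexibility EBMT adds over plain TM.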
5. Conclusion

Although all three basic components, ASR, MT, and TTS, have been extensively researched and developed for over five years, integrating them into a domain-specific SST engine is not trivial. The major problem is the generality of our rule-based MT engine, which produces unnatural translations in a particular domain. Adapting the MT engine requires a trainable model; the critical task is hence to develop such trainable MT, starting with TM and later moving to EBMT or SMT. The ASR engine itself is completely data-driven, so the main remaining task becomes the collection of the domain-specific speech and parallel-text corpora used to train the ASR and MT engines; this task is performed by ATR, Japan, under the Asian Speech Translation Advanced Research (A-STAR) consortium. Under the A-STAR consortium, such corpora will cover not only these two languages but also several other member languages, for example Japanese, Korean, Indonesian, and Chinese. This offers great potential for developing a multilingual SST engine.

Acknowledgments

The author would like to thank ATR for supporting the speech and text corpora used in the SST development.

References

[1] CMU SPHINX, http://cmusphinx.sourceforge.net/
[2] ISPEECH, NECTEC ASR toolkit, http://www.nectec.or.th/rdi/ispeech/
[3] The HTK Book, version 3.1, Cambridge University, Dec. 2001, http://htk.eng.cam.ac.uk/
[4] PARSIT, NECTEC English-Thai MT service, http://www.suparsit.com/
[5] Steedman, M., "The Syntactic Process", The MIT Press, Cambridge, MA, 2000.
[6] VAJA, NECTEC Thai TTS engine, http://www.nectec.or.th/rdi/vaja/
[7] Hunt, A., Black, A., "Unit selection in a concatenative speech synthesis system using a large speech database", Proc. ICASSP 1996, vol. 1, pp. 373-376, 1996.