Advances in Speech Recognition and Translation for Bahasa Indonesia

Hammam Riza and Oskar Riandi
Agency for the Assessment and Application of Technology (BPPT)
hammam@iptek.net.id, oskar@inn.bppt.go.id

Abstract

We describe the latest advances in Bahasa Indonesia processing at BPPT, especially experiments on speech recognition and speech translation. We developed LiSan, a voice recognition system for the Linux operating system, and Perisalah, a transcription and summarization system that uses the speech recognition technology we developed in 2008. We also report on our latest statistical MT system using Cleopatra (ASTAR project) and the open source Moses SMT. Often, the training procedure for statistical machine translation models is based on maximum likelihood or related criteria. A general problem of this approach is that there is only a loose relation between such criteria and translation quality on text.

1. Introduction

The use of speech as a man-machine interface is a breakthrough that increases the accessibility of computers. Accessibility provisions are meant to create a supportive environment in which people with disabilities have full opportunity in all aspects of life, including access to sources of information. Interaction with the computer through keyboard and mouse is replaced by interaction through speech, which is converted into a form the computer understands in order to execute a command or write a document. With appropriate software, the digital gap between users without disabilities and those with physical limitations can be reduced. For other users, speech recognition software makes writing documents and transcribing meetings easy, fast and hands-free. We describe these advanced applications of Bahasa Indonesia speech recognition in this paper.

In addition, increasing globalization and international communication have resulted in a growing need for translation services.
Traditionally, such translations have been done by human translators, but over the past twenty years there has been an increasing effort to build machine translation systems in order to cut the cost and time of translation. Most modern Statistical Machine Translation (SMT) systems rely on aligned bilingual corpora (bitexts), from which they learn how to translate small pieces of text. Many versions of the SMT model are available. Early SMT models were based on individual words and sometimes had very serious problems, especially in decoding. Recent SMT systems usually use a phrase-based approach that combines a translation model and a language model, and this approach is further applied to speech translation. Building the tools for any translation system involves many iterations of changes and performance testing. We report here an advanced method that gives us assurance that an observed increase in the score on a test set reflects a true improvement in system quality.

2. LiSan (Linux Voice Command)

LiSan is an application based on speech recognition technology that has 3 main functions: as a speech recognition system for Indonesian; as a voice interface for operating the computer; and as a means of synthesizing the user's keyboard and mouse interaction from speech. LiSan has several features, such as: executing menu items; launching and closing applications; browsing the internet; moving the mouse; rebooting the computer; turning off the computer as well as logging out; administering Linux; and writing documents, email, chat messages, etc.

LiSan can be operated in 3 modes: silent, command and dictation. Silent mode is used when the user does not want his or her voice processed by LiSan. Command mode is used when carrying out computer commands, and dictation mode is used when writing a document.
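The three operating modes can be pictured as a small dispatch step applied to each recognized utterance. The following Python sketch is only an illustration under our own assumptions: the names Mode and handle_utterance are hypothetical, and the paper does not describe how LiSan actually switches modes or executes commands.

```python
from enum import Enum


class Mode(Enum):
    """The three operating modes named in the text."""
    SILENT = "silent"
    COMMAND = "command"
    DICTATION = "dictation"


def handle_utterance(mode, text):
    """Decide what to do with recognized text (hypothetical helper).

    Returns an (action, payload) pair:
      - ("ignore", None)   in silent mode, the voice is not processed
      - ("execute", cmd)   in command mode, the text is treated as a command
      - ("type", text)     in dictation mode, the text is written out
    """
    if mode is Mode.SILENT:
        return ("ignore", None)
    if mode is Mode.COMMAND:
        # Normalize so that the command lookup is case-insensitive.
        return ("execute", text.strip().lower())
    return ("type", text)
```

In this sketch, the same utterance leads to different actions depending only on the current mode, which is the behavior the three modes are meant to provide.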
Figure 1. LiSan Framework (components shown: microphone voice input; Indonesian ASR (HTK and Julius engine); socket I/F; inter-process communication using libwnck, at-spi, atk, gtk+, gdk, gnome, Xmu and X11; GTK+; Gnome window manager; Gnome applications; word processor (Open Office or text editor); visual and sound output)

LiSan is developed as 2 modules. The first is the Indonesian Automatic Speech Recognition module, which uses HTK for acoustic model development, the CMU LM Toolkit for language model development and the Julius speech recognition engine. The second is an inter-process communication system built on the socket function of Julius, combined with Gnome libraries such as ATK (Accessibility Tool Kit), AT-SPI (Assistive Technology Service Provider Interface) and WNCK (Window Navigator Construction Kit), as well as X Window libraries such as X11 and Xmu.

The speech corpus needed to train the system consists of two types of corpora: voice command and voice dictation. The speech data for voice command consists of 367 sentences with 351 unique words, recorded by 50 men and 50 women. The data for voice dictation consists of 500 sentences, containing 37991 unique words, spoken by 5 men and 5 women.

The speech corpus was processed with HTK version 3.4 to produce the Indonesian acoustic model. The features used are MFCC with the zeroth cepstral coefficient (_0) and the delta coefficients (_D), i.e. MFCC_0_D, totaling 25 parameters. For use in Julius, these were afterwards converted to the target parameter kind MFCC_0_D_N_Z, with the absolute log energy suppressed (_N) and the cepstral mean subtracted (_Z), totaling 25 parameters. Because the speech corpus has no labels, it was labeled automatically using Viterbi forced alignment.

For voice commands, grammar rules implemented as a Deterministic Finite Automaton (DFA) are used.
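To illustrate how a DFA can constrain voice commands, here is a minimal Python sketch of an acceptance check over a word sequence. The states, the tiny vocabulary ("buka", "tutup", "menu", "aplikasi") and the function name accepts are assumptions made for illustration; the actual LiSan grammar is not given in the paper.

```python
# Hypothetical command grammar: a verb followed by an object.
# Each key is (current_state, word); each value is the next state.
TRANSITIONS = {
    ("start", "buka"): "verb",      # "open"
    ("start", "tutup"): "verb",     # "close"
    ("verb", "menu"): "accept",
    ("verb", "aplikasi"): "accept",  # "application"
}
ACCEPTING = {"accept"}


def accepts(words):
    """Return True if the word sequence is a valid command in the toy grammar."""
    state = "start"
    for word in words:
        state = TRANSITIONS.get((state, word))
        if state is None:
            # No transition defined: the sequence is not in the grammar.
            return False
    return state in ACCEPTING
```

Because each (state, word) pair has at most one successor, the automaton is deterministic, and a recognizer restricted to it can only output word sequences that form complete commands.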
For document writing (voice dictation), an n-gram language model is used instead, built with the CMU Statistical Language Model Toolkit. The text data originate from an online collection of national newspaper articles, approximately 2.5 million sentences (423,568 unique words).

The integration of speech recognition with Gnome/Linux is realized by an inter-process communication system that uses the socket function available in Julius together with libraries in the Gnome environment such as ATK (Accessibility Tool Kit), AT-SPI (Assistive Technology Service Provider Interface) and WNCK (Window Navigator Construction Kit).

3. Perisalah (Transcription System)

BPPT is making use of speech recognition technology in an application called Perisalah, which produces minutes of meetings. With Perisalah, every speech delivered in a meeting is transcribed automatically. Other information, including the time stamp and the speaker's name, is recorded in real time. Perisalah has 4 main functions: as a speech recognition system for Bahasa Indonesia; as a transcription system for meetings; as a document summarization system; and as the manager of the meeting archives.

Several features of Perisalah are as follows:

1. Editing of transcription. The minute-taker can correct the transcription of the meeting or speech on the fly by listening to the recorded voice of the part to be corrected. If wrong or awkward text is encountered, it is corrected in accordance with the voice that is heard.

2. Summarization. Perisalah can summarize documents using SIDoBI (Indonesian Document Summarization System). SIDoBI was developed using the open source software MEAD.

Figure 2. Perisalah Framework

In principle, the enhancements added to SIDoBI on top of MEAD are an Indonesian IDF dictionary, a MeadPHP interface, and a web-based GUI. The IDF dictionary is generated from a text corpus of 2.5 million sentences. The module for the minutes archive is developed to facilitate the handling of data and information by keeping the recording of the meeting as digital speech and the transcription of the meeting in a database (MySQL).

4. Indonesian-English SMT System

In this section, we describe the experiments behind our English-Bahasa Indonesia SMT system, especially in building the language model. Between similar languages such as English and French, the difference in word position is small, and in such cases a short phrase table poses little problem. However, languages that differ greatly, like Bahasa Indonesia and English, where word positions differ widely, require a long phrase table for accurate translation. Performance differs widely depending on the methods used to build the phrase translation table.

Many statistical machine translation tools have been developed. We utilized general tools for SMT, such as GIZA++ v1.0.3 and Moses, as well as Cleopatra, developed by NICT-ATR. We complement these systems with a small number of preprocessing tools for building temporal corpus.

Moses is a statistical machine translation system that allows translation models to be trained automatically for any language pair. Moses needs a collection of translated text (a parallel corpus). Similar to Pharaoh SMT, Moses requires language model and translation model toolkits. Currently, the most successful such systems employ so-called phrase-based methods.

The translation tables are the main knowledge source for the machine translation decoder. The decoder consults these tables to figure out how to translate input in one language into output in another language.
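How a decoder consults a translation table can be illustrated with a minimal Python sketch. It assumes the plain-text layout used by Moses phrase tables, in which fields are separated by "|||" (source phrase, target phrase, scores). The sample entries, their scores and all function names below are made up for illustration, and ranking candidates by the first score alone is a deliberate simplification: a real decoder combines several scores with the language model and the reordering model.

```python
# Made-up sample entries in "source ||| target ||| scores" layout.
SAMPLE_LINES = [
    "thank you ||| terima kasih ||| 0.8 0.5",
    "thank you ||| terimakasih ||| 0.2 0.1",
    "good morning ||| selamat pagi ||| 0.9 0.7",
]


def parse_phrase_table_line(line):
    """Split one phrase-table line into (source, target, list of scores)."""
    fields = [field.strip() for field in line.split("|||")]
    if len(fields) < 3:
        raise ValueError("expected at least 3 '|||'-separated fields")
    scores = [float(value) for value in fields[2].split()]
    return fields[0], fields[1], scores


def build_table(lines):
    """Group all candidate translations by source phrase."""
    table = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        source, target, scores = parse_phrase_table_line(line)
        table.setdefault(source, []).append((target, scores))
    return table


def best_translation(table, source):
    """Return the candidate with the highest first score, or None if unknown."""
    candidates = table.get(source)
    if not candidates:
        return None
    return max(candidates, key=lambda candidate: candidate[1][0])[0]
```

With the sample entries, looking up "thank you" returns "terima kasih", because its first score (0.8) is higher than that of the competing candidate (0.2).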
Because the model is a phrase translation model, the translation tables contain not only single-word entries but also multi-word entries. These are called phrases, but the concept means nothing more than an arbitrary sequence of words. Phrase-based translation models are acquired from a word-aligned parallel corpus by extracting all phrase pairs that are consistent with the word alignment. Given the set of extracted phrase pairs with counts, various scoring functions are estimated. As with phrase-based models, a factored translation model can be seen as the combination of several components (language model, reordering model, translation steps, and generation steps).

Here is example output showing entries from our phrase table:

    bill as well ||| menyelesaikan tagihan ||| 0.333333
    cause i saw him a ||| karena saya melihat dia ||| 0.25 2.26104e-10 1 0.0255653 2.718

The first score of each entry is its phrase translation probability: 0.333333 for the first entry and 0.25 for the second.

The translator requires a corpus to build the translation model and the language model. The language model should be trained on a corpus that is suitable for the domain. If the translation model is trained on a parallel corpus, then the language model should be trained on the output side of that corpus. Language models are applied in many natural language processing applications, such as speech recognition and machine translation, to encapsulate syntactic, semantic and pragmatic information.

Building a language model with the SRILM toolkit requires a lexicon text consisting of a set of words. We analyze how quickly the alphabetically ordered lexicon text is processed for 5K, 10K, 20K, 50K, and 100K corpora on two personal computers with different processors. The first computer has a Core 2 Duo 2G with 16 gigabytes of RAM; the second has a Pentium IV with 512 megabytes of RAM.

Available LM toolkits include SRILM, IRSTLM, and RandLM. These toolkits provide commands to estimate and compile language models. We used the SRILM and IRSTLM toolkits to estimate and compile our language models.
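As a rough picture of what such toolkits estimate, the Python sketch below counts n-grams in a toy corpus and computes an unsmoothed, relative-frequency estimate as a base-10 logarithm, the scale used in ARPA-format language model files. This is an illustration under stated assumptions, not the procedure of SRILM or IRSTLM: those toolkits apply discounting and backoff, so their values differ. The three sentences and the function names are made up for the example.

```python
import math
from collections import Counter

# Made-up toy corpus; a real model is trained on a large text collection.
TOY_CORPUS = [
    "Saya bisa bekerja",
    "Saya bisa berbicara",
    "Saya harus bekerja",
]


def ngram_counts(sentences, n):
    """Count all n-grams of order n, padding sentences with <s> and </s>."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def mle_log10_prob(sentences, ngram):
    """Unsmoothed log10 P(last word | preceding words) for an n-gram, n >= 2."""
    ngram = tuple(ngram)
    if len(ngram) < 2:
        raise ValueError("this sketch only handles n-grams of order 2 or more")
    numerator = ngram_counts(sentences, len(ngram))[ngram]
    denominator = ngram_counts(sentences, len(ngram) - 1)[ngram[:-1]]
    if numerator == 0:
        # Unseen n-gram: zero probability without smoothing.
        return float("-inf")
    return math.log10(numerator / denominator)
```

For example, mle_log10_prob(TOY_CORPUS, ("Saya", "bisa", "bekerja")) divides the single occurrence of the trigram by the two occurrences of its history "Saya bisa", giving log10(0.5). The unseen trigram case shows why real toolkits need smoothing.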
We use a 40K Bahasa Indonesia corpus to compare the n-gram counts produced by the SRILM and IRSTLM toolkits. Here is an excerpt of the 3-gram language model built from the 40K Indonesian corpus with the SRILM toolkit:

    \data\
    ngram 1=14182
    ngram 2=89770
    ngram 3=24755

    \1-grams:
    -4.338892  adakah
    -3.507295  adalah
    -4.453401  adat
    -3.542057  adik

    \2-grams:
    -2.516168  Saya gagal
    -2.252927  Saya hampir
    -2.324282  Saya hanya
    -1.572783  Saya harus
    -2.409713  Saya ingin

    \3-grams:
    -1.075141  Saya bisa bekerja
    -1.029384  Saya bisa berbicara
    -1.780102  Saya bisa berenang
    -1.655195  Saya bisa bermain
    -1.655195  Saya bisa melihat

The listing also contains backoff weights whose row alignment is not recoverable here: -0.4798526 and -0.1455572 in the 1-gram section, and -0.0803811, -0.02823579, -0.1901582 and -0.06643879.

A similar result is obtained using IRSTLM, except when computing larger orders of n-gram. Training a language model from huge amounts of data can be very expensive in memory and time. The IRSTLM toolkit features algorithms and data structures suitable for estimating, storing, and accessing very large LMs. The SRILM toolkit is good for 1-gram counts, but the IRSTLM toolkit is better for 2-gram and 3-gram counts.

In recent years, various methods have been proposed to automatically evaluate machine translation quality by comparing hypothesis translations with reference translations. Examples of such methods are multi-reference word error rate (WER), position-independent word error rate, generation string accuracy, BLEU score, and NIST score. All these criteria try to approximate human assessment and often achieve an astonishing degree of correlation with human subjective evaluation of fluency and adequacy. Evaluation of our SMT system using 40K sentences of the English-Bahasa Indonesia corpus resulted in a BLEU score of 0.854.

5. References

[1] Koehn, Philipp, Franz Josef Och, Daniel Marcu, “Statistical Phrase-Based Translation”, Proceedings of HLT-2003, 2003.
[2] Murakami, Jin’ichi, Masato Tokuhisa, Satoru Ikehara, “Statistical Machine Translation with Long Phrase Table and without Long Parallel Sentence”, Proceedings of IWSLT 2008, Hawaii, 2008.
[3] Koehn, Philipp, “Statistical Significance Tests for Machine Translation Evaluation”, http://www.iccs.inf.ed.ac.uk/~pkoehn/publications/bootstrap2004.pdf
[4] Knight, Kevin, “Teaching Statistical Machine Translation”, accessed at http://www.isi.edu/natural-language/mt/teaching-mt.pdf
[5] Mauser, Arne, Evgeny Matusov, Hermann Ney, “Training a Statistical Machine Translation System without Giza++”, Proceedings of LREC 2006.
[6] Moses, http://www.statmt.org/moses/
[7] Budiono, Hammam Riza, Adiansya Prasetya, Henky Mulyadi, “Bidirectional Indonesian – English Statistical Machine Translation”, Balai Ipteknet, Agency for the Assessment and Application of Technology (BPPT), 2008.
[8] Oskar Riandi, Agung Santosa, Gembong S. Wibowanto, Gati C. Handoyo, Sigit H. Prayoga, “IGOS Linux Voice Command”, National Seminar on Empowering Local Languages Through ICT, 10th of August 2008, Jakarta, Indonesia.
[9] Bowo Prasetyo, Teduh Uliniansyah, Oskar Riandi, “SIDoBI: Indonesian Language Document Summarization System”, International Conference on Rural Information and Communication Technology 2009, 17th-18th of June, Bandung, Indonesia.
[10] D. Radev, T. Allison, S. Blair-Goldensohn, J. Blitzer, A. Celebi, S. Dimitrov, E. Drabek, A. Hakim, W. Lam, D. Liu, J. Otterbacher, H. Qi, H. Saggion, S. Teufel, M. Topper, A. Winkel, Z. Zhang, “MEAD - a platform for multidocument multilingual text summarization”, in LREC 2004, Portugal, May 2004.