Advances in Speech Recognition and Translation
for Bahasa Indonesia
Hammam Riza and Oskar Riandi
Agency for the Assessment and Application of Technology (BPPT)
hammam@iptek.net.id
oskar@inn.bppt.go.id
Abstract
We describe the latest advances in Bahasa
Indonesia processing at BPPT, especially in speech
recognition and speech translation experiments. We
developed LiSan, a voice recognition system for the Linux
operating system, and Perisalah, a transcription and
summarization system that uses the speech
recognition technology we developed in 2008. We also report on
our latest statistical MT system, built with Cleopatra (ASTAR project)
and the open source Moses SMT toolkit. Often, the
training procedure for statistical machine translation
models is based on maximum likelihood or related
criteria. A general problem of this approach is that
there is only a loose relation to translation quality on text.
1. Introduction
The use of speech as a man-machine interface is a
breakthrough that increases the accessibility of the
computer. Accessibility provisions are
meant to create a supportive environment in which
persons with disabilities have full opportunity
in all aspects of life, including access to
sources of information.
With a speech interface, interaction with the computer through
keyboard and mouse is replaced by interaction
through speech, which is converted into a
form the computer can use to carry out a
command or write a document. With appropriate
software, the digital gap between able-bodied users
and those with physical limitations can be
reduced. For the general user, speech recognition
software makes writing documents and
transcribing meetings easy, fast and hands-free. We
describe these advanced applications of Bahasa
Indonesia speech recognition in this paper.
In addition, increasing globalization and
international communication have resulted in a growing
need for translation services. Traditionally, such
translations are done by human translators, but over
the past twenty years there has been an increasing effort to
create machine translation systems in order to cut the cost
and time of translation.
Most modern Statistical Machine Translation
(SMT) systems rely on aligned bilingual corpora (bitexts),
from which they learn how to translate small
pieces of text. Many versions of the SMT model are
available. Early SMT models were based on individual
words and sometimes had very serious problems,
especially in decoding. Recent SMT systems usually
use a phrase-based translation model together with a
language model, and these are further applied to speech
translation.
Building the tools for any translation system
involves many iterations of changes and performance
testing. We report here a method
that gives us assurance that an observed increase in
the score on a test set reflects a true improvement in
system quality.
2. LiSan (Linux Voice Command)
LiSan is an application based on speech recognition
technology that has 3 main functions: recognizing
Indonesian speech; serving as a voice interface for
operating the computer; and emulating the user's
keyboard and mouse interaction through speech.
LiSan also has several features: opening menus;
launching and closing applications; browsing the
internet; moving the mouse; rebooting, shutting down
and logging out of the computer; administering Linux;
and writing documents, email, chat messages, etc.
LiSan can be operated in 3 modes: silent,
command and dictation. Silent mode is used when the
user does not want his or her voice processed by LiSan.
Command mode is used for carrying out
computer commands, and dictation mode is used
when writing a document.
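The three modes can be pictured as a small dispatcher that decides what to do with each recognized utterance. The following is a minimal sketch and not LiSan's actual code: the class name, the returned action tuples and the Indonesian mode-switching phrases are all hypothetical, and it assumes that mode-switching phrases are still honored in silent mode so that the user can leave it.

```python
from enum import Enum


class Mode(Enum):
    SILENT = "silent"
    COMMAND = "command"
    DICTATION = "dictation"


# Hypothetical spoken phrases that switch modes (not taken from LiSan).
MODE_SWITCH = {
    "mode diam": Mode.SILENT,
    "mode perintah": Mode.COMMAND,
    "mode dikte": Mode.DICTATION,
}


class ModeDispatcher:
    """Routes a recognized utterance according to the current mode."""

    def __init__(self):
        self.mode = Mode.SILENT

    def handle(self, utterance):
        text = utterance.strip().lower()
        # Assumption: switching phrases are recognized in every mode.
        if text in MODE_SWITCH:
            self.mode = MODE_SWITCH[text]
            return ("switch", self.mode.value)
        if self.mode is Mode.SILENT:
            return ("ignore", None)      # voice is not processed
        if self.mode is Mode.COMMAND:
            return ("command", text)     # to be mapped to a desktop action
        return ("type", utterance)       # dictation: insert text as spoken
```

Keeping the mode decision in one place means the recognizer output never needs to know whether it will be typed or executed.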
[Figure 1 components: Mic; Voice; Indonesian ASR (HTK and Julius Engine); Socket I/F; Inter Process Communication (libwnck, at-spi, atk, gtk+, gdk, gnome; Xmu, X11); GTK+; Gnome Window Manager; Gnome Application; Word Processor (Open Office or Text Editor); Visual and Sound]
Figure 1. LiSan Framework
LiSan is developed in 2 modules. The first is Indonesian
Automatic Speech Recognition, using HTK for
acoustic model development, the CMU LM Toolkit for
language model development and the Julius speech
recognition engine. The second is an inter-process
communication system built on the socket interface
of Julius, combined with
Gnome libraries such as ATK (Accessibility Tool Kit),
AT-SPI (Assistive Technology Service Provider
Interface) and WNCK (Window Navigator
Construction Kit), as well as X Window libraries such as
X11 and Xmu.
The speech corpus needed to train the system
consists of two types of corpora: voice
command and voice dictation. The speech data for
voice command consist of 367 sentences with
351 unique words, recorded by 50
men and 50 women. The data for voice dictation
consist of 500 sentences containing 37991 unique
words, spoken by 5 men and 5 women.
The speech corpus was processed with HTK version
3.4 to produce the Indonesian acoustic model.
The parameters used are MFCC with the zeroth cepstral
coefficient (_0) and delta coefficients (_D), that is
MFCC_0_D, 25 parameters. For use in
Julius, these were afterwards converted to the
target parameter kind MFCC_0_D_N_Z, with
absolute log energy suppressed (_N) and
cepstral mean subtracted (_Z), totaling 25 parameters.
Because the speech corpus had no labels, it was
labeled automatically using Viterbi forced alignment.
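To illustrate what the _D and _Z qualifiers do to a sequence of cepstral frames, here is a minimal sketch in plain Python. It uses the regression formula documented for HTK delta coefficients, d_t = sum over k of k (c_{t+k} - c_{t-k}) divided by 2 sum of k^2; the window of 2 frames and the replication of the first and last frame at the edges are assumptions, and the function names are illustrative rather than part of HTK.

```python
def deltas(frames, window=2):
    """Regression-based delta coefficients for a list of feature frames."""
    n = len(frames)
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(len(frames[t])):
            num = 0.0
            for k in range(1, window + 1):
                # Assumption: replicate the first/last frame at the edges.
                ahead = frames[min(t + k, n - 1)][d]
                behind = frames[max(t - k, 0)][d]
                num += k * (ahead - behind)
            row.append(num / denom)
        out.append(row)
    return out


def subtract_mean(frames):
    """Cepstral mean subtraction: remove the per-dimension mean."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]
```

For a feature that rises by one unit per frame, the delta in the middle of the sequence is 1.0, which is the slope the regression is meant to estimate.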
Grammar rules, represented as a Deterministic Finite
Automaton (DFA), are used for voice command,
whereas an n-gram language model, built with the
CMU Statistical Language Model Toolkit, is used for
writing documents (voice dictation).
The text data originate from an online
collection of national newspaper articles,
approximately 2.5 million sentences (423,568 unique
words).
Speech recognition is connected to the
Gnome/Linux desktop through an inter-process communication system
that uses the socket interface
provided by Julius and libraries in the Gnome
environment such as ATK (Accessibility Tool Kit), AT-SPI
(Assistive Technology Service Provider Interface)
and WNCK (Window Navigator Construction Kit).
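As an illustration of the socket side, the sketch below parses recognizer output of the kind Julius sends to a client when it is started in module mode: XML-like messages, each terminated by a line containing a single period, with recognized words carried in WHYPO elements. The paper does not say which Julius socket facility LiSan uses, so treating it as module mode is an assumption, and the helper names are hypothetical. Only the parsing is shown, so the code runs without a live Julius server.

```python
import re

# Matches the WORD attribute of a <WHYPO .../> element.
WHYPO_WORD = re.compile(r'<WHYPO\s+WORD="([^"]*)"')


def split_messages(buffer):
    """Split received text into complete messages and an unfinished remainder.

    Messages are assumed to be terminated by a line holding only '.'.
    """
    parts = buffer.split("\n.\n")
    return parts[:-1], parts[-1]


def words_from_message(message):
    """Return the recognized words of a RECOGOUT message, else an empty list."""
    if "<RECOGOUT>" not in message:
        return []
    return WHYPO_WORD.findall(message)
```

In a real client, text read from the socket would be appended to a buffer, complete messages handed to `words_from_message`, and the remainder kept until more data arrives.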
3. Perisalah (Transcription System)
BPPT is making use of speech recognition technology
in Perisalah, an application
for producing minutes of meetings. With Perisalah, every
speech delivered in a meeting is
transcribed automatically. Other information,
including time stamps and speakers’ names, is
recorded in real time. Perisalah has 4 main
functions:
- as a speech recognition system for
Bahasa Indonesia
- as a transcription system for meetings
- as a document summarization system
- as a manager of meeting archives
Several features of Perisalah are as follows:
1. Editing of transcription. The minute-taker can
correct the transcription of the meeting or speech
on the fly by listening to the recorded voice for
the passage to be corrected.
If a wrong or awkward passage is found,
it is corrected in accordance
with the voice that was heard.
2. Summarization. Perisalah can summarize
documents using SIDoBI
(Indonesian Document Summarization System).
SIDoBI was developed using the open source
software MEAD.
Figure 2. Perisalah Framework
In principle, the enhancements added to SIDoBI
on top of MEAD are an Indonesian
IDF dictionary, a MeadPHP interface, and a web-based
GUI. The IDF dictionary is generated from a text corpus
of 2.5 million sentences. The minutes archive module
is developed to facilitate the
handling of data and information by keeping the
recording of the meeting as digital
speech and the transcription of the meeting in a
database (MySQL).
4. Indonesian-English SMT System
In this section, we describe the experiments
behind our English – Bahasa Indonesia SMT system,
especially in building the language model. Between similar
languages such as English and French, the difference in
word position is small, and in such cases a short
phrase table poses little problem. Between Bahasa
Indonesia and English, however, the difference in word
position is wide, and languages that differ this greatly
require a long phrase table for accurate translation.
Performance differs widely depending on the methods
used to build the phrase translation table.
Many statistical machine translation tools have been
developed. We utilized general tools for SMT, such as
GIZA++ v1.0.3 and Moses, as well as Cleopatra, developed
by NICT-ATR. We complement these systems with a
small number of preprocessing tools for building a
temporal corpus.
Moses is a statistical machine translation system
that allows translation models to be trained automatically
for any language pair. Moses needs a collection of
translated text (a parallel corpus). Like the Pharaoh
SMT system, Moses requires language model and translation
model toolkits.
Currently, the most successful such systems employ
so-called phrase-based methods. The translation tables
are the main knowledge source for the machine
translation decoder. The decoder consults these tables
to figure out how to translate input in one language
into output in another language. In a phrase
translation model, the translation tables contain not only
single-word entries but also multi-word entries.
These are called phrases, but the concept means nothing
more than an arbitrary sequence of words.
Phrase-based translation models are acquired from a
word-aligned parallel corpus by extracting all phrase
pairs that are consistent with the word alignment.
Given the set of extracted phrase pairs with counts,
various scoring functions are estimated. As in
phrase-based models, factored translation models can be seen as
the combination of several components (language
model, reordering model, translation steps, and
generation steps).
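The consistency condition can be made concrete with a short sketch. A phrase pair is kept only if no word inside it is aligned to a word outside it. The function below is a simplified illustration rather than the extraction code used by Moses: it keeps only the tightest target span for each source span (it does not extend over unaligned boundary words), and the function name, the `max_len` limit and the example sentence pair are hypothetical.

```python
def extract_phrases(src, tgt, alignment, max_len=4):
    """Extract phrase pairs consistent with a word alignment.

    src, tgt: lists of words; alignment: set of (src_index, tgt_index) pairs.
    """
    pairs = set()
    for s1 in range(len(src)):
        for s2 in range(s1, min(s1 + max_len, len(src))):
            # Target positions linked to any word in the source span.
            linked = [j for (i, j) in alignment if s1 <= i <= s2]
            if not linked:
                continue
            t1, t2 = min(linked), max(linked)
            if t2 - t1 + 1 > max_len:
                continue
            # Consistency: no target word in [t1, t2] may be aligned
            # to a source word outside [s1, s2].
            if any(t1 <= j <= t2 and not (s1 <= i <= s2)
                   for (i, j) in alignment):
                continue
            pairs.add((" ".join(src[s1:s2 + 1]), " ".join(tgt[t1:t2 + 1])))
    return pairs
```

For the pair "saya ingin makan" / "i want to eat" with alignment points (0,0), (1,1) and (2,3), this yields six phrase pairs, including ("ingin makan", "want to eat"). Counting how often each pair is extracted over a whole corpus gives the counts from which the scoring functions are estimated.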
Here are two example entries from our
phrase table:
bill as well ||| menyelesaikan tagihan ||| 0.333333
cause i saw him a ||| karena saya melihat dia ||| 0.25 2.26104e-10 1 0.0255653 2.718
These entries mean that the phrase translation
probabilities are 0.333333 and 0.25, respectively.
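A phrase-table entry of this kind can be read programmatically by splitting on the `|||` separator. The sketch below is a minimal reader, assuming the plain-text layout of source phrase, target phrase and a whitespace-separated score list; the function name is hypothetical. In Moses phrase tables of that period the five scores are usually two phrase translation probabilities, two lexical weights and a constant phrase penalty of 2.718, but which score corresponds to which translation direction should be checked against the Moses version in use.

```python
def parse_phrase_table_line(line):
    """Split one phrase-table line into (source, target, scores)."""
    fields = [field.strip() for field in line.split("|||")]
    source, target = fields[0], fields[1]
    # Scores are optional here so that truncated entries still parse.
    scores = [float(x) for x in fields[2].split()] if len(fields) > 2 else []
    return source, target, scores
```

For example, `parse_phrase_table_line("bill as well ||| menyelesaikan tagihan ||| 0.333333")` returns the two phrases and a one-element score list.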
The translator will require a corpus to build the
translation and language model. The language model
should be trained on a corpus that is suitable to the
domain. If the translation model is trained on a parallel
corpus, then the language model should be trained on
the output side of that corpus.
Language models are applied in many natural
language processing applications, such as speech
recognition and machine translation, to encapsulate
syntactic, semantic and pragmatic information.
Building a language model with the SRILM toolkit
requires a lexicon, a text file consisting of a set of
words. We analyze alphabetically sorted lexicons of 5K,
10K, 20K, 50K, and 100K from the corpus on two personal
computers with different processors. The first computer has a Core
II duo 2G with 16 Gigabytes of RAM; the second has a Pentium IV
with 512 Megabytes of RAM.
The available LM toolkits include SRILM, IRSTLM,
and RandLM. These toolkits provide
commands to estimate and compile language models.
We used the SRILM and IRSTLM toolkits to estimate
and compile our language models.
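What these toolkits estimate can be illustrated with a few lines of Python. The sketch below counts 1-grams to 3-grams with sentence boundary markers and computes unsmoothed maximum-likelihood log10 probabilities, the same log base that language model files report. It is only an illustration: SRILM and IRSTLM apply discounting and back-off that are not reproduced here, and the function names and toy sentences are hypothetical.

```python
from collections import Counter
import math


def ngram_counts(sentences, order=3):
    """Count all n-grams up to the given order, with <s> and </s> markers."""
    counts = {n: Counter() for n in range(1, order + 1)}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts


def mle_log10(counts, ngram):
    """Unsmoothed log10 P(last word | preceding words) for a seen n-gram."""
    n = len(ngram)
    if n == 1:
        total = sum(counts[1].values())
        return math.log10(counts[1][ngram] / total)
    return math.log10(counts[n][ngram] / counts[n - 1][ngram[:-1]])
```

With the three toy sentences "saya ingin makan", "saya ingin tidur" and "saya harus pergi", the bigram ("saya", "ingin") is seen twice after three occurrences of "saya", so its estimate is log10(2/3).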
We used the 40K Bahasa Indonesia corpus to compare
the n-gram counts produced by the SRILM and IRSTLM
toolkits. Here is the 3-gram language model of the 40K
Indonesian corpus as produced by the SRILM toolkit:
\data\
ngram 1=14182
ngram 2=89770
ngram 3=24755

\1-grams:
-4.338892 adakah -0.4798526
-3.507295 adalah
-4.453401 adat
-3.542057 adik -0.1455572

\2-grams:
-2.516168 Saya gagal
-2.252927 Saya hampir
-2.324282 Saya hanya
-1.572783 Saya harus
-2.409713 Saya ingin

\3-grams:
-1.075141 Saya bisa bekerja
-1.029384 Saya bisa berbicara
-1.780102 Saya bisa berenang
-1.655195 Saya bisa bermain
-1.655195 Saya bisa melihat

[Back-off weights listed for the 2-grams, row alignment not recoverable: -0.0803811, -0.02823579, -0.1901582, -0.06643879]
A similar result is also obtained using IRSTLM,
except in computing higher-order n-grams. Training a
language model from huge amounts of data can be
very expensive in memory and time. The IRSTLM
toolkit features algorithms and data structures suitable
to estimate, store, and access very large LMs. In our
comparison, the SRILM toolkit is good for 1-gram counts but the IRSTLM
toolkit is better for 2-gram and 3-gram counts.
In recent years, various methods have been
proposed to automatically evaluate machine translation
quality by comparing hypothesis translations with
reference translations. Examples of such methods are
multi-reference word error rate (WER),
position-independent word error rate, generation string
accuracy, BLEU score, and NIST score. All these criteria
try to approximate human assessment and often achieve
an astonishing degree of correlation with human subjective
evaluation of fluency and adequacy.
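Of the criteria listed above, word error rate is the simplest to state: the word-level edit distance between hypothesis and reference, divided by the reference length. The sketch below is a minimal illustration, not the scoring tool used in our experiments. It assumes whitespace tokenization and a non-empty reference, and it takes the multi-reference variant to be the lowest WER over the available references, which is one common definition; the function names are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def multi_reference_wer(references, hypothesis):
    """Assumption: score against the closest of several references."""
    return min(word_error_rate(r, hypothesis) for r in references)
```

For instance, the hypothesis "saya melihat" against the reference "saya bisa melihat" has one deleted word out of three, giving a WER of 1/3.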
Evaluation of the SMT system using 40K sentences of the
English-Bahasa Indonesia corpus resulted in a BLEU score of
0.854.