Indonesian Language

AFNLP 2008 Meeting
Indonesia Country Report
Hammam Riza
hammam@iptek.net.id
Agency for the Assessment and
Application of Technology (BPPT)
Ministry of Research and Technology
Republic of Indonesia
TOC




Past Activities
Activities in 2007
Activities Plan 2008, 2009
National Language Year 2008
Past NLP Research Projects in Indonesia












Indonesian Text-To-Speech (BPPT, ITB, UI)
GDA/MMA/Linguistic-DS MPEG-7 (Multimedia Annotation)
Cross-Linguistic Portal (dictionaries, corpus, tools)
Web translator (WebTRans)
Standard Indonesian Language Corpus (SILC)
Indonesian Language Dictionaries Project (KBBI)
English-Indonesia Parallel Corpus (INCI)
Speech recognition/synthesis system (Bandung Institute of
Technology/Telkom RDC/University of Indonesia)
Information retrieval (ITB and University of Indonesia)
Text/Image processing tools (Gadjah Mada University)
Computational lexicon (National Language Center)
Computational morphology (Atmajaya University)
Promotion of Language Technologies (2007)

National Language Congress XII in Solo,
introducing a toolkit to build speech databases
for endangered languages, and the Atmajaya
Language Workshop (June 2007) in Jakarta,
on promoting local computing policy and
speech technologies (both keynote speeches
by Dr. Hammam Riza)

Promotion of the Context Sensitive Dictionary
Project for a Speech Translation Corpus for the
Aceh Tsunami Region (Indonesian-Acehnese, bidirectional)
Activities in Machine Translation (2006-2007)

A rule-based Indonesian-English translator
(started in 2006) was launched to the market
in June 2007 by ITB

This translator is combined with English TTS
(Windows), and Indonesian TTS (proprietary)
Experiment of Statistical MT – using Pharaoh
decoder (Eng-Indo parallel corpus) by

Current Activities in Speech Tech
• Telkom RDC & BPPT collaboration on Speech
Recognition and Summarization
• Indonesia Goes Open Source (IGOS) speech
recognition system (funded by Ministry of
Research and Technology)
• Speech recognition system for Bahasa Indonesia
(University of Indonesia)
– Transcribing speech data from broadcast
TV and radio news
– Applications:
  • sending short message service (SMS) messages
  • IVR (health and tourism services)
• Research on “intonation by example” and an
“automatic prosody pattern extractor” using an
Artificial Neural Network (ANN)
• Text-to-Speech system for local languages
(ITB/UI)
100th Year of Bahasa Indonesia – National Language Year 2008

Series of events culminating in the International
Conference on Bahasa Indonesia (Oct 2008)



Importance of Indonesian – Its roles, functions in national life &
development (policy making, business, media, education)
Language planning (shaping change)
Six keynote speakers from AFNLP will be invited by
the Indonesian government throughout the year
Major Activities for 2008

Local Language Resource Projects (Language
Center)

Indonesian and Local Languages - Wordnet

MALINDO (Malaysia-Indonesia) joint projects

Speech to speech translation for Asian languages
(A-STAR)

Speech database Telkom RDC/BPPT (APT support)

Language Resources and Translation English-Indonesia (collaboration with PAN Localization)

Speech Corpus for Local Languages (Endangered
Languages) – using BLARK (ELDA)
Activities Plan for 2008-2009

Speech Recognition and Phrase-based Statistical Machine
Translation (SMT) system for bidirectional Indonesian-English
and Indonesian-Japanese

Mapping and SMT for Indonesian-Regional Languages
(Bahasa Nusantara) and for German, French, Chinese and
Arabic (cross border languages)

Information Retrieval (cross language speech retrieval)
– Searching and retrieving Indonesian speech data

Topic Detection and Tracking (TDT)
– Identifying topics in speech data collection
– Classifying new data to the existing topics in the collection

Speech Synthesis

Speech Summarization
– Summarize the Indonesian speech documents
E-dictionary project
National Language Center

Size & Comprehensiveness:
– 200,000 entries
– many subject areas are covered

Method:
– corpus-based
– primary data for the largest print dictionary,
Kamus Besar Bahasa Indonesia (KBBI), 3rd ed.

Usefulness:
– find the words you need
– definitions and examples are helpful

Users:
– writers, journalists, editors, scientists,
academics, teachers, students, business
people, lawyers, etc.

Echols & Shadily’s Eng-Ind. dictionary.
Indonesia has at least 13 major local languages,
each with at least one million speakers:
Javanese (75,200,000)
Sundanese (27,000,000)
Malay (20,000,000)
Madurese (13,694,000)
Minangkabau (6,500,000)
Batak (5,150,000)
Buginese (4,000,000)
Balinese (3,800,000)
Acehnese (3,000,000)
Sasak (2,100,000)
Makassarese (1,600,000)
Lampung (1,500,000)
Rejang (1,000,000)
ACEH – 32 local languages
EAST JAVA – 6 local languages
LOCAL & CROSS-BORDER LANGUAGES
[Chart: % Local Languages, % English, and % Other Cross-Border Languages (0% to 100%) for South East Asia: bn, id, kh, la, my, mm, ph, sg, th, tp, vn]
Note: Cross-Border Languages in Indonesia:
English, Arabic, Chinese, French,
German, Dutch, Japanese, etc.
Language Digital Divide
Language Preservation




Survey of indigenous local languages
Local computing policy will be
developed for major local languages
Endangered languages are identified
and preserved by means of ICT
Language resources collection for
official and major local languages
Thank You
Please send any comments to
hammam@iptek.net.id