AFNLP 2008 Meeting
Indonesia Country Report

Hammam Riza
hammam@iptek.net.id
Agency for the Assessment and Application of Technology (BPPT)
Ministry of Research and Technology
Republic of Indonesia

TOC
• Past Activities
• Activities in 2007
• Activities Plan 2008, 2009
• National Language Year 2008

Past NLP Research Projects in Indonesia
• Indonesian Text-To-Speech (BPPT, ITB, UI)
• GDA/MMA/Linguistic-DS MPEG-7 (Multimedia Annotation)
• Cross-Linguistic Portal (dictionaries, corpus, tools)
• Web translator (WebTRans)
• Standard Indonesian Language Corpus (SILC)
• Indonesian Language Dictionaries Project (KBBI)
• English-Indonesia Parallel Corpus (INCI)
• Speech recognition/synthesis system (Bandung Institute of Technology / Telkom RDC / University of Indonesia)
• Information retrieval (ITB and University of Indonesia)
• Text/Image processing tools (Gajah Mada University)
• Computational lexicon (National Language Center)
• Computational morphology (Atmajaya University)

Promotion of Language Technologies (2007)
• National Language Congress XII in Solo: introducing a toolkit to build speech databases for endangered languages
• Atmajaya Language Workshop (June 2007) in Jakarta: promoting local computing policy and speech technologies
• Both keynote speeches were given by Dr. Hammam Riza
• Promotion of Context Sensitive Dictionary Project for Speech Translation Corpus for Aceh Tsunami Region (Indonesian-Acehnese, bidirectional)

Activities in Machine Translation (2006-2007)
• Rule-based Indonesian-English translator (started in 2006) was launched to the market in June 2007 by ITB
• This translator is combined with English TTS (Windows) and Indonesian TTS (proprietary)
• Experiment in Statistical MT using the Pharaoh decoder (Eng-Indo parallel corpus) by

Current Activities in Speech Tech
• Telkom RDC & BPPT collaboration on Speech Recognition and Summarization
• Indonesia Goes Open Source (IGOS) speech recognition system (funded by Ministry of Research and Technology)
• Speech recognition system for Bahasa Indonesia (University of Indonesia)
  – Transcribing speech data that contains broadcast TV and radio news
  – Applications:
    • sending short message service (SMS)
    • IVR (health and tourism services)
• Research on “intonation by example” and “automatic prosody pattern extractor” using Artificial Neural Network (ANN)
• Text to Speech system for local languages (ITB/UI)

100th Year of Bahasa Indonesia – National Language Year 2008
• Series of events culminating in the International Conference on Bahasa Indonesia (Oct 2008)
• Importance of Indonesian: its roles and functions in national life and development (policy making, business, media, education)
• Language planning (shaping change)
• 6 keynote speakers from AFNLP will be invited by the Indonesian government throughout the year

Major Activities for 2008
• Local Language Resource Projects (Language Center)
• Indonesian and Local Languages - Wordnet
• MALINDO (Malaysia-Indonesia) joint projects
• Speech to speech translation for Asian languages (A-STAR)
• Speech database Telkom RDC/BPPT (APT support)
• Language Resources and Translation English Indonesia (collaboration with PAN Localization)
• Speech Corpus for Local Languages (Endangered Languages) – using BLARK (ELDA)

Activities Plan for 2008-2009
• Speech Recognition and Phrase-based Statistical Machine Translation (SMT) system for bidirectional Indonesian-English and Indonesian-Japanese
• Mapping and SMT for Indonesian-Regional Languages (Bahasa Nusantara) and for German, French, Chinese and Arabic (cross-border languages)
• Information Retrieval (cross-language speech retrieval)
• Topic Detection and Tracking (TDT)
• Searching and retrieving Indonesian speech data
• Identifying topics in speech data collection
• Classifying new data to the existing topics in the collection
• Speech Synthesis
• Speech Summarization
  – Summarize the Indonesian speech documents

E-dictionary project
• National Language Center
• Size & Comprehensiveness:
• Method: corpus-based, primary data for the largest print dictionary, Kamus Besar Bahasa Indonesia (KBBI), 3rd ed.
• Usefulness:
  – 200,000 entries
  – many subject areas are covered
  – find the words you need
  – definitions and examples are helpful
• Users: writers, journalists, editors, scientists, academics, teachers, students, business people, lawyers, etc.
• Echols & Shadily’s Eng-Ind. dictionary

Local Languages
Indonesia has at least 13 major local languages with at least one million speakers each:
• Javanese (75,200,000)
• Sundanese (27,000,000)
• Malay (20,000,000)
• Madurese (13,694,000)
• Minangkabau (6,500,000)
• Batak (5,150,000)
• Buginese (4,000,000)
• Balinese (3,800,000)
• Acehnese (3,000,000)
• Sasak (2,100,000)
• Makassarese (1,600,000)
• Lampung (1,500,000)
• Rejang (1,000,000)

ACEH – 32 local languages
EAST JAVA – 6 local languages

LOCAL & CROSS-BORDER LANGUAGES
[Chart: share (0% to 100%) of local languages, English, and other cross-border languages for South East Asia; countries shown: bn, id, kh, la, my, mm, ph, sg, th, tp, vn]
Note: Cross-Border Languages in Indonesia: English, Arabic, Chinese, French, German, Dutch, Japanese, etc.

Language Digital Divide

Language Preservation
• Survey of indigenous local languages
• Local computing policy will be developed for major local languages
• Endangered languages are identified and preserved by means of ICT
• Language resources collection for official and major local languages

Thank You
Please send any comments to hammam@iptek.net.id