ASIA-PACIFIC TELECOMMUNITY
2nd APT/ITU Conformance and Interoperability Workshop (C&I-2)
26 August 2014, Bangkok, Thailand
Document: C&I-2/INP-12
26 August 2014

"Acceleration of R&D Towards Speech Translation Technologies in the Asia-Pacific Region by U-STAR"

Contact: Chiori Hori
E-mail: chiori.hori@nict.go.jp
National Institute of Information and Communications Technology (NICT), Japan

Chiori Hori
Spoken Language Communication Laboratory
National Institute of Information and Communications Technology

Spoken Language Communication Laboratory
Universal Communication Research Center
National Institute of Information and Communications Technology
Kyoto, Japan
Email: slc@khn.nict.go.jp

Research Target
Speech interface for natural communication: human and human, human and machine
- Speech-to-speech translation: speech translation for people speaking different languages (English and Japanese)
- Spoken dialog: spoken dialog system with machine
- Modality conversion: speech-to-text communication system (Japanese speech to Japanese text)
- Real-time audio indexing of video data
[Diagram labels: smartphone, network, public server; speech recognition, dialog management, machine translation, text-to-speech; speech data, transcribed speech, synthesized speech; example utterances "How can I get to NICT?" and "From Kyoto station"]

NICT audio indexing system
[Diagram labels: video data, public server, closed captions and indexes]
- Real-time indexing: speech transcription
- Query-based retrieval: audio including queries, event categories, speaker diarization (who spoke what and when)
- Video categorization by topics

"To Create a World Without Language Barriers" by International Research Consortium

How to overcome language barriers?
Many different languages in the world
Overcoming the language barriers is a long-held dream of mankind.
Speech translation technology
Breaking the language barriers
http://en.wikipedia.org/wiki/List_of_language_families

Multilingual Speech Translation
Speech-to-Speech Translation (S2ST): a means of communication between different language speakers

Speech Recognition (ASR)
- Input (Japanese speech): 「ホテルの予約をお願いします。」 (hoteruno yoyakuo onegai...)
- Convert to Japanese phoneme sequence: "h", "o", "t", …
- Convert to word sequence using lexicon and grammar: ホテルの / 予約を / お願いします

Machine Translation (MT)
- Convert to English word sequence: 「ホテル」⇒ "a hotel", 「予約」⇒ "make a reservation for", 「お願いします」⇒ "Please"
- Reorder word sequences according to English grammar: "a hotel" "make a reservation for" "please" ⇒ "Please make a reservation for a hotel"

Speech Synthesis (TTS)
- Select waveform appropriate for English text: "Please make a reservation for a hotel"

Corpora (example English text: "I go to school")

History of the International Consortium (1)
Network-based S2ST research by consortiums of C-STAR and A-STAR
[Timeline, years shown: 1991, 1992, 1993, 1999, 2000, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013]
- C-STAR (network-based S2ST): Japan, US, Germany (3 countries); + Korea, Italy, France, China, U.S., U.K., Switzerland, Sweden, India (9 countries)
- A-STAR (network-based S2ST): Japan, China, Korea, Indonesia, Thailand, India (6 countries); + Vietnam, Singapore (2 countries)

Preparation for the U-STAR Research Activity
- Speech data for training acoustic models
- Parallel corpus and dictionary for training translation models from English to the target language
Speech data languages (diagram labels: NICT, JP): French, Portuguese, Turkish, Japanese, English, Thai, Korean, German, Indonesian, Dutch, Chinese, Hungarian, Hindi, Polish, Vietnamese, Malay
Members:
- Japan: NICT
- Indonesia: BPPT
- Vietnam: IOIT
- Pakistan: KICS-UET
- Mongolia: NUM
- France: CNRS-LIMSI
- UK: University of Sheffield
- Belgium: ESAT
- Korea: ETRI
- China: CASIA
- Singapore: I2R
- Nepal: LTK
- Sri Lanka: UCSC
- Portugal: INESC-ID
- Germany: TUM
- Hungary: BME-TMIT
- Thailand: NECTEC
- India: CDAC
- Bhutan: DITT
- Mongolia: MUST
- Philippines: UPD
- Turkey: TUBITAK
- Germany: UUlm
- Hungary: PPKE

Speech-to-Speech translation
- S2ST servers
- S2ST application on smartphone
- MCML-based communication libraries (CMLIB)
CMLIB is implemented for the U-STAR S2ST servers.
[Diagram: the S2ST client and the U-STAR ASR/MT/TTS servers each communicate through CMLIB]

Network-based Speech-to-Speech Translation (S2ST)
Communication between different language speakers
[Diagram: Japanese speaker (S2ST client) <-> network <-> Thai speaker (S2ST client)]
- Japanese to Thai: S2ST servers with ASR module (Japanese), MT module (Japanese → Thai), TTS module (Thai)
- Thai to Japanese: S2ST servers with ASR module (Thai), MT module (Thai → Japanese), TTS module (Japanese)

Initiation of Standardization from Asia
- A-STAR Speech-to-speech Translation Demo in 8 Countries (July 2009)
- APT ASTAP Meeting (August 2009)
From Asia to the World
- ASTAP 16 Plenary Session
- U-STAR MOU (July 2010)
Discussion to develop the standardization activity more internationally, not limited to the Asia-Pacific region.
-> Approved to raise the standardization draft from APT to ITU-T

A-STAR to U-STAR
The Universal Speech Translation Advanced Research Consortium (U-STAR) is an international research collaboration entity aiming to break language barriers around the world through network-based speech-to-speech translation (S2ST) technologies.
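
The ASR, MT and TTS chain that the S2ST servers run can be sketched in a few lines of Python, using the hotel reservation example from the S2ST slide. This is a minimal illustration only: the function names, the phrase table and the "reverse the phrase order" rule are assumptions made for this example, and the sketch does not model MCML, CMLIB or any network communication.

```python
from typing import List, Tuple

# Toy phrase table built from the slide example (hypothetical data structure).
# A real MT server would use models trained on a parallel corpus.
PHRASE_TABLE = {
    "ホテルの": "a hotel",
    "予約を": "make a reservation for",
    "お願いします": "please",
}


def toy_asr(audio_segments: List[str]) -> List[str]:
    """Stand-in for ASR: the 'audio' is already a list of segmented words."""
    return list(audio_segments)


def toy_mt(words: List[str]) -> str:
    """Translate phrase by phrase, then reorder for English.

    Reversing the phrase order happens to work for this one sentence;
    it is not a general reordering model.
    """
    phrases = [PHRASE_TABLE[w] for w in words]
    sentence = " ".join(reversed(phrases))
    return sentence[0].upper() + sentence[1:]


def toy_tts(text: str) -> bytes:
    """Stand-in for TTS: returns UTF-8 bytes instead of a waveform."""
    return text.encode("utf-8")


def translate_speech(audio_segments: List[str]) -> Tuple[str, bytes]:
    """Run the three stages in sequence: ASR -> MT -> TTS."""
    words = toy_asr(audio_segments)
    text = toy_mt(words)
    return text, toy_tts(text)


if __name__ == "__main__":
    text, audio = translate_speech(["ホテルの", "予約を", "お願いします"])
    print(text)
```

In the network-based setting each stage would run on a separate server, so the three function calls above would become requests to an ASR server, an MT server and a TTS server.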
History of the International Consortium (2)
Network-based S2ST research by U-STAR
[Timeline, years shown: 1991, 1992, 1993, 1999, 2000, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013]
- C-STAR (network-based S2ST): Japan, US, Germany (3 countries); + Korea, Italy, France, China, U.S., U.K., Switzerland, Sweden, India (9 countries)
- A-STAR (network-based S2ST): Japan, China, Korea, Indonesia, Thailand, India (6 countries); + Vietnam, Singapore (2 countries)
- U-STAR (network-based S2ST): + Bhutan, Mongolia, Nepal, Pakistan, Philippines, Sri Lanka (6 countries); + France, Portugal, Turkey, U.K., Germany, Hungary, Poland, Belgium, Ireland (9 countries)

From ASTAP To ITU-T
Recommendations for network-based speech-to-speech translation were published by HP2/SG16.

Recommendation (2010)   Title                                               URL
F.745                   Functional Requirements for Network-based S2ST      http://www.itu.int/rec/T-REC-F.745-201010-I
H.625                   Architectural Requirements for Network-based S2ST   http://www.itu.int/rec/T-REC-H.625-201010-I

U-STAR Network-based Speech Translation
S2ST servers located all over the world are connected through the network.
[Map: the orange-colored areas indicate the countries whose official languages are supported by U-STAR's apps.]
29 research institutes from 24 countries/regions

Preparation for the U-STAR Research Activity
- Parallel corpus and dictionary for training translation models from English to the target language
- Speech data for training acoustic models
Speech data: French, Portuguese, Turkish, English, German, Dutch, Hungarian, Polish
Speech data (JP, NICT): Japanese, Korean, Thai, Indonesian, Chinese, Hindi, Vietnamese, Malay
U-STAR members, coverage of the official languages:
- Japan: NICT
- Indonesia: BPPT
- Vietnam: IOIT
- Pakistan: KICS-UET
- Mongolia: NUM
- Korea: ETRI
- China: CASIA
- Singapore: I2R
- Nepal: LTK
- Sri Lanka: UCSC
- Thailand: NECTEC
- India: CDAC
- Bhutan: DITT
- Mongolia: MUST
- Philippines: UPD

Client App
27 MT servers, 17 ASR servers, 14 TTS servers
Chat system using speech translation on a smartphone
[Screenshot: example of Hindi]

ASR using the Collected Speech
[Fig. Evaluation of Model Adaptation: Japanese (left) and Thai (right). Vertical axis: WER (%); one panel spans 20 to 34, the other 30 to 70. Legend text as extracted: Baseline, AM+LM (USV), AM (USV)+Web, AM (USV), AM+LM (SV), AM (SV)+Web.]
Accuracy improvements using the collected speech

VoiceTra4U on Android
Data collection through iPhone and Android phone applications for speech translation
© NICT

Interoperable Speech Communication Platform for 1) human-to-human and 2) human-to-machine
[Diagram: client, ASR servers, MT servers, TTS servers, DM server and back-end server connected by MCML (ITU-T standardized protocols)]
Back-end systems:
- Online shopping / booking systems, e.g. hotels, stores, etc.
- Emergency systems, e.g. hospitals, police departments, etc.
- Educational systems, e.g. VoIP lessons, schools
Language and domain portability for speech communication tools using the ITU-T standardized S2ST protocol

Multilingual Communication Project
2020 Olympics in Tokyo
Speech-to-Speech Translation
- 17 languages for ASR, 27 for MT, and 14 for TTS
- Chat for up to 5 people
Real-Time Indexing (NICT audio indexing system)
- Searching video scenes by audio event, e.g. scenes with the sound of "explosion" (Video A: 20 sec, Video B: 35 sec); scene of "Riots"
Spoken Dialog System
- "How can I get to the stadium?"
- "Which game will you see?"
- "From Tokyo station?"
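
For reference, the word error rate (WER) used in the model adaptation figure is conventionally computed as the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below is a generic illustration of that definition, not the evaluation script used for the reported results. It assumes whitespace-separated words; Japanese and Thai text would first need word segmentation, which is not shown. The function name and example sentences are made up for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance from i reference words to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "please make a reservation for a hotel"
    hyp = "please make reservation for the hotel"
    print(f"WER = {word_error_rate(ref, hyp):.1f}%")
```

Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference.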