ASIA-PACIFIC TELECOMMUNITY
2nd APT/ITU Conformance and Interoperability Workshop
(C&I-2)
26 August 2014, Bangkok, Thailand
Document: C&I-2/INP-12
26 August 2014
"Acceleration of R&D Towards Speech Translation Technologies in the Asia-Pacific Region by U-STAR"
Contact :
Chiori Hori
E-mail : chiori.hori@nict.go.jp
National Institute of Information and Communications Technology
(NICT), Japan
National Institute of Information and Communications Technology
Universal Communication Research Center
Spoken Language Communication Laboratory
Kyoto, Japan
Email: slc@khn.nict.go.jp
Research Target
Speech Interface: natural communication between human and human, and between human and machine
- Speech-to-Speech Translation: speech translation for people speaking different languages (English and Japanese in the diagram)
- Spoken Dialog: spoken dialog system with machine, on a smart phone
  Example exchange: "How can I get to NICT?" / "From Kyoto station"
- Modality Conversion: speech-to-text communication system (Japanese speech to Japanese text, transcribed speech)
Components shown: Speech Recognition, Dialog Management, Machine Translation, Text-to-Speech (synthesized speech)
Real-time Audio indexing: NICT audio indexing system
- Input: video data and speech data from public servers over the network
- Output: closed captions and indexes
Real-time indexing: speech transcription
Query-based retrieval: audio including queries, event categories,
speaker diarization (who spoke what and when)
Video categorization by topics
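The indexing features listed above (transcription, event categories, speaker diarization, query-based retrieval) can be pictured as a search over timestamped segments. The following is only a minimal sketch under stated assumptions: the `Segment` fields, the `search` function, and the sample data are hypothetical and are not taken from the NICT audio indexing system.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One indexed stretch of audio (all field names are hypothetical)."""
    video_id: str
    start_sec: float
    end_sec: float
    speaker: str      # diarization label: who spoke
    text: str         # transcription: what was said
    event: str = ""   # optional audio event category


def search(index, query="", event=""):
    """Return segments whose transcript contains `query` and whose
    event category equals `event`; an empty filter matches everything."""
    query = query.lower()
    return [
        seg for seg in index
        if query in seg.text.lower() and (not event or seg.event == event)
    ]


# Made-up sample index for illustration only.
index = [
    Segment("A", 0.0, 4.2, "spk1", "How can I get to NICT?"),
    Segment("A", 4.2, 7.9, "spk2", "From Kyoto station"),
    Segment("B", 12.0, 13.5, "", "", event="explosion"),
]

hits = search(index, query="kyoto")
print([(s.video_id, s.start_sec, s.speaker) for s in hits])
```

Keeping the speaker label and time span on every segment is what lets one query answer "who spoke what and when" without a second lookup.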
"To Create a World Without
Language Barriers" by
International Research Consortium
How to overcome language barriers?
- Many different languages in the world
- Overcoming language barriers is a long-held dream of mankind.
- Speech translation technology: breaking the language barriers
http://en.wikipedia.org/wiki/List_of_language_families
Multilingual Speech Translation
Speech-to-Speech Translation (S2ST): a means of communication between different language speakers
Speech Recognition (ASR)
- Input: Japanese speech 「ホテルの予約をお願いします.」 (hoteruno yoyakuo onegai...)
- Convert to Japanese phoneme sequence: "h", "o", "t", ...
- Convert to word sequence using lexicon and grammar: ホテルの / 予約を / お願いします.
Machine Translation (MT)
- Convert to English word sequence: 「ホテル」⇒ "a hotel", 「予約」⇒ "make a reservation for", 「お願いします」⇒ "please"
- Reorder word sequences according to English grammar (grammar example on the slide: "I go to school"):
  "a hotel" / "make a reservation for" / "please" ⇒ "please" / "make a reservation for" / "a hotel"
Speech Synthesis (TTS)
- Input: English text "Please make a reservation for a hotel"
- Select appropriate waveform for English text
Corpora
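The MT step on this slide (word-by-word conversion followed by reordering) can be sketched in a few lines. This is a toy illustration only: the phrase table reuses the three mappings from the slide, while the fixed `REORDER` rule and the `translate` function are hypothetical simplifications; an actual MT system learns both the mappings and the reordering from parallel corpora.

```python
# Toy phrase table using the three mappings shown on the slide.
PHRASE_TABLE = {
    "ホテル": "a hotel",
    "予約": "make a reservation for",
    "お願いします": "please",
}

# Hypothetical reordering rule for this one sentence pattern:
# Japanese order (object, action, request) -> English order (request, action, object).
REORDER = [2, 1, 0]


def translate(ja_words):
    """Translate word by word, then reorder into English word order."""
    en_phrases = [PHRASE_TABLE[w] for w in ja_words]
    reordered = [en_phrases[i] for i in REORDER]
    sentence = " ".join(reordered)
    return sentence[0].upper() + sentence[1:]


# Content words as a recognizer might emit them once particles are dropped
# (an assumption made for this sketch).
ja_words = ["ホテル", "予約", "お願いします"]
print(translate(ja_words))  # Please make a reservation for a hotel
```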
History of the International Consortium (1)
Network-based S2ST research by consortiums of C-STAR and A-STAR
Timeline shown: 1991-1993, 1999-2000, 2006-2009, 2010-2013
C-STAR (network-based S2ST)
- Japan, US, Germany (3 countries)
- + Korea, Italy, France, China, U.S., U.K., Switzerland, Sweden, India (9 countries)
A-STAR (network-based S2ST)
- Japan, China, Korea, Indonesia, Thailand, India (6 countries)
- + Vietnam, Singapore (2 countries)
Preparation for the U-STAR Research Activity
NICT (JP)
- Speech data for training acoustic models
- Parallel corpus and dictionary for training translation models from English to the target language
Speech data languages: French, Portuguese, Turkish, Japanese, English, Thai, Korean, German, Indonesian, Dutch, Chinese, Hungarian, Hindi, Polish, Vietnamese, Malay
Institutes shown:
Japan NICT
Indonesia BPPT
Vietnam IOIT
Pakistan KICS-UET
Mongolia NUM
France CNRS-LIMSI
UK University of Sheffield
Belgium ESAT
Korea ETRI
China CASIA
Singapore I2R
Nepal LTK
Sri Lanka UCSC
Portugal INESC-ID
Germany TUM
Hungary BME-TMIT
Thailand NECTEC
India CDAC
Bhutan DITT
Mongolia MUST
Philippines UPD
Turkey TUBITAK
Germany UUlm
Hungary PPKE
Speech-to-Speech translation
S2ST Application on Smartphone
MCML-based Communication libraries (CMLIB)
CMLIB is implemented for the U-STAR S2ST servers
Diagram: S2ST Client (CMLIB) <-> U-STAR ASR/MT/TTS servers (CMLIB on each S2ST server)
Network-based Speech-to-Speech Translation (S2ST)
Communication between Different Language Speakers
Japanese Speaker (S2ST Client) <-> Network <-> Thai Speaker (S2ST Client)
S2ST Servers on the network:
- Japanese to Thai: ASR Module (Japanese) -> MT Module (Japanese → Thai) -> TTS Module (Thai)
- Thai to Japanese: ASR Module (Thai) -> MT Module (Thai → Japanese) -> TTS Module (Japanese)
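The ASR -> MT -> TTS chain between the two clients can be sketched with in-process stand-ins for the remote modules. Everything here is an assumption made for illustration: the registries, the `s2st` function, and the placeholder outputs are hypothetical, and a deployed client would reach the servers over the network (for U-STAR, through MCML-based libraries) instead of calling local functions.

```python
from typing import Callable, Dict, Tuple

# Hypothetical stand-ins for remote modules. Each registry maps a language
# (or a language pair) to a callable that fakes the module's output.
ASR: Dict[str, Callable[[bytes], str]] = {
    "ja": lambda audio: "ホテルの予約をお願いします.",
}
MT: Dict[Tuple[str, str], Callable[[str], str]] = {
    ("ja", "th"): lambda text: "[th] " + text,  # placeholder, not a real translation
}
TTS: Dict[str, Callable[[str], bytes]] = {
    "th": lambda text: ("<synthesized:" + text + ">").encode("utf-8"),
}


def s2st(audio: bytes, src: str, tgt: str) -> bytes:
    """Chain ASR(src) -> MT(src, tgt) -> TTS(tgt), failing early if a
    required module is not registered."""
    if src not in ASR:
        raise LookupError(f"no ASR module for {src}")
    if (src, tgt) not in MT:
        raise LookupError(f"no MT module for {src}->{tgt}")
    if tgt not in TTS:
        raise LookupError(f"no TTS module for {tgt}")
    text = ASR[src](audio)
    translated = MT[(src, tgt)](text)
    return TTS[tgt](translated)


speech_out = s2st(b"\x00\x01", "ja", "th")
print(speech_out.decode("utf-8"))
```

The reverse direction (Thai to Japanese) would need its own three registry entries, which mirrors the two module rows in the diagram.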
Initiation of Standardization from Asia
- A-STAR Speech-to-speech Translation Demo in 8 Countries (July 2009)
- APT ASTAP Meeting (August 2009): ASTAP 16 Plenary Session
- U-STAR MOU (July 2010)
From Asia to the World
Discussion to develop the standardization activity more internationally, not limited to the Asia-Pacific region.
-> Approved to raise the standardization draft from APT to ITU-T
A-STAR to U-STAR
The Universal Speech Translation Advanced Research Consortium is an international research collaboration entity aiming to break language barriers around the world through network-based speech-to-speech translation (S2ST) technologies.
History of the International Consortium (2)
Network-based S2ST research by U-STAR
Timeline shown: 1991-1993, 1999-2000, 2006-2009, 2010-2013
U-STAR (network-based S2ST), following C-STAR and A-STAR, with joining members:
- + Bhutan, Mongolia, Nepal, Pakistan, Philippines, Sri Lanka (6 countries)
- + France, Portugal, Turkey, U.K., Germany, Hungary, Poland, Belgium, Ireland (9 countries)
From ASTAP To ITU-T
Recommendations for network-based speech-to-speech translation were published by HP2/SG16.
Recommendations (2010):
- F.745: Functional Requirements for Network-based S2ST
  http://www.itu.int/rec/T-REC-F.745-201010-I
- H.625: Architectural Requirements for Network-based S2ST
  http://www.itu.int/rec/T-REC-H.625-201010-I
U-STAR Network-based Speech Translation
S2ST servers located all over the world are connected through the network.
The orange-colored areas indicate the countries whose official languages are supported by U-STAR's apps.
29 research institutes from 24 countries/regions
Preparation for the U-STAR Research Activity
NICT (JP)
- Parallel corpus and dictionary for training translation models from English to the target language
- Speech data for training acoustic models
Speech data languages: French, Portuguese, Turkish, English, German, Dutch, Hungarian, Polish; Japanese, Korean, Thai, Indonesian, Chinese, Hindi, Vietnamese, Malay
U-STAR members (coverage of the official languages):
Japan NICT
Indonesia BPPT
Vietnam IOIT
Pakistan KICS-UET
Mongolia NUM
Korea ETRI
China CASIA
Singapore I2R
Nepal LTK
Sri Lanka UCSC
Thailand NECTEC
India CDAC
Bhutan DITT
Mongolia MUST
Philippines UPD
Client App
- 27 MT servers, 17 ASR servers, 14 TTS servers
- Chat system using speech translation on a smartphone (example of Hindi)
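The server counts on this slide differ by component (27 MT, 17 ASR, 14 TTS), which suggests that some languages are covered by translation without speech input or speech output. One way a client could choose its input and output modes per language is sketched below. This is an assumption for illustration, not the documented behavior of the U-STAR app: the `io_modes` function, the text fallback, and the placeholder language sets are all hypothetical.

```python
def io_modes(lang, asr_langs, mt_langs, tts_langs):
    """Return (input_mode, output_mode) for `lang`, or None if the language
    has no MT coverage. Text is the assumed fallback when no ASR or TTS
    server covers the language."""
    if lang not in mt_langs:
        return None
    input_mode = "speech" if lang in asr_langs else "text"
    output_mode = "speech" if lang in tts_langs else "text"
    return input_mode, output_mode


# Placeholder language identifiers; these sets are made up and do not
# describe the actual U-STAR server coverage.
ASR_LANGS = {"lang_a", "lang_b"}
MT_LANGS = {"lang_a", "lang_b", "lang_c"}
TTS_LANGS = {"lang_a"}

for lang in ("lang_a", "lang_b", "lang_c", "lang_d"):
    print(lang, io_modes(lang, ASR_LANGS, MT_LANGS, TTS_LANGS))
```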
ASR using the Collected Speech
Fig. Evaluation of Model Adaptation: Japanese (left) and Thai (right)
Vertical axis: WER (%). Conditions in the legends: Baseline, AM (USV), AM+LM (USV), AM (USV)+Web, AM+LM (SV), AM (SV)+Web.
Accuracy improvements
using the collected speech
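The figure reports word error rate (WER). As a reminder of how that metric is commonly computed, here is a minimal sketch using the standard definition, (substitutions + deletions + insertions) divided by the number of reference words, via edit distance over words. It assumes whitespace-separated tokens, so Japanese or Thai text would need word segmentation first; the sample sentences are made up, and this is not the evaluation script used for the figure.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: 100 * (S + D + I) / N, where N is the
    number of reference words, computed via Levenshtein distance over words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One deletion ("a") and one substitution ("a" -> "the") over 7 reference words.
print(wer("please make a reservation for a hotel",
          "please make reservation for the hotel"))
```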
VoiceTra4U on Android
Data collection through iPhone and Android phone
applications for speech translation
© NICT
Interoperable Speech Communication Platform for
1) human-to-human and 2) human-to-machine
Components: Client, ASR Servers, MT Servers, TTS Servers, DM Server, Back-End Server
Connected via MCML (ITU-T Standardized Protocols)
Back-end examples:
- Online Shopping / Booking Systems, e.g. hotels, stores, etc.
- Emergency Systems, e.g. hospitals, police departments, etc.
- Educational Systems, e.g. VoIP lessons, schools
Language and Domain Portability for Speech Communication Tool
using ITU-T standardized S2ST protocol
Multilingual Communication Project
2020 Olympics in Tokyo
Speech-to-Speech Translation
- 17 languages for ASR, 27 for MT, and 14 for TTS
- Chat for up to 5 people
Real-Time Indexing (NICT audio indexing system; inputs: video data, speech)
- Searching scenes with the sound of "explosion" (audio event): Video A: 20 sec, Video B: 35 sec
- Scene of "Riots"
Spoken Dialog System
- "How can I get to the stadium?" / "Which game will you see?" / "From Tokyo station?"