IBM Presentations: Smart Planet Template

Synthesized Audio Descriptions
Hironobu Takagi, Chieko Asakawa
IBM Research – Tokyo
© 2010 IBM Corporation
National Women's Education Center - July 6th, 2010.
IBM History of Accessibility
1960s Talking Typewriter
1975 1403 Braille Printer
1984 Talking 3270 Terminal
1988 ScreenReader/DOS
1990 VoiceType™
1994 Screen Magnifier™/2
1997 Home Page Reader
1998 ViaVoice®
1999 Home Page Reader (Japanese, Italian, French, German, Spanish, US English, UK English)
2000 Accessibility Center
2004 aDesigner
2007 aiBrowser for Multimedia
2007 Eclipse Accessibility Tools Framework
2008 Social Accessibility
2009 ARIA (Accessible Rich Internet Applications)
Status of Audio Descriptions in Japan
Movies (Japanese movies, 2008; from NPO Media Access Support Center)
– With captions: 12.0%
– With audio descriptions: 0.9%
TV (TV programs, 2008)
– With captions (*1): Public TV 49.4%, Private TV 42.3%
– With audio descriptions (*2): Public TV 5.6%, Private TV 0.4%
*1: Ministry of Internal Affairs and Communications (2008)
*2: NICT: National Institute of Information and Communications Technology
Captions and Audio Descriptions for TV Programs
[Figure: line chart of the ratio of TV programs with captions and audio descriptions, 2001 to 2008 (vertical axis 0% to 60%). Series: Captions - Public; Captions - Private; Audio descriptions - Public; Audio descriptions - Public (Education); Audio descriptions - Private. Based on data from MIC and NICT.]
Problems: Workload and Cost
[Figure: bar chart comparing the workload of captions (transcribing) with that of audio descriptions (recording and transcribing).]
▪ Recording an audio description calls for a skilled narrator and a good recording environment.
▪ Writing an audio description script requires special expertise to describe the scenes between dialogues and scene changes.
History of Text-to-speech Engines
[Figure: timeline from 1980 to 2010]
1983 DecTalk
1985 IBM
1996 ProTalker (IBM)
2004 Super Voice (IBM)
2008 Emotional TTS (IBM)
Possible Reduction of Workload
[Figure: bar chart comparing the workload (recording and transcribing) of current audio descriptions with that of synthesized audio descriptions. The recording workload is reduced by synthesis; the transcribing workload is reduced by tool support.]
Acceptance Ratio (United States)
▪ Method: Online survey
▪ Participants: 236 (39 low-vision, 197 blind)
▪ Genre: Education and documentary
▪ Voice quality: Human and TTS (Heather)
[Figure: stacked bar chart (0% to 100%) of ratings for Set 1 to Set 4 on a five-point scale: Uncomfortable, Slightly Uncomfortable, Neutral, Acceptable, Comfortable.]
Consistently, 70% to 80% of participants gave a rating better than neutral.
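Research and development of standard specifications for authoring audio descriptions and captions for people with visual or hearing impairments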
Video Accessibility Project: Goals
▪ Prove the feasibility of text-based audio descriptions via user studies.
– Work with professional teams for audio descriptions
– Japan: IBM with CAP and content from NHK
– U.S.: WGBH
▪ Create an open source platform for audio descriptions and captions
– Authoring tools and players
– Captions and text-based audio descriptions
– Based on Eclipse.org Accessibility Tools Framework (ACTF)
▪ Contribute to standardization of Internet media accessibility
– Focus on “missing markups” in the existing standards.
– Remain neutral toward existing standards.
– HTML5 is the primary target.
Supported by the Japanese government agency NICT (National Institute of Information and Communications Technology)
Thank you!
ACTF Script Editor
▪ Authoring tool specialized for audio descriptions.
▪ Flexible import and export of various formats.
▪ Planned for release as open source in March.
Cases of Audio Guides in Museums and on the Stage
▪ Museums: Audio guides are widely used in museums and art museums. (Their main purpose is not to support visually impaired people but to help all visitors learn about the exhibits.)
– Examples of audio guide providers:
• National Museum of Nature and Science, Tokyo
• The National Museum of Western Art
• Hiroshima Museum of Art
• Osaka Museum of Natural History
• Tokyo Museum of Fire Department
• Shimane Museum of Ancient Izumo
– Almost every museum in Japan provides an audio guide.
– Audio guide equipment is generally purpose-built, with prerecorded voice supplied by the manufacturer. A newer approach uses the NINTENDO DS, with the content downloaded onto the device at the museum.
▪ The stage: Audio guides are provided mainly by small drama groups.
– Examples of audio guide providers:
• The drama group "Bakkari-Bakkari" provides an audio guide once per performance run.
• A drama group in the city of Kawasaki, Kanagawa Pref.
• The drama group "DORA"
– For captions, SHIKI THEATRE COMPANY is one example of a provider. Very few large-scale theatre productions provide audio guides.
Laws and Regulations
▪ 1993 Act on Advancement of Facilitation Program for Disabled Persons' Use of Telecommunications and Broadcasting Services, with a View to Enhance Convenience of Disabled Persons
▪ 1997 MIC defined a goal to “provide captions to all TV programs by 2007”
▪ 1998 Broadcast Law
– Article 3-2 (4)
– Any broadcaster shall, in compiling the broadcast programs for domestic broadcasting, provide as many broadcasting programs as possible which provide voices and other sounds to explain about transient images of fixed or moving objects for blind persons, and providing characters or patterns to explain about voices and other sounds for deaf persons.
▪ 2007 Signed the “Convention on the Rights of Persons with Disabilities”
▪ 2010 New JIS (Japanese Industrial Standard) for Web Accessibility
– Technical guidelines are fully harmonized with WCAG 2.0
ACTF aiBrowser
1 Direct audio control
▪ Allow users to raise or lower the volume, stop or play, and control audio speed by using simple keyboard commands.
2 User interface simplification
▪ Structurally simplify interfaces by converting dynamic visual interfaces into static text-based interfaces
▪ Dynamically add alternative texts to images and buttons
3 Audio descriptions with text
▪ Infrastructure to provide video descriptions at low cost
Status of Audio Descriptions in Japan
Movies (Japanese movies, 2008; from NPO Media Access Support Center)
– With captions: 12.0%
– With audio descriptions: 0.9%
TV (TV programs, 2008)
– With captions (*1): Public TV 49.4%, Private TV 42.3%
– With audio descriptions (*2): Public TV 5.6%, Private TV 0.4%
*1: Ministry of Internal Affairs and Communications (2008)
*2: NICT: National Institute of Information and Communications Technology
Internet (team investigation)
– With captions: 0.2% of video content in the Open Courseware project (2 among 1,474)
– With audio descriptions: 0.0%. Popular video sharing services and educational online videos were examined, but no videos with audio descriptions were found (except for videos prepared as examples of audio descriptions).
Analysis of Standards and Possible Focus
Layers of markup (vocabulary lists) for text-based audio descriptions, with related existing specifications:
▪ Personalization
▪ Association with video contents, multilingual, etc.: Mozilla <itext>, etc.
▪ Index structure for video (scenes and chapters, etc.): each video format has its own specifications (DVD, MPEG, etc.)
▪ Unique for audio descriptions (extended, audio control, block, etc.): FOCUS AREA!
▪ Voice styles and emotional expressions: W3C SSML, W3C Emotion ML, etc.
▪ Description (textual information) and addressing (timing): SRT, W3C SMIL, W3C TT DFXP (flexible addressing)
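The voice-style layer on this slide names W3C SSML. As a rough sketch only, and not the project's actual markup, one description sentence could be wrapped in SSML as follows. The function name `description_to_ssml` and its parameters are hypothetical; the `speak`, `break`, and `prosody` elements and the `time` and `rate` attributes are taken from the SSML specification.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def description_to_ssml(text, rate="medium", pause_ms=0, lang="en-US"):
    """Wrap one audio description sentence in a minimal SSML document.

    Hypothetical helper for illustration: rate is an SSML prosody rate
    label such as "slow", "medium", or "fast"; pause_ms adds a leading
    silence before the description is spoken.
    """
    speak = ET.Element("speak", {
        "version": "1.0",
        "xmlns": SSML_NS,
        "xml:lang": lang,
    })
    if pause_ms > 0:
        # Optional silence so the description does not start abruptly.
        ET.SubElement(speak, "break", {"time": f"{pause_ms}ms"})
    prosody = ET.SubElement(speak, "prosody", {"rate": rate})
    prosody.text = text
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(description_to_ssml("A woman folds a shirt.", rate="fast", pause_ms=300))
```

A faster `rate` is one way a description might be squeezed into a short gap between dialogues; whether a given TTS engine honors it is engine-dependent.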
2nd study: Level of Description
[Figure: rate of correct answers (0% to 100%) for each level of description (Normal, Extended), heard once or twice. The only surviving data label is 30%.]
Using the extended description and listening twice both improved comprehension.
Difficulties in Online Videos
▪ News
▪ Entertainment
▪ E-Learning
▪ Historical Videos
▪ Consumer-Generated Videos
Now is the time to create a new technical framework for audio descriptions!
Prior Projects
▪ e-Inclusion project in Canada, supported by Canadian Heritage
– CRIM (Centre de recherche informatique de Montréal)
– Four-year project completed this year
– Authoring tool and playback tool
▪ LiveDescribe by Ryerson University
– Community-based authoring system
– Authoring tool and playback tool
▪ NHK Research
– Prototyped and tested TTS-based audio descriptions
▪ aiBrowser
– Developed by IBM Research and contributed to Eclipse.org
– Audio descriptions with Flash, QuickTime, and Windows Media Player
▪ Other trials
– HTML5 + Live Region demo (Firefox team)
– WebShake
• Japanese online caption provider that prototyped TTS-based audio descriptions.
– ACAV, etc.
Distribution Flexibility
Model                            Delivered as   Voice quality   Authoring cost   System cost
Human voice (current model)      Audio          High            High             High
Pre-recorded synthesized audio   Audio          Low*            Low              High**
Server-side synthesizer          Audio          Low*            Low              High
Client-side synthesizer          Text           Lowest          Low              Low***
* Server-side synthesis is better than client-side synthesis.
** The systems for human voices can be reused.
*** Client-side software support is required.
Experimental Results (Japan)
▪ 1st study (Sep 2009)
– 3 blind or visually impaired participants
– Face-to-face, one-to-one sessions
– Focused on the voice quality, level of description, and speech speed
▪ 2nd study (Feb 2010)
– 24 blind or visually impaired participants
– Face-to-face, small group sessions
– Consisted of 4 sub-studies for long-term listening, expressive voices, describer expertise, and level of description
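Current Status of Captions and Audio Descriptions in Japan
Movies (Japanese movies released in 2008; from NPO Media Access Support Center materials)
– Ratio released with captions: 12.0%
– Ratio released with audio descriptions: 0.9%
Broadcasting
– Ratio of captioned broadcast time to total broadcast time in fiscal 2008 (*1): NHK General TV 49.4%, Tokyo commercial broadcasters 42.3%
– Ratio of described broadcasts on terrestrial TV in fiscal 2008 (*2): NHK General TV 5.6%, Tokyo commercial broadcasters 0.4%
*1: From the Ministry of Internal Affairs and Communications press release "Results of Captioned Broadcasting, etc. in Fiscal 2008"
*2: From NICT (National Institute of Information and Communications Technology) materials
Internet (independent survey within this project)
– Captions: 0.2% of OpenCourseWare (educational content) videos, 2 of 1,417.
– Audio descriptions: 0.0%. A sampling survey of major video distribution sites and educational content found no videos with audio descriptions.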
1st study: Results
[Figure: two bar charts of effectiveness scores (0 to 100), one for "drama" videos and one for "cooking" videos, comparing Human, Traditional TTS, and Modern TTS voices, each without audio descriptions (w/o AD) and with audio descriptions (with AD).]
The descriptions greatly improved the user experience regardless of the voice quality.
The participants’ comments indicated that Modern TTS was almost comparable to a human voice, though the human voice was still preferred.
2nd study: Sub-studies
1. Long-term listening
– Assess whether TTS-based descriptions are acceptable for listening to full-length programs
– Target videos: cartoon (comedy), drama (tragedy), documentary
2. Expressive voices
– Determine whether expressive TTS improves the user experience
– Target videos: cartoon (comedy), drama (tragedy)
3. Describer expertise
– Assess how the describer's expertise affects understanding
– Target video: public service announcement (warning about fraud)
4. Level of description
– Assess how the level of description and repeated listening affect understanding
– Target video: instructional program (how to fold and store clothing)
2nd study
2nd study: Long-term Listening
[Figure: histogram of effectiveness scores (1 to 5; frequency axis 0 to 20) for each video category: Cartoon (Comedy), Drama (Tragedy), Documentary.]
TTS-based descriptions were generally acceptable for full-length programs.
In the comments, the documentary film received the highest evaluation, but that was not clear from the effectiveness scores.
2nd study: Describer Expertise
[Figure: histogram of effectiveness scores (1 to 5; frequency axis 0 to 12) for each describer expertise and level of description: Expert (Normal), Expert (Extended), Novice (Normal), Novice (Extended).]
Novice (Normal) was not preferred (score: 3.0).
Novice (Extended) was comparable (score: 4.3) to the expert descriptions (score: 4.3 for normal, 4.6 for extended).
Typical Client-side TTS Setting
[Figure: architecture diagram with the components Online Video, Website, Video Player, Script Editor, Audio Description Script, and Metadata Repository.]
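To make the client-side setting more concrete, here is a minimal sketch of how a video player might decide when to hand description text to a local synthesizer, and whether a description overflows the silent gap so that the video would have to pause (an "extended" description). Everything here is an assumption for illustration: the `Cue` fields, the `plan_cue` and `cues_due` helpers, and the characters-per-second estimate are hypothetical and are not the ACTF script format or player logic.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # seconds into the video where the description should begin
    gap: float    # seconds of silence available before the next dialogue
    text: str     # description text handed to the client-side TTS engine


def plan_cue(cue, chars_per_second=15.0):
    """Estimate speech length and how long the video would need to pause."""
    speak_for = len(cue.text) / chars_per_second  # crude duration estimate
    pause_for = max(0.0, speak_for - cue.gap)     # overflow beyond the gap
    return {"start": cue.start, "speak_for": speak_for,
            "pause_video_for": pause_for, "extended": pause_for > 0}


def cues_due(cues, previous_time, current_time):
    """Return cues whose start time was crossed since the last player tick."""
    return [c for c in cues if previous_time < c.start <= current_time]


if __name__ == "__main__":
    script = [Cue(10.0, 3.0, "She folds the sleeves inward."),
              Cue(20.0, 1.0, "She rolls the shirt tightly and places it upright in the drawer.")]
    for cue in cues_due(script, 9.5, 20.0):
        print(plan_cue(cue))
```

In a real player the duration estimate would more plausibly come from the synthesizer itself, since speaking rate varies by engine, voice, and language.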
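▪ W3C Web Content Accessibility Guidelines 2.0 (Recommendation, December 2008)
– 1.2.5 Audio Description (Prerecorded) (Level AA)
– 1.2.7 Extended Audio Description (Prerecorded) (Level AAA)
▪ Japan: Revised Copyright Act (enacted June 2009, in force January 1, 2010)
▪ Japan: JIS X 8341-3:2010 (publication expected around June 2010)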