Synthesized Audio Descriptions
Hironobu Takagi, Chieko Asakawa
IBM Research – Tokyo
National Women's Education Center, July 6th, 2010
© 2010 IBM Corporation

IBM History of Accessibility
1960s  Talking Typewriter
1975  1403 Braille Printer
1984  Talking 3270 Terminal
1988  ScreenReader/DOS
1990  VoiceType™
1994  Screen Magnifier™/2
1997  Home Page Reader
1998  ViaVoice®
2000  Accessibility Center
2004  aDesigner
2007  aiBrowser for Multimedia
2007  Eclipse Accessibility Tools Framework
2008  Social Accessibility
2009  ARIA (Accessible Rich Internet Applications)
Picture captions: 1960s Talking Typewriter; 1984 Talking 3270 Terminal; 1999 Home Page Reader (Japanese, Italian, French, German, Spanish, US English, UK English)

Status of Audio Descriptions in Japan
Movies (from NPO Media Access Support Center)
– 12.0%: ratio of Japanese movies with captions (2008)
– 0.9%: ratio of Japanese movies with audio descriptions
TV
– Ratio of TV programs with captions (2008) (*1): public TV 49.4%, private TV 42.3%
– Ratio of TV programs with audio descriptions (2008) (*2): public TV 5.6%, private TV 0.4%
*1: Ministry of Internal Affairs and Communications (2008)
*2: NICT: National Institute of Information and Communications Technology

Captions and Audio Descriptions for TV Programs
[Figure: line chart for 2001 to 2008, vertical axis 0% to 60%. Series: Captions - Public; Captions - Private; Audio descriptions - Public; Audio descriptions - Public (Education); Audio descriptions - Private. Based on data from MIC and NICT.]

Problems: Workload and Cost
[Figure: workload comparison. Captions: transcribing. Audio descriptions: recording and transcribing.]
Recording an audio description calls for a skilled narrator and a good recording environment.
Writing an audio description script requires special expertise to describe the scenes between dialogues and scene changes.

History of Text-to-speech Engines
[Figure: timeline from 1980 to 2010]
1983  DecTalk
1985  IBM
1996  ProTalker (IBM)
2004  Super Voice (IBM)
2008  Emotional TTS (IBM)

Possible Reduction of Workload
[Figure: workload comparison. Current audio descriptions: recording and transcribing. Synthesized audio descriptions: recording reduced by synthesis, transcribing reduced by tool support.]

Acceptance Ratio (United States)
Method: online survey
Participants: 236 (39 low-vision, 197 blind)
Genre: education and documentary
Voice quality: human and TTS (Heather)
[Figure: stacked bar chart, 0% to 100%, for Set 1 to Set 4. Response categories: Uncomfortable, Slightly Uncomfortable, Neutral, Acceptable, Comfortable.]
Consistently, 70% to 80% of the respondents gave a rating better than neutral.
Project: Research and Development of Standard Specifications for Audio Description and Caption Markup for People with Visual and Hearing Disabilities

Video Accessibility Project: Goals
Prove feasibility of text-based audio descriptions via user studies.
– Work with professional teams for audio descriptions
– Japan: IBM with CAP and content from NHK
– U.S.: WGBH
Create an open source platform for audio descriptions and captions
– Authoring tools and players
– Captions and text-based audio descriptions
– Based on Eclipse.org Accessibility Tools Framework (ACTF)
Contribute to standardization of Internet media accessibility
– Focus on “missing markups” in the existing standards.
– Maintain neutrality for existing standards.
– HTML5 is the primary target.
Supported by the Japanese government agency NICT (National Institute of Information and Communications Technology)

Thank you!

ACTF Script Editor
Authoring tool, specialized for audio descriptions.
Flexible import and export of various formats.
Planned for release as open source in March.

Audio Guides in Museums and on the Stage
Museums: Audio guides are widely used in museums and art museums. (Their main purpose is not to support visually impaired visitors but to help everyone study the exhibits.)
– Examples of audio guide providers:
• National Museum of Nature and Science, Tokyo
• The National Museum of Western Art
• Hiroshima Museum of Art
• Osaka Museum of Natural History
• Tokyo Museum of Fire Department
• Shimane Museum of Ancient Izumo
– Almost every museum in Japan provides an audio guide.
– Generally, audio guide equipment is specially designed and built by the manufacturer with prerecorded voice. A newer approach uses the Nintendo DS, with the content downloaded to the device at the museum.
The stage: Audio guides are offered mainly by small drama groups.
– Examples of audio guide providers:
• Drama group "Bakkari-Bakkari" provides an audio guide once in each performance period.
• A drama group in the city of Kawasaki, Kanagawa Pref.
• Drama group "DORA"
– For captions, Shiki Theatre Company is one example of a provider. Very few large-scale theater productions provide audio guides.

Laws and Regulations
1993  Act on Advancement of Facilitation Program for Disabled Persons' Use of Telecommunications and Broadcasting Services, with a View to Enhance Convenience of Disabled Persons (1993)
1997  MIC defined a goal to “provide captions to all TV programs by 1997”
1998  Broadcast Law, Article 3-2 (4): Any broadcaster shall, in compiling the broadcast programs for domestic broadcasting, provide as many broadcasting programs as possible which provide voices and other sounds to explain about transient images of fixed or moving objects for blind persons, and providing characters or patterns to explain about voices and other sounds for deaf persons.
2007  Signed the “Convention on the Rights of Persons with Disabilities”
2010  New JIS (Japanese Industrial Standard) for Web Accessibility
– Technical guidelines are fully harmonized with WCAG 2.0

ACTF aiBrowser
1. Direct audio control
– Allows users to raise or lower the volume, stop or play, and control audio speed with simple keyboard commands.
2. User interface simplification
– Structurally simplifies interfaces by converting dynamic visual interfaces into static text-based interfaces
– Dynamically adds alternative texts to images and buttons
3. Audio descriptions with text
– Infrastructure to provide video descriptions at low cost

Status of Audio Descriptions in Japan
Movies (from NPO Media Access Support Center)
– 12.0%: ratio of Japanese movies with captions (2008)
– 0.9%: ratio of Japanese movies with audio descriptions
TV
– Ratio of TV programs with captions (2008) (*1): public TV 49.4%, private TV 42.3%
– Ratio of TV programs with audio descriptions (2008) (*2): public TV 5.6%, private TV 0.4%
*1: Ministry of Internal Affairs and Communications (2008)
*2: NICT: National Institute of Information and Communications Technology
Internet (team investigation)
– 0.2%: ratio of video content with captions in the Open Courseware project (2 among 1,474)
– 0.0%: popular video sharing services and educational online videos were checked, but no videos with audio descriptions were found (except for videos prepared as examples of audio descriptions).

Analysis of Standards and Possible Focus
Layers of markup (vocabulary lists) for text-based audio descriptions, with related specifications:
– Personalization
– Association with video contents, multilingual, etc.: Mozilla <itext>, etc.
– Index structure for video (scenes and chapters, etc.): each video format has its own specifications (DVD, MPEG, etc.)
– Unique for audio descriptions (extended, audio control, block, etc.): FOCUS AREA!
– Voice styles and emotional expressions: W3C SSML, W3C EmotionML, etc.
– Description (textual information) and addressing (timing): SRT, W3C SMIL, W3C TT DFXP
– Flexible addressing

2nd study: Level of Description
Rate of correct answers for each level of description heard once or twice
[Figure: rate of correct answers (0% to 100%) against number of listenings (1, 2) for Normal and Extended descriptions; one value is labeled 30%.]
Using the extended description and listening twice both improved comprehension.

Difficulties in Online Videos
– News
– Entertainment
– E-Learning
– Historical Videos
– Consumer-Generated Videos
Now is the time to create a new technical framework for audio descriptions!

Prior Projects
e-Inclusion project in Canada, supported by Canadian Heritage
– CRIM (Centre de recherche informatique de Montréal)
– Four-year project completed this year
– Authoring tool and playback tool
LiveDescribe by Ryerson University
– Community-based authoring system
– Authoring tool and playback tool
NHK Research
– Prototyped and tested TTS-based audio descriptions
aiBrowser
– Developed by IBM Research and contributed to Eclipse.org
– Audio descriptions with Flash, QuickTime, and Windows Media Player
Other trials
– HTML5 + Live Region demo (Firefox team)
– WebShake
• Japanese online caption provider that prototyped TTS-based audio descriptions
– ACAV, etc.

Distribution Flexibility
[Figure: delivery paths for each method, from a human narrator or from text through a synthesizer to audio.]
Method                             Voice quality   Authoring cost   System cost
Human voice (current model)        High            High             High
Pre-recorded synthesized audio     Low*            Low              High**
Server-side synthesizer            Low*            Low              High
Client-side synthesizer            Lowest          Low              Low***
* Server-side synthesis is better than client-side synthesis.
** The systems for human voices can be reused.
*** Client-side software support is required.
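
The server-side and client-side rows above both distribute description text and leave speech generation to a synthesizer. A minimal sketch of that idea, assuming the descriptions are stored as SRT cues and the synthesizer accepts SSML (both formats are named on the standards slide, but pairing them this way is an assumption, not the project's design). The cue text and the helper names parse_srt and to_ssml are hypothetical.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical description cues in SRT form; the wording is illustrative only.
SRT_TEXT = """1
00:00:05,000 --> 00:00:08,500
A woman opens the door & steps outside.

2
00:00:12,000 --> 00:00:14,000
She waves to a neighbor.
"""

TIMESTAMP = re.compile(r"(\d+):(\d+):(\d+),(\d+)")


def to_seconds(stamp):
    """Convert an SRT timestamp such as 00:00:08,500 to seconds."""
    h, m, s, ms = (int(g) for g in TIMESTAMP.fullmatch(stamp.strip()).groups())
    return h * 3600 + m * 60 + s + ms / 1000


def parse_srt(text):
    """Split SRT text into cues with start time, end time, and description text."""
    cues = []
    for block in text.strip().split("\n\n"):
        lines = block.splitlines()
        start, end = lines[1].split("-->")
        cues.append({
            "start": to_seconds(start),
            "end": to_seconds(end),
            "text": " ".join(lines[2:]),
        })
    return cues


def to_ssml(cue, rate="medium"):
    """Wrap one cue in SSML; the prosody rate is an assumed tuning knob."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xml:lang="en-US">'
        f'<prosody rate="{rate}">{escape(cue["text"])}</prosody>'
        "</speak>"
    )


if __name__ == "__main__":
    for cue in parse_srt(SRT_TEXT):
        print(cue["start"], to_ssml(cue))
```

Whether the SSML is rendered on the server or in the player only changes where to_ssml's output is consumed; the distributed script stays plain text in both cases.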

Experimental Results (Japan)
1st study (Sep 2009)
– 3 blind or visually impaired participants
– Face-to-face, one-to-one sessions
– Focused on the voice quality, level of description, and speech speed
2nd study (Feb 2010)
– 24 blind or visually impaired participants
– Face-to-face, small group sessions
– Consisted of 4 sub-studies: long-term listening, expressive voices, describer expertise, and level of description

Status of Captions and Audio Descriptions in Japan
Movies (Japanese films released in 2008; from NPO Media Access Support Center materials)
– 12.0%: share of Japanese films released in 2008 that provided captions
– 0.9%: share of Japanese films released in 2008 that provided audio descriptions
Broadcasting
– Share of captioned broadcast hours in total broadcast hours, fiscal 2008 (*1): NHK General 49.4%, Tokyo commercial broadcasters 42.3%
– Share of described broadcasts on terrestrial channels of the Tokyo key stations, fiscal 2008 (*2): NHK General 5.6%, Tokyo commercial broadcasters 0.4%
*1: Ministry of Internal Affairs and Communications press release, "Results of caption broadcasting, etc. in fiscal 2008"
*2: NICT (National Institute of Information and Communications Technology) materials
Internet (independent survey within this project)
– 0.2%: captioning rate in OpenCourseWare (educational content); 2 of 1,417 videos
– 0.0%: a sampling survey of major video distribution sites and educational content found no videos with audio descriptions

1st study: Results
[Figure: two bar charts of effectiveness scores (0 to 100), one for "drama" videos and one for "cooking" videos, each comparing without AD and with AD for Human, Traditional TTS, and Modern TTS voices.]
The descriptions greatly improved the user experience regardless of the voice quality.
The participants' comments indicated that Modern TTS was almost comparable to a human voice, though the human voice was still preferred.

2nd study: Sub-studies
1. Long-term listening
– Assess whether TTS-based descriptions are acceptable for listening to full-length programs
– Target videos: cartoon (comedy), drama (tragedy), documentary
2. Expressive voices
– Determine whether expressive TTS improves the user experience
– Target videos: cartoon (comedy), drama (tragedy)
3. Describer expertise
– Assess how describer expertise affects understanding
– Target video: public service announcement (warning about fraud)
4. Level of description
– Assess how the level of description and repeated listening affect understanding
– Target video: instructional program (how to fold and store clothing)

2nd study: Long-term Listening
Effectiveness scores for each video category
[Figure: frequency (0 to 20) of scores 1 to 5 for Cartoon (Comedy), Drama (Tragedy), and Documentary.]
TTS-based descriptions were generally acceptable for full-length programs.
In the comments, the documentary received the highest evaluation, although this was not evident from the effectiveness scores.

2nd study: Describer Expertise
Effectiveness scores for each describer expertise and level of description
[Figure: frequency (0 to 12) of scores 1 to 5 for Expert (Normal), Expert (Extended), Novice (Normal), and Novice (Extended).]
Novice (Normal) was not preferred (score: 3.0).
Novice (Extended) was comparable (score: 4.3) to the expert descriptions (score: 4.3 for normal, 4.6 for extended).

Typical Client-side TTS Setting
[Figure: components of a client-side TTS setup: Website, Online Video, Script Editor, Audio Description Script, Metadata Repository, Video Player.]

Related Guidelines and Laws
W3C Web Content Accessibility Guidelines 2.0 (Recommendation, December 2008)
– 1.2.5 Audio Description (Prerecorded) (Level AA)
– 1.2.7 Extended Audio Description (Prerecorded) (Level AAA)
Japan: Revised Copyright Act (enacted June 2009, in force January 1, 2010)
Japan: JIS X 8341-3:2010 (scheduled for publication around June 2010)
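
The difference between a normal and an extended description can be illustrated with a small scheduling sketch: a description that fits in the gap between dialogues plays as is, while a longer one requires the video to pause. This is only an illustration under assumptions: the Description class, the plan_playback helper, and the 160 words-per-minute speaking rate are all hypothetical and are not taken from the studies or from any player.

```python
from dataclasses import dataclass


@dataclass
class Description:
    start: float  # seconds into the video where a gap in the dialogue begins
    gap: float    # seconds of silence available before dialogue resumes
    text: str     # description text to be synthesized


def estimated_duration(text: str, words_per_minute: float = 160.0) -> float:
    """Rough speaking time in seconds; 160 wpm is an assumed TTS rate."""
    return len(text.split()) * 60.0 / words_per_minute


def plan_playback(descriptions, words_per_minute: float = 160.0):
    """Label each description and compute how long the video must pause."""
    schedule = []
    for d in descriptions:
        needed = estimated_duration(d.text, words_per_minute)
        # A positive pause means the description overruns the available gap.
        pause = max(0.0, needed - d.gap)
        kind = "extended" if pause > 0 else "normal"
        schedule.append((d.start, kind, round(pause, 2)))
    return schedule


if __name__ == "__main__":
    script = [
        Description(12.0, 4.0, "She lays the shirt face down on the table."),
        Description(30.0, 1.5, "She folds each sleeve toward the center, "
                               "then folds the shirt in half from the bottom."),
    ]
    for entry in plan_playback(script):
        print(entry)
```

Because the script is text, the same schedule can be recomputed for a different speech speed simply by changing words_per_minute, which a pre-recorded narration does not allow.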