Speaking Style Conversion
Dr. Elizabeth Godoy
Speech Processing Guest Lecture
December 11, 2012

Apply VC principles to a different problem…

E.Godoy, Speaking Style Conversion, December 11, 2012

Speech Intelligibility Context
- Speech is often heard in adverse conditions
  - Noisy environments
  - Listener has difficulty hearing/understanding
- [Sound samples: noise, no noise] Example of speech with environmental barriers: the speech is not very intelligible!
- How to transform speech to make it more intelligible…?
  - To make speech synthesis systems more effective

Intelligible Speaking Styles
I. Lombard speech
   - Speaker is immersed in noise
   - Human reflex to increase the speech loudness
   - [Sound samples: normal, Lombard]
II. Clear speech
   - Listener faces barrier (noise, hearing, language,…)
   - Speaker adapts strategy to increase speech clarity
   - [Sound samples: casual, clear]

VC to improve speech intelligibility?
Voice Conversion
- Modify speech to change the speaker identity
- Learn transformation from source-to-target speaker
Speaking Style Conversion
- Modify speech to improve intelligibility
- Determine transformation from normal-to-intelligible style
Spectral Envelope: still very important!

Overview: Analyses-to-Modifications
I. Acoustic analyses to identify (mainly spectral) characteristics of Lombard & Clear styles
   i. Average Spectra
   ii. Vowel Spaces
II. Results of the analyses inspire spectral modifications to improve intelligibility
   i. Spectral energy band boosting (corrective filters)
   ii. Formant shifting (frequency warping)

Corpora
Lombard-normal: Grid
- 8 speakers (4 male, 4 female)
- 50 sentences each
- LombardNinf96: most extreme (Lu & Cooke)
Clear-casual: LUCID read sentences
- 8 speakers (4 male, 4 female)
- 50 sentences each
- Read speech: most exaggerated (Baker & Hazan)

Average Relative Spectra
- Recall Amplitude Scaling in DFWA:
  $\log(A_q(f)) = \log(S_q^y(f)) - \log(S_q^x(W_q^{-1}(f)))$
- Average Relative Spectra is similar: the difference between normal (X) and intelligible (Y) style, averaged across all frames:
  $\log(S_R(f)) = \log(S_q^Y(f)) - \log(S_q^X(f))$

Average Relative Spectra (by Speaker)
[Figure, left panel: "GRID Average Relative Spectra for each speaker" (Lombard-normal). Axes: Hz, speaker index, dB.]
[Figure, right panel: "LUCID Average Relative Spectra for each speaker" (Clear-casual). Axes: Hz, speaker index, dB.]

Average Relative Spectra (Overall)
[Figure: "Average Relative Spectra: All frames, All speakers". Curves: Lombard-normal, Clear-casual. Axes: Hz (0-8000), dB.]
- Lombard speech: Spectral energy boosting “where formants are” (~500-4500 Hz)
- Clear speech: Varies depending on speaker strategy; extent of differences mild overall

Vowel Spaces (average for all speakers)
[Figure, left panel: "Lombard-normal: Vowel Space, ALL Speakers". Curves: normal, lombard. Axes: F1 (Hz), F2 (Hz).]
[Figure, right panel: "Clear-casual: Vowel Space, ALL Speakers". Curves: casual, clear. Axes: F1 (Hz), F2 (Hz).]
- Lombard speech: Vowel Space Translation
- Clear speech: Vowel Space Expansion
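The average relative spectrum defined on the earlier slide can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the slides work with spectral envelopes, whereas this sketch uses the plain DFT magnitude of Hann-windowed frames as a stand-in, and it averages each style separately rather than over aligned frame pairs. The function names, frame length, and FFT size are hypothetical, not taken from the lecture.

```python
import numpy as np

def log_spectrum_db(frames, n_fft=512):
    """Log-magnitude spectrum in dB of each frame; shape (n_frames, n_fft // 2 + 1).

    Assumption: raw DFT magnitude stands in for the spectral envelope S(f).
    """
    window = np.hanning(frames.shape[1])
    magnitude = np.abs(np.fft.rfft(frames * window, n=n_fft, axis=1))
    return 20.0 * np.log10(magnitude + 1e-12)  # small floor avoids log(0)

def average_relative_spectrum(frames_x, frames_y, n_fft=512):
    """Mean log spectrum of style Y minus mean log spectrum of style X, in dB."""
    mean_y = log_spectrum_db(frames_y, n_fft).mean(axis=0)
    mean_x = log_spectrum_db(frames_x, n_fft).mean(axis=0)
    return mean_y - mean_x

# Synthetic check (not real speech): style Y is style X at twice the amplitude,
# so the relative spectrum should sit near 20*log10(2) dB at every frequency.
rng = np.random.default_rng(0)
frames_normal = rng.standard_normal((200, 400))
frames_style = 2.0 * frames_normal
relative_db = average_relative_spectrum(frames_normal, frames_style)
```

With real recordings, `frames_normal` and `frames_style` would hold frames from the normal (X) and intelligible (Y) styles, and peaks in `relative_db` would mark the bands that the intelligible style boosts.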
Inspiration for Speech Modifications
1. Spectral energy band boosting (Lombard)
2. Vowel space expansion (Clear)
- Features credited with increased speech intelligibility
- Though not observed together in human speech production… signal processing algorithms can accomplish both!

Spectral Energy Band Boosting: Corrective Filters
[Figure, left panel: "Average Correction Filter for All Speakers". Legend: all frames; Enhanced (Lombard: high SII). Axes: Hz (0-8000), dB. Caption: Lombard-inspired & Enhanced (high SII).]
[Figure, right panel: "Spectral Energy Band Boosting, Varying Gain 0:0.5:3". Axes: Hz (0-8000), dB. Caption: Corrective Filter: Varying Gain.]

Frequency Warping for VS Expansion
[Figure, left panel: "Clear-casual: Vowel Space, ALL Speakers". Curves: casual, clear. Axes: F1 (Hz), F2 (Hz).]
[Figure, right panel: "LUCID: Frequency differences for F1, F2; ALL". Curves: F1diff, F2diff. Axes: Casual F1 and F2 (Hz), Hz.]
- Curve fitting formant shifts inspires warping…

Sound Samples
- With Noise (SSN, 0dB): Original, Warp, Boost, BW
- No Noise: Original, WarpE, Boost, BW

Want more?
See Maria’s presentation for more details…

Voice & Speaking Style Conversion Parallels
Voice Conversion
- Dynamic Frequency Warping + Amplitude Scaling (based on acoustic-phonetic spaces of source & target speakers)
Speaking Style Conversion
- Frequency Warping + Corrective Filter
  1. Clear-speech inspired frequency warping for vowel space expansion
  2. Lombard-speech inspired corrective filters to increase loudness

Thank you! More Questions?
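As a supplement to the Lombard-inspired corrective filter discussed in the talk, here is a minimal sketch of spectral energy band boosting with an adjustable gain. It is an illustration, not the filter used in the lecture: the curve shape (a flat boost with raised-cosine edges), the 5 dB peak, and the 250 Hz edge width are hypothetical choices, and only the approximate 500-4500 Hz band comes from the slides. The `gain` parameter scales the curve in dB, in the spirit of the "Varying Gain" panel.

```python
import numpy as np

def band_boost_curve_db(freqs_hz, lo=500.0, hi=4500.0, peak_db=5.0, edge_hz=250.0):
    """Hypothetical correction curve: peak_db inside [lo, hi], raised-cosine edges."""
    rise = np.clip((freqs_hz - (lo - edge_hz)) / edge_hz, 0.0, 1.0)
    fall = np.clip(((hi + edge_hz) - freqs_hz) / edge_hz, 0.0, 1.0)
    ramp = np.minimum(rise, fall)  # 0 outside the band, 1 inside it
    return peak_db * 0.5 * (1.0 - np.cos(np.pi * ramp))

def apply_corrective_filter(signal, fs, gain=1.0):
    """Scale the boost curve by `gain` (in dB) and apply it in the frequency domain."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    boost_db = gain * band_boost_curve_db(freqs)
    return np.fft.irfft(spectrum * 10.0 ** (boost_db / 20.0), n=len(signal))

# Synthetic check: one tone inside the boosted band, one tone above it.
fs = 16000
t = np.arange(fs) / fs
test_signal = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 6000 * t)
boosted = apply_corrective_filter(test_signal, fs, gain=2.0)
unchanged = apply_corrective_filter(test_signal, fs, gain=0.0)
```

Processing the whole signal with a single FFT keeps the sketch short; a practical system would more likely filter frame by frame and might renormalize the overall energy afterwards, neither of which is shown here.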
Extras…

Objective Metrics for Evaluation
I. Loudness
   - Energy in frequency bands weighted based on human hearing
II. Speech Intelligibility Index (SII)
   - Energy & modulations in frequency bands relative to a noise masker

Loudness Distributions
[Figure: two "Loudness Histogram" panels. Left: normal vs. lombard; right: casual vs. clear. x-axis: Loudness value.]
- Lombard speech: “louder” for voiced (bi-modal)
- Clear speech: not “louder” than casual speech
- Transients: neither style distinguishes on average

Extended SII Distributions
[Figure: two "extended SII Histogram" panels. Left: normal vs. lombard; right: casual vs. clear. x-axis: SII.]
- extSII highly correlated with ave loudness
- Lombard speech objectively more intelligible
- Clear speech intelligibility gain not captured by extSII: limitations of objective intelligibility metrics

Observations from Analyses
Lombard Speech
- Spectral boosting in inclusive formant region
- Vowel space translation, but no expansion
- Result: Increase in Loudness (also extSII)
Clear Speech
- Small changes in average spectra (slight spectral “flattening”)
- Consistent vowel space expansion
- Result: Greater vowel discrimination
Comparison between styles
- Acoustic differences translate into perceptual distinctions linked to intelligibility gains
- Spectral boosting & Vowel space expansion: mutually exclusive
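The distinction drawn above between vowel space translation (Lombard) and expansion (Clear) can be made concrete with two numbers: the shift of the vowel space centroid and the ratio of the polygon areas. The sketch below is a hedged illustration only; the corner-vowel formant values are invented for the example and are not measurements from the Grid or LUCID corpora, and the function names are hypothetical.

```python
import numpy as np

def polygon_area(points):
    """Shoelace area of a polygon whose (F1, F2) vertices are listed in perimeter order."""
    x, y = points[:, 0], points[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def compare_vowel_spaces(base, styled):
    """Return (centroid shift in Hz, area ratio) between two vowel spaces."""
    shift = styled.mean(axis=0) - base.mean(axis=0)
    ratio = polygon_area(styled) / polygon_area(base)
    return shift, ratio

# Hypothetical corner vowels as (F1, F2) in Hz; illustrative values only.
base_space = np.array([[300.0, 2300.0], [700.0, 1300.0], [350.0, 900.0]])

# Pure translation: every vowel moves by the same offset, so the area is unchanged.
translated = base_space + np.array([50.0, 100.0])

# Pure expansion: vowels move away from the centroid, which itself stays put.
centroid = base_space.mean(axis=0)
expanded = centroid + 1.2 * (base_space - centroid)

shift_t, ratio_t = compare_vowel_spaces(base_space, translated)
shift_e, ratio_e = compare_vowel_spaces(base_space, expanded)
```

A translation shows up as a nonzero centroid shift with an area ratio of 1, while an expansion shows up as an area ratio above 1 with the centroid roughly in place.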