Voice Conversion
Dr. Elizabeth Godoy
Speech Processing Guest Lecture
December 11, 2012

Biography
I. United States (Rhode Island): native. Hometown: Middletown, RI; undergrad & Masters at MIT (Boston area).
II. France (Lannion), 2007-2011: worked on my PhD at Orange Labs.
III. Iraklio: currently working at FORTH with Prof. Stylianou.

Professional Background
I. B.S. & M.Eng. from MIT, Electrical Engineering, with a specialty in signal processing:
   - Underwater acoustics: target physics, environmental modeling, torpedo homing
   - Antenna beamforming (Masters): wireless networks
II. PhD in signal processing at Orange Labs:
   - Speech processing: voice conversion, in the Speech Synthesis Team (Text-to-Speech)
   - Focus on spectral envelope transformation
III. Post-doctoral research at FORTH:
   - LISTA: speech in noise & intelligibility
   - Analyses of human speaking styles (e.g. Lombard, clear speech)
   - Speech modifications to improve intelligibility

Today's Lecture: Voice Conversion
I. Introduction to Voice Conversion: speech synthesis context (TTS); overview of voice conversion
II. Spectral Envelope Transformation in VC: standard (Gaussian Mixture Model); proposed (Dynamic Frequency Warping + Amplitude Scaling)
III. Conversion Results: objective metrics & subjective evaluations; sound samples
IV. Summary & Conclusions

I. Introduction to Voice Conversion

Voice Conversion (VC)
Transform the speech of a (source) speaker so that it sounds like the speech of a different (target) speaker.
(The source says "This is awesome!"; after voice conversion, the same sentence sounds as if spoken by the target: "He sounds like me!")

Context: Speech Synthesis
Increase in applications using speech technologies: cell phones, GPS, video gaming, customer service apps... ("Turn left!",
"Insert your card.", "Ha ha! This is Abraham Lincoln speaking...", "Next stop: Lannion".) Information is communicated through speech -- hence Text-to-Speech!

Text-to-Speech (TTS) Synthesis
Generate speech from a given text.

Text-to-Speech (TTS) Systems
TTS approaches:
1. Concatenative: speech synthesized from recorded segments.
   - Unit-selection: parts of speech chosen from corpora & strung together
   - High-quality synthesis, but need to record & process the corpora
2. Parametric: speech generated from model parameters.
   - HMM-based: speaker models built from speech using linguistic info
   - Limited quality due to simplified speech modeling & statistical averaging
Concatenative or parametric?

Text-to-Speech (TTS) Example
[sound sample: example TTS voice]

Voice Conversion: TTS Motivation
Concatenative speech synthesis gives high-quality speech, but a large corpus must be recorded & processed for each voice. Voice conversion instead creates different voices by speech-to-speech transformation, focusing on the acoustics of the voices.

What gives a voice an identity?
"Voice" carries a notion of identity (voice rather than speech). Speech can be characterized on different levels:
1. Segmental: pitch (fundamental frequency); timbre (distinguishes between different types of sounds)
2. Supra-segmental: prosody (intonation & rhythm of speech)

Goals of Voice Conversion
1. Synthesize high-quality speech: maintain the quality of the source speech (limit degradations).
2. Capture the target speaker identity: requires learning mappings between source & target features.
A difficult task!
Significant modifications of the source speech are needed, which risk severely degrading the speech quality...

Stages of Voice Conversion
1) Analysis: parameter extraction from the source & target speech corpora.
2) Learning: generating acoustic feature spaces; establishing mappings between the source & target parameters.
3) Transformation: classify the source parameters in the feature space; apply the transformation function; synthesize the converted speech.
Key parameter: the spectral envelope (related to timbre).

II. Spectral Envelope Transformation in VC

The Spectral Envelope
Spectral envelope: a curve approximating the DFT magnitude. [figure: DFT magnitude (dB) and its envelope, 0-8000 Hz]
Related to voice timbre, it plays a key role in many speech applications: coding, recognition, synthesis, voice transformation/conversion. In voice conversion it is important for both speech quality and voice identity.

Spectral Envelope Parameterization
Two common methods:
1) Cepstrum: discrete cepstral coefficients; Mel-Frequency Cepstral Coefficients (MFCC) change the frequency scale to reflect the bands of human hearing.
2) Linear Prediction (LP): Line Spectral Frequencies (LSF).

Standard Voice Conversion
Focus: learning & transforming the spectral envelope.
- Parallel corpora: source & target utter the same sentences
- Extract the source & target parameters, align the frames, learn, transform, then synthesize the converted speech
- Parameters are spectral features (e.g.
vectors of cepstral coefficients)
- Alignment of the speech frames in time
Standard approach: the Gaussian Mixture Model.

Standard: Gaussian Mixture Model
Formulation, then two limitations: 1. acoustic mappings between the source & target parameters; 2. over-smoothing of the spectral envelope.

Gaussian Mixture Model for VC
Origins:
- Evolved from "fuzzy" Vector Quantization (i.e. VQ with "soft" classification)
- Originally proposed by [Stylianou et al.; 98]
- Joint learning of the GMM (most common) by [Kain et al.; 98]
Underlying principle: exploit the joint statistics exhibited by aligned source & target frames.
Methodology: represent the distributions of the spectral feature vectors as a mixture of Q Gaussians; the transformation function is then based on an MMSE criterion.

GMM-based Spectral Transformation
1) Align N spectral feature vectors (discrete cepstral coefficients) in time:
   source X = {x_1, ..., x_N}, target Y = {y_1, ..., y_N}, joint Z = [X; Y].
2) Represent the PDF of the joint vectors as a mixture of Q multivariate Gaussians:
   p(z) = Σ_{q=1}^{Q} α_q N(z; μ_q, Σ_q), with Σ_{q=1}^{Q} α_q = 1, α_q ≥ 0.
   Learn {α_q, μ_q, Σ_q}, q = 1..Q, via Expectation Maximization (EM) on Z.
3) Transform the source vectors using a weighted mixture of the Maximum Likelihood (ML) estimator for each component:
ŷ_n = F(x_n) = Σ_{q=1}^{Q} w_q^x(x_n) [ μ_q^y + Σ_q^{yx} (Σ_q^{xx})^{-1} (x_n − μ_q^x) ]
where w_q^x(x_n) is the probability that the source frame belongs to the acoustic class described by component q (calculated in the decoding step).

GMM Transformation Steps
1) Given a source frame x_n, we want to estimate the target vector ŷ_n.
2) Classify x_n: calculate w_q^x(x_n), the probability that the source frame belongs to the acoustic class described by component q (decoding step).
3) Apply the transformation function above: a weighted sum of the ML estimators for the classes.

Acoustically Aligning Source & Target Speech
First question: are the acoustic events from the source & target speech appropriately associated? Time alignment vs. acoustic alignment?

The One-to-Many Problem
Occurs when a single acoustic event of the source is aligned to multiple acoustic events of the target [Mouchtaris et al.; 07]. The problem is that the distinct target events cannot be distinguished given only the source information.
Example: the source has one single event A_s, while the target has two distinct events B_t and C_t, producing the joint frames [A_s; B_t] and [A_s; C_t]. [figure: acoustic space with A_s aligned to both B_t and C_t]

Cluster by Phoneme ("Phonetic GMM")
Motivation: eliminate mixtures to alleviate the one-to-many problem by introducing contextual information. A phoneme is a unit of speech describing a linguistic sound: /a/, /e/, /n/, etc.
Formulation: cluster the frames according to phoneme label; each Gaussian class q then corresponds to a phoneme:
μ_q = (1/N_q) Σ_{l=1}^{N_q} z_l,   Σ_q = (1/N_q) Σ_{l=1}^{N_q} (z_l − μ_q)(z_l − μ_q)^T,
where the N_q frames are those for phoneme q.
Outcome: the error indeed decreases using classification by phoneme; specifically, the errors from one-to-many mappings are reduced!

Still, GMM Limitations for VC
Unfortunately, the converted speech quality is poor [sound samples: original vs. converted].
What about the joint statistics in the GMM? What is the difference between the GMM and a VQ-type (no joint statistics) transformation?
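As a concrete illustration, the GMM transformation function above can be sketched in a few lines of NumPy. This is a minimal sketch with toy parameters, not the system described in the lecture; all variable names are chosen here for clarity.

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Multivariate Gaussian density N(x; mu, cov)."""
    d = len(mu)
    diff = x - mu
    expo = -0.5 * diff @ np.linalg.solve(cov, diff)
    return np.exp(expo) / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))

def gmm_convert(x, alpha, mu_x, mu_y, cov_xx, cov_yx):
    """GMM-based conversion of one source frame x:
    y_hat = sum_q w_q(x) [mu_q^y + Sigma_q^yx (Sigma_q^xx)^-1 (x - mu_q^x)]."""
    Q = len(alpha)
    # decoding step: posterior probability w_q(x) that x belongs to class q
    dens = np.array([alpha[q] * gaussian_pdf(x, mu_x[q], cov_xx[q])
                     for q in range(Q)])
    w = dens / dens.sum()
    # weighted sum of the per-class ML estimators
    y_hat = np.zeros(mu_y[0].shape)
    for q in range(Q):
        y_hat += w[q] * (mu_y[q]
                         + cov_yx[q] @ np.linalg.solve(cov_xx[q], x - mu_x[q]))
    return y_hat
```

Note that when the cross-covariance Σ_q^{yx} is (near) zero, i.e. the weak inter-speaker correlation case, the estimate collapses to the weighted class means: exactly the VQ-type behavior.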
GMM (joint statistics): ŷ_n = Σ_{q=1}^{Q} w_q^x(x_n) [ μ_q^y + Σ_q^{yx} (Σ_q^{xx})^{-1} (x_n − μ_q^x) ]
VQ-type: ŷ_n = Σ_{q=1}^{Q} w_q^x(x_n) μ_q^y
No significant difference! (Originally shown by [Chen et al.; 03].)

"Over-Smoothing" in GMM-based VC
The over-smoothing problem (which explains the poor quality): the spectral envelopes transformed using the GMM are "overly smooth". The loss of spectral details makes the converted speech sound "muffled", with a "loss of presence".
Cause (back to those joint statistics): low inter-speaker parameter correlation, i.e. a weak statistical link exhibited by the aligned source & target frames. The cross-covariance term Σ_q^{yx} (Σ_q^{xx})^{-1} (x_n − μ_q^x) is then small, so that
ŷ_n = F(x_n) ≈ Σ_{q=1}^{Q} w_q^x(x_n) μ_q^y   (a weighted sum of the class means).
Frame-level alignment of the source & target parameters is not effective!

"Spectrogram" Example (GMM over-smoothing)
[figure: spectral envelopes across a sentence (0-6000 Hz) for X, GMM, Phonetic GMM, DFWA, and Y]

Proposed: Dynamic Frequency Warping + Amplitude Scaling
Related work, then the DFWA description.

Dynamic Frequency Warping (DFW)
Dynamic Frequency Warping [Valbret; 92]: warp the source spectral envelope in frequency to resemble that of the target. An alternative to the GMM: no transformation with joint statistics!
- Maintains the spectral details, giving higher quality speech
- The spectral amplitude is not adjusted explicitly, giving a poor identity transformation

Spectral Envelopes of a Frame: DFW
[figure: envelopes (dB, 0-8000 Hz) of the target Y, the source X, and the DFW-warped source]

Hybrid DFW-GMM Approaches
Goal: adjust the spectral amplitude after DFW by combining DFW- and GMM-transformed envelopes [Toda; 01], [Erro; 10]. These rely on arbitrary smoothing factors and impose a compromise between:
1. maintaining the spectral details (via DFW)
2. respecting the average trends (via GMM)

Proposed Approach: DFWA
Observation: we do not need the GMM to adjust the amplitude! Alternatively, we can use a simple amplitude correction term.
Dynamic Frequency Warping with Amplitude scaling (DFWA):
S_dfwa,n(f) = A_q(f) · S_{x_n}(W_q^{-1}(f))
where S_{x_n}(W_q^{-1}(f)) contains the spectral details and A_q(f) respects the average trends.

Levels of Source-Target Parameter Mappings
[figure: from (1) frame-level mappings, the standard (GMM), to (3) class-level/global mappings (DFWA)]

Recall: Standard Voice Conversion
- Parallel corpora: source & target utter the same sentences
- Alignment of individual source & target frames in time
- Mappings between the acoustic spaces determined on the frame level

DFW with Amplitude Scaling (DFWA)
Associate the source & target parameters based on class statistics. Unlike typical VC, no frame alignment is required in DFWA! Three steps:
1. Define the acoustic classes (clustering)
2. DFW
3. Amplitude scaling
[block diagram: extract the source & target parameters from the corpora, cluster each, estimate the DFW & amplitude scaling, then synthesize the converted speech]

Defining Acoustic Classes
The acoustic classes are built using clustering, which serves two purposes:
1. Classify individual source or target frames in the acoustic space
2. Associate source & target classes
The choice of clustering approach depends on the available information:
- Acoustic information: with aligned frames, joint clustering (e.g. a joint GMM or VQ); without aligned frames, independent clustering & association of the classes (e.g. by closest means)
- Contextual information: use symbolic information (e.g. phoneme labels)
Outcome: q = 1, 2, ..., Q acoustic classes in a one-to-one correspondence.

DFW Estimation
- Compares the distributions of the observed source & target spectral envelope peak frequencies (e.g. formants)
- Global vision of the spectral peak behavior (for each class)
- Only the peak locations (not their amplitudes) are considered
- The DFW function is piecewise linear, with intervals defined by aligning maxima in the spectral peak frequency distributions (a Dijkstra algorithm minimizes the sum of absolute differences between the target & warped source distributions):
W_q(f) = B_{q,m} f + C_{q,m}, for f ∈ [f^x_{q,i_m}, f^x_{q,i_{m+1}}], aligning (f^x_{q,i_m} → f^y_{q,j_m}), m = 1, ..., M_q,
with B_{q,m} = (f^y_{q,j_{m+1}} − f^y_{q,j_m}) / (f^x_{q,i_{m+1}} − f^x_{q,i_m}) and C_{q,m} = f^y_{q,j_m} − B_{q,m} f^x_{q,i_m}.

DFW Estimation: Peak Occurrence Distributions
[figure: peak occurrence distributions (0-8000 Hz) for the source, and for the target vs. the warped source]
- Clear global trends in the peak locations; avoids sporadic frame-to-frame comparisons
- The Dijkstra algorithm selects the pairs, so the DFW statistically aligns the most probable spectral events (peak locations)
- Amplitude scaling then adjusts the differences between the target & warped source spectral envelopes

Spectral Envelopes of a
Frame
[figure: envelopes (dB, 0-8000 Hz) of the target Y, the source X, and the DFW-warped source]

Amplitude Scaling
Goal of amplitude scaling: adapt the frequency-warped source envelopes to better resemble those of the target; estimation uses the statistics of the target and warped source data.
For class q, the amplitude scaling function A_q(f) is defined by:
log(A_q(f)) = avg_q[ log(S^y(f)) ] − avg_q[ log(S^x(W_q^{-1}(f))) ]
- The comparison between the source & target data is strictly on the acoustic class level
- It is a difference in average log spectra: no arbitrary weighting or smoothing!

DFWA Transformation
The transformation, for a source frame n in class q:
S_dfwa,n(f) = A_q(f) · S_{x_n}(W_q^{-1}(f))
where S_{x_n}(W_q^{-1}(f)) contains the spectral details, A_q(f) respects the average trends, and
log(A_q(f)) = avg_q[ log(S^y(f)) ] − avg_q[ log(S^x(W_q^{-1}(f))) ].
In the discrete cepstral domain (log-spectrum parameters):
ŷ_n^dfwa = μ_q^y + (ν_n − ν̄_q)
where ν_n are the cepstral coefficients for S_{x_n}(W_q^{-1}(f)) and ν̄_q is the mean of ν_n over class q, representing the average log(S^x(W_q^{-1}(f))).
- The average target envelope is respected (an "unbiased estimator"): captures the timbre!
- The spectral details are maintained, as the difference between the frame realization & the class average: ensures quality!
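The warp-then-scale idea can be sketched in the log-spectral domain as follows. This is a minimal sketch, not the lecture's implementation (which operates on discrete cepstral coefficients); the function names, the sampled-envelope representation, and the precomputed class averages are all assumptions made for illustration.

```python
import numpy as np

def piecewise_linear_warp(f, src_knots, tgt_knots):
    """W_q: piecewise-linear frequency warp through aligned source/target
    peak frequencies (the knot pairs selected by the Dijkstra step)."""
    return np.interp(f, src_knots, tgt_knots)

def dfwa_convert(log_env_src, freqs, src_knots, tgt_knots,
                 avg_log_env_tgt, avg_log_env_warped_src):
    """Apply S_dfwa(f) = A_q(f) * S_x(W_q^-1(f)) in the log domain."""
    # evaluate the source envelope at W_q^{-1}(f): invert the warp by
    # interpolating with the knot lists swapped (target -> source)
    inv_f = piecewise_linear_warp(freqs, tgt_knots, src_knots)
    warped = np.interp(inv_f, freqs, log_env_src)
    # class-level amplitude scaling: difference of average log spectra
    log_A = avg_log_env_tgt - avg_log_env_warped_src
    return warped + log_A
```

With identity knots and zero class correction, the envelope passes through unchanged, which makes the two roles explicit: the warped frame carries the spectral details, and log A_q carries the average target trend.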
Spectral Envelopes of a Frame
[figure: envelopes (dB, 0-8000 Hz) of Y, X, DFW, DFWA, and the Phonetic GMM]

Spectral Envelopes across a Sentence
[figure: envelope "spectrograms" (0-6000 Hz) for X, the Phonetic GMM, DFWA, and Y, with a zoomed region]

Spectral Envelopes across a Sound
[figure: envelope evolution in time (0-3500 Hz) for X, the Phonetic GMM, DFWA, and Y]
- DFWA looks like natural speech
- The warping can be seen appropriately shifting the source events; overall, DFWA more closely resembles the target
- Important to note: examining the spectral envelope evolution in time is very informative! The poor quality of the GMM can be seen right away (it is less evident within a single frame).
III. Conversion Results: objective metrics & subjective evaluations; sound samples.

Formal Evaluations
Speech corpora & analysis:
- CMU ARCTIC, US English speakers (2 male, 2 female)
- Parallel annotated corpora, speech sampled at 16 kHz
- 200 sentences for learning, 100 sentences for testing
- Speech analysis & synthesis: harmonic model; all spectral envelopes as discrete cepstral coefficients (order 40)
Spectral envelope transformation methods:
- Phonetic & traditional GMMs (diagonal covariance matrices)
- DFWA (classification by phoneme)
- DFWE (hybrid DFW-GMM with energy correction)
- source (no transformation)

Objective Metrics for Evaluation
The (standard) Mean Squared Error alone is not sufficient! It does not adequately indicate the converted speech quality, and it does not indicate the variance in the transformed data.
Variance Ratio (VR): the average ratio of the transformed-to-target data variance. A more global indication of behavior: does the transformed data behave like the target data? Is the transformed data varying about the mean?

Objective Results
- The GMM-based methods yield the lowest MSE but no variance; DFWA yields a higher MSE but more natural variations in the speech
- The VR of DFWA comes directly from the variations in the warped source envelopes (not from model adaptations!)
- The hybrid (DFWE) falls in between DFWA & the GMM (as expected), confirming the over-smoothing; compared to DFWA, its VR is notably decreased after the energy correction filter
- MSE & VR give different indications: both are important and complementary

DFWA for VC with Nonparallel Corpora
Parallel corpora are a big constraint: they limit the data that can be used in VC. DFWA also works with nonparallel corpora (here, 200 disjoint sentences of learning data):

  Corpora        MSE (dB)   VR
  Parallel       -7.74      0.930
  Nonparallel    -7.71      0.936

DFWA is equivalent for parallel or nonparallel corpora!
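The two objective metrics can be sketched as follows (an illustration only; the feature matrices are assumed to be frames × coefficients, with converted and target frames already aligned):

```python
import numpy as np

def mse(converted, target):
    """Mean squared error between aligned converted & target features."""
    return np.mean((converted - target) ** 2)

def variance_ratio(converted, target):
    """Average per-dimension ratio of transformed-to-target variance.
    Values near 1 mean the converted data varies like natural target
    data; values near 0 indicate over-smoothing."""
    return np.mean(np.var(converted, axis=0) / np.var(target, axis=0))
```

This makes the complementarity concrete: a conversion that collapses every frame to the class mean can score a low MSE yet has VR near 0, which is exactly the over-smoothing signature.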
Implementation in a VC System
s_x(t) (source speech) → speech analysis (harmonic) → spectral envelope transformation → speech frame synthesis → TD-PSOLA (pitch modification) → s_ŷ(t) (converted speech)
- Pitch modification using TD-PSOLA (Time-Domain Pitch-Synchronous Overlap-Add)
- The pitch modification factor is determined by a classic linear transformation of f0, mapping the source f0 statistics onto those of the target
- Frame synthesis:
  - voiced frames: harmonic amplitudes from the transformed spectral envelopes; harmonic phases from nearest-neighbor source sampling
  - unvoiced frames: white noise through an AR filter

Subjective Testing
Listeners evaluate:
1. Quality of the speech
2. Similarity of the voice to the target speaker
Mean Opinion Score (MOS) scale from 1 to 5.

Subjective Test Results
- DFWA yields better quality than the other methods, with roughly equal success for similarity
- No observable differences between acoustic decoding & phonetic classification
- The converted speech quality is consistently higher for DFWA than for the GMM-based methods
- Equal (if not slightly more) success for DFWA in capturing the identity
- No compromise between quality & capturing identity for DFWA!

Conversion Examples
Source, target, GMM, DFWA, and DFWE samples for slt → clb (female-to-female) and bdl → clb (male-to-female); target analysis-synthesis with the converted spectral envelopes.
- DFWA is consistently the highest quality; the GMM-based methods suffer a "loss of presence"
- Key observation: DFWA can deliver high-quality VC!

VC Perspectives
- Conversion within gender gives better results (e.g.
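A classic linear f0 transformation of the kind mentioned above can be sketched as follows. This is an assumption about its exact form (matching the source f0 mean and standard deviation to the target's, in linear Hz; some systems apply the same transform to log-f0 instead):

```python
def convert_f0(f0_src, mu_src, std_src, mu_tgt, std_tgt):
    """Linear f0 transformation: shift & scale the source f0 statistics
    onto the target's. Returns the converted f0 and the TD-PSOLA
    pitch modification factor (converted / source)."""
    f0_hat = mu_tgt + (std_tgt / std_src) * (f0_src - mu_src)
    return f0_hat, f0_hat / f0_src
```

For example, converting a male source (mean f0 around 100 Hz) toward a female target (mean around 200 Hz) yields modification factors near 2, which is exactly the "large pitch modification factor" regime that degrades TD-PSOLA quality.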
female-to-female), since the speakers share similar voice characteristics
- For inter-gender conversion, the spectral envelope is not enough: large pitch modification factors degrade the speech, and prosody is important
- Need to address the phase spectrum & glottal source too...

IV. Summary & Conclusions

Summary
- Text-to-Speech (TTS): 1. concatenative (unit-selection); 2. parametric (HMM-based speaker models)
- Acoustic parameters in speech: 1. segmental (pitch, spectral envelope); 2. supra-segmental (prosody, intonation)
- Voice Conversion: modify speech to transform the speaker identity -- can create new voices! Focus: the spectral envelope, related to timbre

Summary (continued)
- Spectral envelope transformation: 1. spectral envelope parameterization (cepstral, LPC, ...); 2. level of source-target acoustic mappings; 3. method for learning & transformation
- Standard VC approach: 1. parallel corpora & source-target frame alignment; 2. GMM-based transformation
- Proposed VC approach: source-target acoustic mapping on the class level; Dynamic Frequency Warping + Amplitude Scaling (DFWA)

Important Considerations
1. Spectral envelope transformation: speaker identity requires capturing the average speaker timbre; speech quality requires maintaining the spectral details.
2. Source-target acoustic mappings: contextual information (phoneme) is helpful, and mapping features on the more global class level is effective:
better conversion quality & more flexibility (no parallel corpora), whereas frame-level alignment/mappings are too narrow & restrictive.
3. Speaker characteristics (prosody, voice quality): the success of voice conversion can depend on the speakers; there are many features of speech to consider & treat...

Thank You! Questions?