EE E6820: Speech & Audio Processing & Recognition
Lecture 5: Speech modeling

Dan Ellis <dpwe@ee.columbia.edu>
Michael Mandel <mim@ee.columbia.edu>
Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/~dpwe/e6820
February 19, 2009

Outline
1 Modeling speech signals
2 Spectral and cepstral models
3 Linear predictive models (LPC)
4 Other signal models
5 Speech synthesis

1 Modeling speech signals

The speech signal

[Figure: waveform and spectrogram of an example utterance with aligned phone and word labels, including "watch", "thin", "dime"]

Elements of the speech signal:
- spectral resonances (formants, moving)
- periodic excitation (voicing, pitched) + pitch contour
- noise excitation
- transients (stop-release bursts)
- amplitude modulation (nasals, approximants)
- timing!

The source-filter model

Notional separation of:
- source: excitation, fine time-frequency structure
- filter: resonance, broad spectral structure

[Figure: a glottal pulse train (controlled by pitch and a voiced/unvoiced switch) and frication noise form the source; vocal tract resonances (formants) plus a radiation characteristic form the filter; source through filter gives speech]

More a modeling approach than a single model.

Signal modeling

Signal models are a kind of representation:
- to make some aspect explicit
- for efficiency
- for flexibility

The nature of the model depends on the goal:
- classification: remove irrelevant details
- coding/transmission: remove perceptual irrelevance
- modification: isolate control parameters

But commonalities emerge:
- perceptually irrelevant detail (coding) will also be irrelevant for classification
- the modification domain will usually reflect 'independent' perceptual attributes
- all are getting at the abstract information in the signal

Different influences for signal models

Receiver:
- see how the signal is treated by listeners → cochlea-style filterbank models . . .

Transmitter (source):
- the physical vocal apparatus can generate only a limited range of signals . . . → LPC models of vocal tract resonances

Making particular aspects explicit:
- compact, separable correlates of resonances → cepstrum
- modeling prominent features of the narrowband spectrogram → sinusoid models
- addressing unnaturalness in synthesis → harmonics + noise model

Application of (speech) signal models

Classification / matching. Goal: highlight important information
- speech recognition (lexical content)
- speaker recognition (identity or class)
- other signal classification
- content-based retrieval

Coding / transmission / storage. Goal: represent just enough information
- real-time transmission, e.g. mobile phones
- archive storage, e.g. voicemail

Modification / synthesis. Goal: change certain parts independently
- speech synthesis / text-to-speech (change the words)
- speech transformation / disguise (change the speaker)
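Before turning to specific models, the source-filter separation can be made concrete in a few lines of code. The sketch below is illustrative only, not from the lecture: a pulse-train 'glottal' source is passed through a cascade of two-pole resonators standing in for vocal tract formants, and the sample rate, pitch, and formant frequencies/bandwidths are all assumed values.

```python
# Minimal source-filter sketch: pulse-train source -> resonant (formant) filter.
# All numeric parameters are illustrative assumptions, not from the slides.
import numpy as np
from scipy.signal import lfilter

sr = 8000                                    # sample rate / Hz
f0 = 110                                     # pitch / Hz
n = np.arange(int(0.5 * sr))                 # 0.5 s of samples

# Source: impulse train at the pitch period (crude glottal excitation)
source = np.zeros(len(n))
source[::sr // f0] = 1.0

# Filter: cascade of two-pole resonators at assumed formant positions
speech = source
for freq, bw in [(700, 130), (1220, 70)]:    # (center, bandwidth) / Hz
    r = np.exp(-np.pi * bw / sr)             # pole radius from bandwidth
    theta = 2 * np.pi * freq / sr            # pole angle from center frequency
    a = [1, -2 * r * np.cos(theta), r ** 2]  # conjugate pole pair in A(z)
    speech = lfilter([1.0], a, speech)
```

Replacing the pulse train with white noise gives the unvoiced branch of the block diagram: the same filter, a different source.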
2 Spectral and cepstral models

Spectral and cepstral models

The spectrogram seems like a good representation:
- long history
- satisfying in use
- experts can 'read' the speech

What is the information?
- intensity in time-frequency cells
- typically 5 ms × 200 Hz × 50 dB

Discarded detail:
- phase
- fine-scale timing

The starting point for other representations.

Short-time Fourier transform (STFT) as filterbank

View the spectrogram rows as coming from separate bandpass filters. Mathematically,

    X[k, n_0] = \sum_n x[n] \, w[n - n_0] \exp\left(-j \frac{2\pi k (n - n_0)}{N}\right) = \sum_n x[n] \, h_k[n_0 - n]

where

    h_k[n] = w[-n] \exp\left(j \frac{2\pi k n}{N}\right)

i.e. each row k is the output of a bandpass filter whose frequency response is the window spectrum shifted up to center frequency 2\pi k / N:

    H_k(e^{j\omega}) = W(e^{j(\omega - 2\pi k/N)})

Spectral models: which bandpass filters?

Constant bandwidth? (analog / FFT)
But cochlea physiology & critical bandwidths
→ implement ear models with bandpass filters & choose bandwidths by e.g. critical-band estimates

Auditory frequency scales:
- constant 'Q' (center frequency / bandwidth), mel, Bark, . . .

Gammatone filterbank

Given the bandwidths, which filter shapes?
- match the inferred temporal integration window
- match the inferred spectral shape (sharp high-frequency slope)
- keep it simple (since it's only approximate)

→ Gammatone filters:

    h[n] = n^{N-1} e^{-bn} \cos(\omega_i n)

- 2N poles, 2 zeros, low complexity
- reasonable linear match to the cochlea

[Figure: gammatone impulse response, z-plane pole-zero diagram (double poles), and magnitude responses in dB for center frequencies from 50 to 5000 Hz]

Constant-BW vs. cochlea model

[Figure: frequency responses and spectrograms compared for an effective FFT filterbank (FFT-based wideband spectrogram, N = 128) and a Q = 4, 4-pole 2-zero cochlea-model (gammatone) filterbank downsampled @ 64; magnitude smoothed over a 5-20 ms time window]

Limitations of spectral models

Not much data is thrown away:
- just fine phase / time structure (smoothing)
- little actual 'modeling'
- still a large representation

Little separation of features:
- e.g. formants and pitch

Highly correlated features:
- modifications affect multiple parameters

But quite easy to reconstruct:
- iterative reconstruction of the lost phase
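As a concrete sketch of such an auditory filterbank (an illustration, not the lecture's implementation): the gammatone impulse response above can be sampled directly and used as an FIR filter per channel. The bandwidth rule below is an assumption brought in for the example, the Glasberg & Moore ERB formula ERB(f) = 24.7 (1 + 4.37 f / 1000) Hz with the conventional 1.019 scaling.

```python
# Sketch: gammatone filterbank by direct FIR sampling of h(t).
# ERB bandwidths, filter length, and channel spacing are assumptions.
import numpy as np

def gammatone_ir(fc, sr, dur=0.05, order=4):
    """Sample h(t) = t^(N-1) e^(-bt) cos(2 pi fc t) for one channel."""
    t = np.arange(int(dur * sr)) / sr
    erb = 24.7 * (1 + 4.37 * fc / 1000)       # Glasberg & Moore ERB / Hz
    b = 2 * np.pi * 1.019 * erb               # decay rate / rad s^-1
    h = t ** (order - 1) * np.exp(-b * t) * np.cos(2 * np.pi * fc * t)
    return h / np.max(np.abs(h))              # crude per-channel normalization

def gammatone_bank(x, sr, fcs):
    """One FIR gammatone per center frequency; output rows = channels."""
    return np.stack([np.convolve(x, gammatone_ir(fc, sr), mode='same')
                     for fc in fcs])

# Usage: 16 channels, log-spaced 100-4000 Hz (constant-Q-like spacing)
sr = 8000
x = np.random.randn(sr)                       # 1 s of test noise
y = gammatone_bank(x, sr, np.geomspace(100, 4000, 16))
```

Smoothing the channel magnitudes over 5-20 ms, as in the comparison figure, then gives the cochlea-model spectrogram.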
The cepstrum

Original motivation: assume a source-filter model, an excitation source g[n] driving a resonance filter H(e^{jω}).

'Homomorphic deconvolution': the source-filter convolution maps to a sum that can be separated:

    g[n] * h[n]  --FT-->  G(e^{j\omega}) H(e^{j\omega})  --log-->  \log G(e^{j\omega}) + \log H(e^{j\omega})  --IFT-->  c_g[n] + c_h[n]

separating fine structure from resonances = deconvolution.

Definition (real cepstrum):

    c_n = \mathrm{idft}(\log |\mathrm{dft}(x[n])|)

Stages in cepstral deconvolution

- The original waveform has excitation fine structure convolved with resonances
- The DFT shows harmonics modulated by resonances
- The log DFT is the sum of a harmonic 'comb' and resonant bumps
- The IDFT separates the resonant bumps (low quefrency) from the regular fine structure (the 'pitch pulse')
- Selecting the low-n cepstrum keeps just the resonance information (deconvolution / 'liftering')

[Figure: waveform and minimum-phase IR; |dft| and its liftered version; log|dft| and its liftered version; real cepstrum with lifter window and the pitch pulse at higher quefrency]

Properties of the cepstrum

Separates source (fine structure) from filter (broad structure):
- smooth the log magnitude spectrum to get the resonances

Smoothing the spectrum is filtering along frequency:
- i.e. convolution applied in the Fourier domain → multiplication in the IFT domain ('liftering')

Periodicity in time → harmonics in the spectrum → 'pitch pulse' at high n in the cepstrum.

The low-n cepstral coefficients are the DCT of the broad filter / resonance shape:

    c_n = \frac{1}{2\pi} \int \log X(e^{j\omega}) \, (\cos n\omega + j \sin n\omega) \, d\omega

[Figure: cepstral coefficients 0..5 and the corresponding 5th-order cepstral reconstruction of the spectral envelope over 0-7 kHz]

Aside: correlation of elements

The cepstrum is popular in speech recognition:
- feature vector elements are decorrelated

[Figure: auditory spectrum features vs. cepstral coefficients, their covariance matrices, and an example joint distribution of elements (10, 15)]

- c_0 'normalizes out' the average log energy
- decorrelated pdfs fit diagonal Gaussians; simple correlation is a waste of parameters
- the DCT is close to the PCA for (mel) spectra?
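Cepstral smoothing is short enough to show inline. A minimal sketch, assuming a Hann-windowed frame and an arbitrary lifter cutoff of 30 quefrency bins:

```python
# Sketch: real cepstrum and low-quefrency liftering of one speech frame.
import numpy as np

def liftered_envelope(frame, n_lifter=30):
    """Return (log magnitude spectrum, cepstrally smoothed version)."""
    windowed = frame * np.hanning(len(frame))
    log_mag = np.log(np.abs(np.fft.fft(windowed)) + 1e-10)  # avoid log(0)
    cep = np.fft.ifft(log_mag).real                         # real cepstrum
    lifter = np.zeros(len(cep))
    lifter[:n_lifter] = 1.0                                 # keep low quefrency
    lifter[-(n_lifter - 1):] = 1.0                          # (and its mirror)
    smoothed = np.fft.fft(cep * lifter).real                # back to log spectrum
    return log_mag, smoothed
```

The smoothed curve follows the formant envelope; the harmonic comb survives only in the discarded high-quefrency region, around the 'pitch pulse'.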
3 Linear predictive models (LPC)

Linear predictive modeling (LPC)

LPC is a very successful speech model:
- it is mathematically efficient (IIR filters)
- it is remarkably accurate for voice (fits the source-filter distinction)
- it has a satisfying physical interpretation (resonances)

Basic math: model the output as a linear function of prior outputs,

    s[n] = \sum_{k=1}^{p} a_k s[n-k] + e[n]

- . . . hence "linear prediction" (p-th order)
- e[n] is the excitation (input), AKA the prediction error

    ⇒ \frac{S(z)}{E(z)} = \frac{1}{A(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}}

. . . all-pole modeling, the 'autoregressive' (AR) model.

Vocal tract motivation for LPC

Direct expression of the source-filter model:

    s[n] = \sum_{k=1}^{p} a_k s[n-k] + e[n]

[Figure: pulse/noise excitation e[n] driving the vocal tract filter H(z) = 1/A(z) to give s[n], with |H(e^{jω})| and the z-plane pole positions]

- Acoustic tube models suggest an all-pole model for the vocal tract
- Relatively slowly changing: update A(z) every 10-20 ms
- Not perfect: nasals introduce zeros

Estimating LPC parameters

Minimize the short-time squared prediction error

    E = \sum_{n=1}^{m} e^2[n] = \sum_n \left( s[n] - \sum_{k=1}^{p} a_k s[n-k] \right)^2

Differentiate w.r.t. a_k to get one equation for each k:

    0 = \sum_n 2 \left( s[n] - \sum_{j=1}^{p} a_j s[n-j] \right) (-s[n-k])

    \sum_n s[n] s[n-k] = \sum_j a_j \sum_n s[n-j] s[n-k]

    \phi(0, k) = \sum_j a_j \phi(j, k)

where \phi(j, k) = \sum_{n=1}^{m} s[n-j] s[n-k] are correlation coefficients
- p linear equations to solve for all the a_j s . . .

Evaluating the parameters

Linear equations \phi(0, k) = \sum_{j=1}^{p} a_j \phi(j, k).

If s[n] is assumed to be zero outside some window,

    \phi(j, k) = \sum_n s[n-j] s[n-k] = r_{ss}(|j - k|)

- r_{ss}(\tau) is the autocorrelation

Hence the equations become

    \begin{bmatrix} r(1) \\ r(2) \\ \vdots \\ r(p) \end{bmatrix}
    =
    \begin{bmatrix}
    r(0)   & r(1)   & \cdots & r(p-1) \\
    r(1)   & r(0)   & \cdots & r(p-2) \\
    \vdots & \vdots & \ddots & \vdots \\
    r(p-1) & r(p-2) & \cdots & r(0)
    \end{bmatrix}
    \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix}

- Toeplitz matrix (equal antidiagonals) → can use the Durbin recursion to solve
- (solve the full \phi(j, k) system via Cholesky)

LPC illustration

[Figure: windowed original waveform and LPC residual over 400 samples; original spectrum, LPC spectrum, and (nearly flat) residual spectrum over 0-7 kHz; the actual poles in the z-plane]

Interpreting LPC

Picking out resonances:
- if the signal really was a source plus all-pole resonances, LPC should find the resonances

Least-squares fit to the spectrum:
- minimizing e^2[n] in the time domain is the same as minimizing |E(e^{j\omega})|^2, by Parseval
- → close fit at spectral peaks; the valleys don't matter

Removing smooth variation in the spectrum:
- 1/A(z) is a low-order approximation to S(z)
- hence the residual E(z) = A(z) S(z) is a 'flat' version of S

Signal whitening:
- white noise (independent x[n]s) has a flat spectrum
- → whitening removes temporal correlation

Alternative LPC representations

Many alternative p-dimensional representations:
- coefficients {a_j}
- roots {\lambda_j}: \prod_j (1 - \lambda_j z^{-1}) = 1 - \sum_j a_j z^{-j}
- line spectrum frequencies . . .
- reflection coefficients {k_j} from the lattice form
- tube-model log area ratios g_j = \log \frac{1 - k_j}{1 + k_j}

Choice depends on:
- mathematical convenience / complexity
- quantization sensitivity
- ease of guaranteeing stability
- what is made explicit
- distributions as statistics

LPC applications

Analysis-synthesis (coding, transmission):
- S(z) = E(z)/A(z), hence can reconstruct by filtering e[n] with the {a_j}s
- the whitened, decorrelated, minimized e[n]s are easy to quantize
- . . . or can model e[n], e.g. as a simple pulse train

Recognition / classification:
- the LPC fit responds to spectral peaks (formants)
- can use for recognition (convert to cepstra?)

Modification:
- separating source and filter supports cross-synthesis
- the pole / resonance model supports 'warping', e.g. male → female

Aside: formant tracking

Formants carry (most?) of the linguistic information. Why not classify them → speech recognition? e.g.
- local maxima in the cepstrally liftered spectrum
- pole frequencies in the LPC fit

But recognition needs to work in all circumstances:
- formants can be obscured or undefined

[Figure: spectrograms of the original utterance (mpgr1_sx419) and of a noise-excited LPC resynthesis with the pole frequencies overlaid]

→ need more graceful, robust parameters
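The autocorrelation method plus Durbin recursion fits in a dozen lines. A sketch follows; the order p, the Hamming window, and the sign convention s[n] ≈ Σ a_k s[n-k] match the slides, while everything else is an illustrative choice:

```python
# Sketch: LPC by the autocorrelation method with the Levinson-Durbin recursion.
import numpy as np

def lpc(frame, p=10):
    """Return a_1..a_p with s[n] ~ sum_k a_k s[n-k] (slide convention)."""
    w = frame * np.hamming(len(frame))
    r = np.correlate(w, w, mode='full')[len(w) - 1:]  # autocorrelation r[0..]
    a = np.zeros(p)
    err = r[0]                                        # prediction error power
    for i in range(p):                                # Durbin recursion
        k = (r[i + 1] - a[:i] @ r[i:0:-1]) / err      # reflection coefficient
        a_prev = a.copy()
        a[i] = k
        a[:i] = a_prev[:i] - k * a_prev[:i][::-1]
        err *= 1 - k * k
    return a

# The residual e[n] = s[n] - sum_k a_k s[n-k] comes from filtering with
# A(z) = 1 - sum_k a_k z^-k, e.g.:
#   from scipy.signal import lfilter
#   e = lfilter(np.concatenate(([1.0], -lpc(frame))), [1.0], frame)
```

The reflection coefficients k produced at each step are exactly the lattice-form parameters listed under 'Alternative LPC representations'.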
4 Other signal models

Sinusoid modeling

Early signal models required low complexity, e.g. LPC; advances in hardware open new possibilities . . .

The narrowband spectrogram suggests a harmonics model:

[Figure: narrowband spectrogram, 0-4 kHz over ~1.5 s, showing slowly varying harmonic tracks]

- the 'important' info in the 2D surface is a set of tracks?
- harmonic tracks have ~smooth properties
- straightforward resynthesis

Sine wave models

Model the sound as a sum of AM/FM sinusoids:

    s[n] = \sum_{k=1}^{N[n]} A_k[n] \cos(n \, \omega_k[n] + \phi_k[n])

- A_k, \omega_k, \phi_k piecewise linear or constant
- can enforce harmonicity: \omega_k = k \omega_0

Extract parameters directly from STFT frames:
- find the local maxima of |S[k, n]| along frequency
- track birth/death and correspondence of peaks from frame to frame

[Figure: magnitude peaks on a time-frequency grid linked into tracks]

Finding sinusoid peaks

Look for local maxima along each DFT frame, i.e.

    |S[k-1, n]| < |S[k, n]| > |S[k+1, n]|

Want the exact frequency of the implied sinusoid:
- the DFT is normally quantized quite coarsely, e.g. 4000 Hz / 256 bands = 15.6 Hz/band
- a quadratic fit to the 3 spectral samples around the peak gives interpolated frequency and magnitude
- may also need interpolated, unwrapped phase

Or, use the differential of phase along time (phase vocoder):

    \omega = \frac{a \dot{b} - b \dot{a}}{a^2 + b^2} \quad \text{where} \quad S[k, n] = a + jb

Sinewave modeling applications

Modification (interpolation) and synthesis:
- connecting arbitrary \omega and \phi requires cubic phase interpolation (because \omega = \dot{\phi})

Types of modification:
- time and frequency scale modification, with or without changing the formant envelope
- concatenation / smoothing of boundaries
- phase realignment (for crest reduction)

Non-harmonic signals? OK-ish.

[Figure: spectrogram of a sine-wave-modeled signal with non-harmonic content, 0-4 kHz]
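The quadratic peak refinement is worth seeing concretely. A sketch using the standard parabola-through-three-log-magnitude-points fit; the formulas are generic, nothing here is specific to the lecture's implementation:

```python
# Sketch: local maxima of one DFT frame, refined by quadratic interpolation.
import numpy as np

def sinusoid_peaks(mag, sr, nfft):
    """mag: magnitude of the first nfft//2+1 DFT bins of one frame.
    Returns a list of (frequency_hz, amplitude) for interpolated peaks."""
    logm = np.log(mag + 1e-10)
    peaks = []
    for k in range(1, len(mag) - 1):
        if logm[k - 1] < logm[k] > logm[k + 1]:        # local maximum test
            alpha, beta, gamma = logm[k - 1:k + 2]
            # Vertex offset of the parabola through the three points, in bins
            d = 0.5 * (alpha - gamma) / (alpha - 2 * beta + gamma)
            freq = (k + d) * sr / nfft                 # fractional bin -> Hz
            amp = np.exp(beta - 0.25 * (alpha - gamma) * d)
            peaks.append((freq, amp))
    return peaks
```

Linking these per-frame peaks across time (the birth/death matching above) yields the A_k[n], ω_k[n] tracks of the model.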
Harmonics + noise model (HNM)

Motivation: improve on the sinusoid model because of
- problems with the analysis of real (noisy) signals
- problems with synthesis quality (esp. noise)
- perceptual suspicions

Model:

    s[n] = \underbrace{\sum_{k=1}^{N[n]} A_k[n] \cos(n k \omega_0[n])}_{\text{Harmonics}} + \underbrace{e[n] \, (h_n[n] * b[n])}_{\text{Noise}}

- the sinusoids are forced to be harmonic
- the remainder is filtered and time-shaped noise

A 'break frequency' F_m[n] separates the harmonic part (below) from the noise part (above).

[Figure: spectrum with harmonics below the harmonicity limit F_m[n] and noise above]

HNM analysis and synthesis

Dynamically adjust F_m[n] based on a 'harmonic test':

[Figure: spectrogram with the time-varying F_m[n] contour overlaid]

The noise part has envelopes in time, e[n], and frequency, H_n[k]:

[Figure: noise spectral envelope H_n[k] and temporal envelope e[n] over a ~30 ms window]

- reconstruct bursts / synchronize to pitch pulses

5 Speech synthesis

Speech synthesis

One thing you can do with models.

Synthesis easier than recognition?
- listeners do the work
- . . . but listeners are very critical

Overview of synthesis:

    text → text normalization → phoneme generation → prosody generation → synthesis algorithm → speech

- normalization disambiguates the text (abbreviations)
- phonetic realization from a pronunciation dictionary
- prosody synthesized by rule (timing, pitch contour)
- . . . all control waveform generation

Source-filter synthesis

The flexibility of the source-filter model is ideal for speech synthesis.

[Figure: pitch and voiced/unvoiced information drive a glottal pulse source and a noise source; phoneme information (th ax k ae t) drives the vocal tract filter; the filtered source gives speech]

Excitation source issues:
- voiced / unvoiced / mixture ([th] etc.)
- pitch cycles of voiced segments
- glottal pulse shape → voice quality?

Vocal tract modeling

Simplest idea: store a single vocal tract model for each phoneme.

[Figure: stepwise frequency-vs-time spectral templates for th ax k ae t]

- but the discontinuities are very unnatural

Improve by smoothing between templates:

[Figure: smoothly interpolated spectral trajectories for th ax k ae t]

- the trick is finding the right domain

Cepstrum-based synthesis

The low-n cepstrum is a compact model of the target spectrum.

Can invert to get actual vocal tract IR waveforms:

    c_n = \mathrm{idft}(\log |\mathrm{dft}(x[n])|) \; ⇒ \; h[n] = \mathrm{idft}(\exp(\mathrm{dft}(c_n)))

All-zero (FIR) vocal tract response → can pre-convolve with glottal pulses.

[Figure: glottal pulse inventory for ee, ae, ah, placed at pitch pulse times taken from the pitch contour]

- cross-fading between templates is OK
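A sketch of that inversion, following the formula above. This is a zero-phase variant with illustrative frame size and lifter cutoff; a production system would use the minimum-phase construction instead:

```python
# Sketch: low-n cepstrum -> smooth spectral envelope -> FIR 'vocal tract' IR.
import numpy as np

def cepstrum_to_ir(frame, n_keep=20):
    """Invert a liftered cepstrum to a zero-phase FIR impulse response."""
    spec = np.fft.fft(frame * np.hanning(len(frame)))
    cep = np.fft.ifft(np.log(np.abs(spec) + 1e-10)).real
    cep[n_keep:-(n_keep - 1)] = 0.0            # lifter: keep low quefrency only
    env = np.exp(np.fft.fft(cep).real)         # smooth magnitude envelope
    h = np.fft.ifft(env).real                  # corresponding impulse response
    return np.fft.fftshift(h)                  # center the symmetric IR
```

Convolving h[n] with stored glottal pulses at the pitch-pulse times reproduces the pre-convolved inventory idea on the slide.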
LPC-based synthesis

Very compact representation of the target spectra:
- 3 or 4 pole pairs per template
- low-order IIR filter → very efficient synthesis

How to interpolate?
- cannot just interpolate the a_j in a running filter
- but the lattice filter has better-behaved interpolation

[Figure: direct-form IIR synthesis filter with coefficients a_1, a_2, a_3 vs. the lattice form with reflection coefficients k_0 . . . k_{p-1}]

What to use for excitation?
- the residual from the original analysis
- a reconstructed periodic pulse train
- parametrized residual resynthesis

Diphone synthesis

Problems in phone-concatenation synthesis:
- phonemes are context-dependent
- coarticulation is complex
- transitions are critical to perception

→ store transitions instead of just phonemes

[Figure: the example utterance ("watch", "thin", "dime") segmented into phones vs. into diphone segments]

- ~40 phones ⇒ ~800 diphones
- or even more context if you have a larger database

How to splice diphones together?
- TD-PSOLA: align pitch pulses and cross-fade
- MBROLA: normalized multiband

HNM synthesis

High-quality resynthesis of real diphone units, plus a parametric representation for modification:
- pitch and timing modifications
- removal of discontinuities at boundaries

Synthesis procedure:
- linguistic processing gives phones, pitch, timing
- database search gives the best-matching units
- use HNM to fine-tune pitch and timing
- cross-fade the A_k and \omega_0 parameters at boundaries

Careful preparation of the database is key:
- sine models allow phase alignment of all units
- a larger database improves unit match

Generating prosody

The real factor limiting speech synthesis?

Waveform synthesizers have inputs for:
- intensity (stress)
- duration (phrasing)
- fundamental frequency (pitch)

Curves are produced by superposition of (many) inferred linguistic rules:
- phrase-final lengthening, unstressed shortening, . . .

Or learn the rules from transcribed elements.

Summary

Range of models:
- spectral, cepstral
- LPC, sinusoid, HNM

Range of applications:
- general spectral shape (filterbank) → ASR
- precise description (LPC + residual) → coding
- pitch and time modification (HNM) → synthesis

Issues:
- performance vs. computational complexity
- generality vs. accuracy
- representation size vs. quality

Parting thought: not all parameters are created equal . . .