ENEE408G Spring 2004 Lecture-2
Digital Speech Processing and Coding
Fall’05
Instructor: Carol Espy-Wilson
Electrical & Computer Engineering, University of Maryland, College Park
http://www.ece.umd.edu/class/enee408g/
http://umd.blackbloard.com/
minwu@umd.edu
ENEE408G Capstone -- Multimedia Signal Processing (F'05)
UMCP ENEE408G Slides (created by M.Wu & R.Liu © 2002)

Last Lecture
- Course overview and logistics
- Bring multimedia to digital world: sampling & quantization
- Introduction to speech processing
  – Different aspects of speech
- Friday Lab Session
  – Speech Processing, Coding, Recognition, & HCI
Today: speech processing, coding, synthesis
ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [2]

Speech Production

Source-Filter View of Speech Production (Stevens 1999)
[Figure: source spectrum, vocal tract transfer function, and radiation characteristics combine to give the power spectrum of the speech signal]

“Sprouted grains and seeds are used in salads and dishes such as chop suey”
[Figure: spectrogram of the full utterance and an expanded view of its first 0.5 sec; axes: Time (sec), Frequency; F2 marked; segments labeled fricative, stop, glide, vowel, consonant]
UMCP ENEE408G Slides (created by Carol Espy-Wilson © 2004)

Phonetic Features (Chomsky & Halle, 1968)
There are three kinds of phonetic features:
  – Source features determine the kind of excitation signal
  – Manner of articulation features determine how open or closed the vocal tract is
  – Place of articulation features determine the location of the primary constriction

Source feature “voiced”
[Figure: +voiced /z/ vs. -voiced /s/]

Source Feature voiced
[Figure: spectrogram of “Sprouted”; axes: Time (sec), Frequency. Turbulence marks the -voiced region; vertical striations mark the +voiced region]

Glottal Source (Klatt & Klatt 1990)

Voice Quality-APP Detector
[Figure: Modal Voice, Breathy Voice, Creaky Voice]

Manner feature “sonorant”
- +sonorant (vowel): primary source at the glottis
- -sonorant (/z/): primary source above the glottis, at the alveolar ridge

Manner Feature sonorant
[Figure: spectrogram of “Sprouted”; axes: Time (sec), Frequency. High frequency energy marks the -sonorant region; low frequency energy marks the +sonorant region]

Place feature for stop consonants
[Figure: +labial /p/ vs. +alveolar /t/]

Place Feature: Labial vs. Alveolar
[Figure: spectrum, dB vs. Frequency (Hz); falling spectral prominence for labial /b/]

Place Feature: Labial vs. Alveolar
[Figure: spectra, dB vs. Frequency (Hz); falling spectral prominence for labial /p/, rising spectral prominence for alveolar /t/]

Source-Filter Theory
UMCP ENEE408G Slides (created by M.Wu © 2003)
Figure 1 of SPM May’98 Speech Survey
- First “speaking machine” in 1930s NY World’s Fair
  – 14 keys, 1 wristband, 1 pedal
- Modeling speech production as a linear system
  – Sound sources: either voiced or unvoiced
  – Voiced sound: modeled by a generator of pulses
  – Unvoiced sound: modeled by a white noise generator
  – Articulation: modeled by a cascade of single-resonance (pole) digital filters

Linear Separable Model for Speech Production
UMCP ENEE408G Slides (created by R. Liu & M.Wu © 2002)
- Vocal tract is modeled as a linear time-varying system
  – Parameters of the linear system are slowly varying
  – Excited by a time-varying source (voiced or unvoiced)
- Practical models
  – Model each speech frame as Linear Time-Invariant
  – Excited by either a voiced or an unvoiced source
  – Allow overlaps between neighbouring frames
Figure 3.2 of Furui’s book

Speech Coding

Statistical Properties of Speech
[Figures from Digital Speech Processing by Rabiner and Schafer, shown over three slides; one compares lowpass filtered (0-3400 Hz) and bandpass filtered (200-3400 Hz) speech]

Digital Coding of Speech
- Information capacity: $I = B f_s$ (bits per sample times sampling rate)
[Figure: bit-rate scale in kbps with marks at 200, 64, 16, 9.6, 7.2, 4.8, and 0.05, running from waveform coding to source coding; quality bands labeled broadcast quality, toll quality, communications quality, and synthetic quality]
- Waveform coders: quantize speech samples directly at high bit rates
- Source coders (vocoders): use knowledge of speech production to parameterize the signal (model based)
- Hybrid coders: partly waveform based and partly model based (2.4-16 kbps)

PCM coding
[Diagram: input signal I(x,y) → digitize/capture device (Sampler → Quantizer → Encoder) → transmit]
- How to encode a signal into bits?
  – Sample, then perform uniform quantization (2 parameters: $\Delta$, the equal quantization step size, and $B$, the # of bits)
- “Pulse Code Modulation” (PCM)
  – 8 bits per sample ~ good for speech
  – 16 bits ~ needed for high-quality music
- Tradeoff between fidelity and file size
- How to “squeeze” out redundancy?

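A minimal sketch of the uniform quantization step of PCM may help make the two parameters concrete. It assumes NumPy, a mid-rise quantizer design, and a full-scale uniform random test signal; the function names `uniform_quantize` and `snr_db` are hypothetical and not part of the course material.

```python
import numpy as np

def uniform_quantize(x, x_max, n_bits):
    """Uniform (mid-rise) quantizer: 2**n_bits levels spanning [-x_max, x_max)."""
    step = 2.0 * x_max / (2 ** n_bits)           # step size = peak-to-peak range / number of levels
    half = 2 ** (n_bits - 1)
    idx = np.floor(np.asarray(x, dtype=float) / step)
    idx = np.clip(idx, -half, half - 1)          # saturate samples outside the range
    return (idx + 0.5) * step                    # reconstruct at the midpoint of each cell

def snr_db(x, x_hat):
    """Signal-to-quantization-noise ratio in dB."""
    x = np.asarray(x, dtype=float)
    noise = x - x_hat
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum(noise ** 2))

# Assumed test signal: full-scale uniform noise standing in for a sampled waveform.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 100_000)
for n_bits in (8, 12, 16):
    print(n_bits, "bits:", round(snr_db(x, uniform_quantize(x, 1.0, n_bits)), 1), "dB")
```

For this full-scale test signal the measured SNR should grow by roughly 6 dB for every extra bit, which is the fidelity versus size tradeoff mentioned on the PCM slide. Real speech rarely fills the full range, so its SNR at a given number of bits would be lower.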
Discussion on Improving PCM (1)
- 2 parameters: step size $\Delta$, # of bits $B$
- Peak-to-peak range is $2X_{\max}$, so $\Delta = \dfrac{2X_{\max}}{2^B}$
- Assume $e[n] = x[n] - \hat{x}[n]$
  – where $e[n]$ is uncorrelated with $x[n]$ and is uniformly distributed, $p_e[e] = 1/\Delta$ over an interval of width $\Delta$

$$\sigma_e^2 = \frac{\Delta^2}{12}, \qquad \mathrm{SNR} = \frac{\sigma_x^2}{\sigma_e^2} = \frac{3 \cdot 2^{2B}}{[X_{\max}/\sigma_x]^2}$$

$$\mathrm{SNR(dB)} = 6B + 4.77 - 20\log_{10}\left[\frac{X_{\max}}{\sigma_x}\right]$$

Uniform quantization
[Figure from Digital Speech Processing by Rabiner and Schafer]

Discussion on Improving PCM (1)
- Uniform quantization may give an inconsistent range of relative errors
  – E.g., an error of +/- 2 is 20% at amplitude 10 but only 2% at amplitude 100
- Non-uniform quantization
  – Assign a smaller quantization step size at small amplitudes to maintain a consistent range of relative quantization errors over the entire dynamic range
  – Can apply a non-linear transform before uniform quantization via “companding” (compression-expansion)
  – μ-law companding: international standard for 64kbps speech

Discussion on Improving PCM (1)

$$y[n] = \ln|x[n]|, \qquad x[n] = e^{y[n]}\,\mathrm{sign}(x[n]), \qquad \mathrm{sign}(x[n]) = \begin{cases} +1, & x[n] \ge 0 \\ -1, & x[n] < 0 \end{cases}$$

$$\hat{y}[n] = \ln|x[n]| + \varepsilon[n]$$

$$\hat{x}[n] = e^{\hat{y}[n]}\,\mathrm{sign}(x[n]) = |x[n]|\,\mathrm{sign}(x[n])\,e^{\varepsilon[n]} = x[n]\,e^{\varepsilon[n]}$$

$$\hat{x}[n] \approx x[n](1 + \varepsilon[n]) = x[n] + x[n]\varepsilon[n]$$

$$\hat{x}[n] = x[n] + e[n], \quad \text{where here } e[n] = x[n]\varepsilon[n]$$

$$\mathrm{SNR} = \frac{\sigma_x^2}{\sigma_e^2} = \frac{1}{\sigma_\varepsilon^2}$$

The SNR of the logarithmic quantizer therefore does not depend on the signal level $\sigma_x$.

Discussion on Improving PCM (1)
- But $\ln[0] \to -\infty$: not practical
- Use μ-law compression instead:

$$y[n] = X_{\max}\,\frac{\log\left[1 + \mu\,\dfrac{|x[n]|}{X_{\max}}\right]}{\log[1 + \mu]}\,\mathrm{sign}(x[n])$$

[Figure from Digital Speech Processing by Rabiner and Schafer]

Discussion on Improving PCM (1): Log Companding
[Figure from Digital Speech Processing by Rabiner and Schafer]

Discussion on Improving PCM (2)
- Quantized PCM values may not be equally likely
  – Can we do better than encoding each value using the same # of bits?
- Example
  – P(“0”) = 0.5, P(“1”) = 0.25, P(“2”) = 0.125, P(“3”) = 0.125
  – If we use the same # of bits for all values: need 2 bits to represent the four possibilities
  – If we use fewer bits for likely values such as “0” ~ Variable Length Codes (VLC):
    “0” => [0], “1” => [10], “2” => [110], “3” => [111]
    Use 0.5(1) + 0.25(2) + 0.125(3) + 0.125(3) = 1.75 bits on average ~ saves 0.25 bit per sample!
- Bring probability into the picture
  – Use the probability distribution to reduce the average # of bits per quantized sample

How to Encode Correlated Sequence?
- Consider: high correlation between successive samples
- Predictive coding
  – Basic principle: remove redundancy between successive samples and only encode the residual between actual and predicted values
  – Residue usually has a much smaller dynamic range
    Allows fewer quantization levels for the same MSE => get compression
  – Compression efficiency depends on intersample redundancy
- First try
[Diagram: Encoder: e(n) = u(n) - u’P(n) with predictor u’P(n) = u(n-1); e(n) → Quantizer → eQ(n). Decoder: uQ(n) = eQ(n) + uP(n) with predictor uP(n) = uQ(n-1)]

Predictive Coding (cont’d)
- Problem with 1st try: the decoder doesn’t know u(n)!
  – Mismatch error could propagate to future reconstructed samples
  – Inputs to the predictor are different at encoder and decoder
- Solution: Differential PCM (DPCM)
  – Use the quantized sequence uQ(n) for prediction at both encoder and decoder
  – Prediction error e(n)
  – Quantized prediction error eQ(n)
  – Distortion d(n) = e(n) - eQ(n)
[Diagram: Encoder: e(n) = u(n) - uP(n) → Quantizer → eQ(n), with uQ(n) = uP(n) + eQ(n) fed back to the predictor uP(n) = uQ(n-1). Decoder: uQ(n) = eQ(n) + uP(n) with the same predictor uP(n) = uQ(n-1)]
- Think: what predictor to use?

Linear Prediction Analysis of Speech
- $\{a_i\}$ are called Linear Prediction Coefficients (LPC)
[Diagram: analysis filter (s[n] in, e[n] out) and synthesis filter (e[n] in, s[n] out), each built from delays $z^{-1}$ and taps $a_1, a_2, \ldots, a_P$]
- Error minimization:

$$\min_{\{a_k\}} E = \sum_n e^2[n] = \sum_n \left(s[n] - \hat{s}[n]\right)^2$$

- Normal equations:

$$\mathbf{S}\hat{\mathbf{a}} = \mathbf{s}$$

- Can be solved using the famous Levinson Recursion, which leads to the lattice formulation of the linear prediction solution

Source-Filter View of Speech Production
[Diagram: e(t) → v(t) → r(t) → s(t), with spectra $E(\omega)$, $V(\omega)$, $R(\omega)$, $S(\omega)$]

$$s(t) = e(t) * v(t) * r(t)$$

$$S(\omega) = E(\omega)\,V(\omega)\,R(\omega)$$

All-Pole Modeling of Speech
- Auto-regressive (AR) model: all-pole filter

$$H(z) = G(z)\,V(z)\,R(z) = \frac{\text{Gain}}{A(z)}, \qquad A(z) = 1 - \sum_{k=1}^{P} a_k z^{-k}$$

  – H(z) is the overall transfer function
  – Glottal flow G(z), vocal tract V(z), radiation R(z), gain
- Synthesis process: u[n] is the vocal tract input, s[n] is the speech output
[Diagram: u[n] → H(z) → s[n]]

All-Pole Model and Linear Prediction

$$S(z) = \frac{U(z)}{A(z)}$$

$$\left(1 - \sum_{k=1}^{P} a_k z^{-k}\right) S(z) = S(z) - \sum_{k=1}^{P} a_k S(z)\, z^{-k} = U(z)$$

$$s[n] = \sum_{k=1}^{P} a_k\, s[n-k] + u[n]$$

- Here $\hat{s}[n] = \sum_{k=1}^{P} a_k\, s[n-k]$ is a linear prediction of order P for s[n]
- $e[n] = s[n] - \hat{s}[n]$ is the prediction error sequence
[Diagram: s[n] → P(z) → $\hat{s}[n]$, subtracted from s[n] to give e[n]]

Model-based Coding
- Linear Prediction Coder (LPC)
  – LPC Vocoder (voice coder)
    Divide speech into frames (several tens of milliseconds) and encode the LPC coefficients of each frame
    Additional parameters to facilitate synthesis: voiced/unvoiced flag, gain, pitch (for voiced)
  – Line Spectrum Pair (LSP) Coding
- Hybrid Coding: LPC Residual Coding
  – Between LPC and waveform coding

Line Spectrum Pair (LSP) Coding
- Pros and cons of the LPC method
  – Good performance at coding rates down to 2.4kbps
  – Synthesized voice becomes unnatural below 2.4kbps
  – When the poles are near the unit circle, quantization of the LPC coefficients may result in instability
- LSP parameters
  – LSP are frequencies extracted from polynomials constructed from the LPC coefficients
  – Frequency domain features (similar to formants) => produce less distortion due to quantization
[See details in Design Project on Speech]

Hybrid Coding
- “Hybrid”: between LPC and waveform coding
  – LPC Residual Coding: encode and slowly update the LPC coefficients, and send the LPC residual (e.g. encoded using Vector Quantization)
- Advantages:
  – Free from quality degradation due to source modeling
  – Low-frequency waveform is exactly reproduced
  – Spectral information of the entire frequency range is preserved
  – No need for pitch period estimation or a voiced/unvoiced decision

Code-Excited Linear Predictive Coding (CELP)
- Multipulse-Excited Linear Predictive Coding (MPC)
  – Does not distinguish voiced/unvoiced sound explicitly
- Code-Excited Linear Predictive Coding (CELP)
  – Replace the multi-pulses of MPC with vector-quantized sequences based on long-term prediction of periodicity and short-term prediction
Figure 6.32 of Furui’s book

Speech Coding Methods
Table 6.1 of Furui’s book
  – Waveform coding; Hybrid coding; Analysis-synthesis coding

Speech Quality vs. Transmission Rate
Figure 6.2 of Furui’s book

Comparison of Different Speech Coding Tech.
Table 6.2 of Furui’s book

Put Together: A Digital Telephone System
- 8kHz and 8 bits per sample for telephone speech => 64kbps
- Anti-aliasing filter before sampling
- Non-uniform quantization (e.g., through μ-law or A-law companding ~ signal compression-expansion)

Speech Synthesis
Figure 7.2 of Furui’s book
- Speech synthesis: a process that artificially produces speech
  – Articulatory synthesis, Formant synthesis, and LPC synthesis
  – Issues other than synthesizer structure: text analysis, etc.

Comparison of Synthesis Methods
Table 7.1 of Furui’s book
UMCP ENEE408G Slides (created by R.Liu © 2002)

Text-to-Speech Conversion System
Figure 7.8 of Furui’s book
=> See more in Design Project and try it out

Analysis/Synthesis
[Demo: naturally spoken utterance vs. synthesized utterance]

Human Computer Interface/Interaction (HCI)
- Multi-modal multimedia communications and interactions
  – Info. & interface through speech/audio, image/video, graphics, etc.
- Building blocks for speech based HCI
  – Speech recognition and speaker identification
  – Natural language understanding
  – (Speech synthesis)
- Examples
  – Voice command, dictation
  – Question-and-Answer: for intelligent customer service, voice-based info. retrieval, call routing, ...
- Enhance speech-based HCI with graphics: “talking head”
=> See more in Design Project and try it out

Summary
- Speech production and analysis
  – Spectrogram; Pitch, Formant
  – Linear prediction model
- Speech coding
  – Basic compression tools
- Speech Synthesis
- This week’s Lab session:
  – Design project#1 on Speech
- Next lecture: speech recognition

Assignments
- “The Past, Present, and Future of Speech Processing”
- “Talk to the Machine”
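
As a recap of the predictive coding slides, the sketch below contrasts DPCM (predicting from the quantized sequence uQ(n-1) at both ends) with the "first try" (predicting from the unquantized u(n-1) at the encoder). It assumes NumPy, a simple rounding quantizer, a zero initial state, and a slow ramp as test input; the function names are hypothetical illustrations rather than code from the lecture or the design project.

```python
import numpy as np

def quantize(v, step):
    """Uniform rounding quantizer with the given step size."""
    return step * np.round(v / step)

def dpcm_encode(u, step):
    """Return quantized prediction errors eQ(n), predicting from uQ(n-1)."""
    e_q = np.zeros(len(u))
    u_q_prev = 0.0                      # assumed initial reconstruction uQ(-1) = 0
    for n in range(len(u)):
        u_p = u_q_prev                  # uP(n) = uQ(n-1)
        e_q[n] = quantize(u[n] - u_p, step)
        u_q_prev = u_p + e_q[n]         # same reconstruction the decoder will form
    return e_q

def dpcm_decode(e_q):
    """Rebuild uQ(n) = uP(n) + eQ(n) with uP(n) = uQ(n-1)."""
    u_q = np.zeros(len(e_q))
    u_q_prev = 0.0
    for n in range(len(e_q)):
        u_q[n] = u_q_prev + e_q[n]
        u_q_prev = u_q[n]
    return u_q

def open_loop_encode(u, step):
    """The 'first try': predict from the unquantized u(n-1), which the decoder never sees."""
    prev = np.concatenate(([0.0], u[:-1]))
    return quantize(u - prev, step)

# Assumed test input: a slow ramp whose per-sample increment is below half a step.
u = 0.02 * np.arange(50)
step = 0.05
err_dpcm = np.max(np.abs(u - dpcm_decode(dpcm_encode(u, step))))
err_open = np.max(np.abs(u - dpcm_decode(open_loop_encode(u, step))))
print("max reconstruction error, DPCM:", err_dpcm, " first try:", err_open)
```

With this input every increment rounds to zero in the first-try encoder, so the decoder output never moves and the mismatch keeps growing, while the DPCM loop keeps the reconstruction error within half a quantization step. The ramp was picked to make the drift obvious; other inputs may show much less of it.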