Speech Coding II
Jan Černocký, Valentina Hubeika {cernocky|ihubeika}@fit.vutbr.cz
DCGM FIT BUT Brno

Agenda
• Tricks used in coding – long-term predictor, analysis by synthesis, perceptual filter.
• GSM full rate – RPE-LTP
• CELP
• GSM enhanced full rate – ACELP

EXCITATION CODING
• LPC reaches a low bit rate, but the quality is very poor ("robotic" sound).
• This is caused by the primitive coding of the excitation (voiced/unvoiced only), whereas in real speech both components (excitation and vocal-tract modification) vary continuously over time.
• In modern (hybrid) coders, a lot of attention is therefore paid to the coding of the excitation.
• The following slides present the main tricks used in excitation coding.

Long-Term Predictor (LTP)
The LPC error (residual) signal behaves like noise, but only over short time intervals. For voiced sounds, over a longer time span the signal is correlated with a period equal to the pitch period:
[Figure: a 160-sample voiced speech segment and its LPC residual plotted against the sample index; both exhibit clear periodicity at the pitch period.]

⇒ long-term predictor (LTP) with the transfer function −b z^{−L}. It predicts the sample s(n) from the past sample s(n − L) (L is the pitch period in samples – the lag). The error signal is

    e(n) = s(n) − ŝ(n) = s(n) − [−b s(n − L)] = s(n) + b s(n − L),   thus   B(z) = 1 + b z^{−L}.

[Block diagram: s(n) → A(z) → B(z) → e(n) → Q → ê(n); decoder: Q⁻¹ → 1/B(z) → 1/A(z) → ŝ(n).]

Analysis by Synthesis
In most coders, the optimal excitation (after removing short-term and long-term correlations) cannot be found analytically. Therefore all admissible excitations have to be generated, the output (decoded) speech signal synthesized and compared to the original input signal. The excitation producing the minimal error wins. The method is called closed-loop. To distinguish the excitation e(n) from the error between the generated output and the input signal, the latter (error signal) is denoted ch(n).
[Block diagram: coded excitation ê(n) with gain G → 1/B(z) → 1/A(z) → ŝ(n); ŝ(n) is subtracted from s(n), the error ch(n) is weighted by W(z), its energy is computed and the minimum is searched.]

Perceptual filter W(z)
In the closed-loop method, the synthesized signal is compared to the input signal. The perceptual filter modifies the error signal according to the properties of human hearing. Compare the error-signal spectrum to the smoothed speech spectrum:
[Figure: left – log spectra of the speech S(f) and of the error CH(f); right – log frequency response of W(z); frequency axis 0–4000 Hz.]

In formant regions, speech is much "stronger" than the error ⇒ the error is masked. In the "valleys" between formants the situation is the opposite: for example in the 1500–2000 Hz band the error is stronger than the speech, which leads to audible artifacts. ⇒ We need a filter that attenuates the error signal in the formant (unimportant) regions and amplifies it in the "sensitive" regions:

    W(z) = A(z) / A(z/γ),   where γ ∈ [0.8, 0.9].                    (1)
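As a concrete illustration of Eq. (1): A(z/γ) is obtained from A(z) simply by scaling the k-th LPC coefficient by γ^k (bandwidth expansion), so W(z) is cheap to construct. The sketch below is my own illustration, not part of the slides – the LPC polynomial, γ = 0.85 and the use of NumPy/SciPy are assumptions.

import numpy as np
from scipy.signal import freqz

def perceptual_weighting(a, gamma=0.85):
    """Return (b, a_w) so that W(z) = A(z)/A(z/gamma) = b(z)/a_w(z).
    a = [1, a1, ..., ap] are the LPC coefficients of A(z)."""
    a = np.asarray(a, dtype=float)
    a_w = a * gamma ** np.arange(len(a))   # k-th coefficient scaled by gamma**k
    return a, a_w

# Illustrative A(z): two resonances ("formants") at 500 Hz and 1500 Hz, Fs = 8 kHz
fs = 8000.0
poles = []
for f0, r in [(500.0, 0.95), (1500.0, 0.90)]:
    p = r * np.exp(2j * np.pi * f0 / fs)
    poles += [p, np.conj(p)]
a = np.real(np.poly(poles))                # coefficients of A(z)

b, a_w = perceptual_weighting(a, gamma=0.85)
w, h = freqz(b, a_w, worN=512)             # frequency response of W(z)
freq_hz = w * fs / (2 * np.pi)
print("W(z) magnitude range: %.1f dB to %.1f dB" %
      (20 * np.log10(np.abs(h).min()), 20 * np.log10(np.abs(h).max())))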
How does it work?
• The numerator A(z) (which is in fact the "inverse LPC filter") has a characteristic inverse to the smoothed speech spectrum.
• The filter 1/A(z/γ) has a characteristic similar to 1/A(z) (the smoothed speech spectrum), but since its poles are pulled toward the origin of the unit circle, its peaks are not as sharp.
• Multiplying these two components gives a transfer function W(f) with the required characteristic.
[Figure: left – log frequency responses of A(z) and 1/A(z/γ); right – the resulting W(z); frequency axis 0–4000 Hz.]

Poles of the transfer functions 1/A(z) and 1/A(z/γ):
[Figure: pole locations of 1/A(z) and 1/A(z/γ) in the z-plane (polar plot); the poles of 1/A(z/γ) are pulled toward the origin.]

Excitation Coding in Shorter Frames
While LP analysis is done over a frame of "standard" length (usually 20 ms – 160 samples at Fs = 8 kHz), the excitation is often coded in shorter sub-frames – typically 40 samples.

ENCODER I: RPE-LTP
[Block diagram of the RPE-LTP encoder: input signal → preprocessing → short-term LPC analysis (reflection coefficients coded as Log-Area Ratios, 36 bits/20 ms) → short-term analysis filter → long-term analysis filter with LTP analysis (LTP parameters, 9 bits/5 ms) → RPE grid selection and coding (RPE parameters, 47 bits/5 ms) → to the radio subsystem; RPE grid decoding and positioning closes the local decoding loop. Signals: (1) short-term residual, (2) long-term residual (40 samples), (3) short-term residual estimate (40 samples), (4) reconstructed short-term residual (40 samples), (5) quantized long-term residual (40 samples).]

• Regular-Pulse Excitation, Long-Term Prediction; GSM full rate, ETSI 06.10.
• Short-term analysis (20 ms frames of 160 samples); the coefficients of the LPC filter are converted to 8 LARs.
• Long-term prediction (LTP, in sub-frames of 5 ms – 40 samples): lag and gain.
• The excitation is coded in sub-frames of 40 samples: the error signal is down-sampled by a factor of 3 (sub-sequences of 14, 13, 13 samples) and only the position of the first impulse – the grid position 0, 1, 2, 3(!) – is quantized.
• The amplitudes of the individual impulses are quantized using APCM.
[Figure: the decimated excitation ê(n) – regularly spaced pulses e(1), e(2), ... on the selected grid.]
• The result is a frame of 260 bits; 260 bits × 50 frames/s = 13 kbit/s.
• For more information see the standard ETSI 06.10, available for download at http://pda.etsi.org

Decoder
[Figure 1.2: Simplified block diagram of the RPE-LTP decoder: RPE parameters (47 bits/5 ms) → RPE grid decoding and positioning; LTP parameters (9 bits/5 ms) → long-term synthesis filter; reflection coefficients coded as Log-Area Ratios (36 bits/20 ms) → short-term synthesis filter → postprocessing → output signal.]

CELP – Codebook-Excited Linear Prediction
The excitation is coded using codebooks. The basic structure with a perceptual filter is:
[Block diagram: excitation codebook → gain G → 1/B(z) → 1/A(z) → ŝ(n); ŝ(n) is subtracted from s(n), the difference is weighted by W(z), its energy is computed and the minimum is searched.]
Each tested signal has to be filtered by W(z) = A(z)/A⋆(z) – too much work!
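To illustrate the closed-loop (analysis-by-synthesis) search – and why filtering every candidate is "too much work" – here is a minimal NumPy/SciPy sketch of my own; the toy A(z), γ, codebook size, random signals and the least-squares per-candidate gain are illustrative assumptions, and the filter memories are ignored.

import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)

N = 40                                  # sub-frame length (5 ms at 8 kHz)
a = np.array([1.0, -0.9, 0.4])          # toy A(z) (illustrative, stable)
gamma = 0.85
a_w = a * gamma ** np.arange(len(a))    # coefficients of A(z/gamma)

codebook = rng.standard_normal((128, N))         # toy excitation codebook
s = lfilter([1.0], a, rng.standard_normal(N))    # toy "input speech" sub-frame

# W(z) = A(z)/A(z/gamma) applied to the input (filter memories ignored here)
sw = lfilter(a, a_w, s)

best_E, best_j, best_g = np.inf, -1, 0.0
for j, u in enumerate(codebook):
    q = lfilter(a, a_w, lfilter([1.0], a, u))   # 1/A(z), then W(z): the costly part
    g = np.dot(sw, q) / np.dot(q, q)            # least-squares gain for this candidate
    E = np.sum((sw - g * q) ** 2)               # weighted error energy
    if E < best_E:
        best_E, best_j, best_g = E, j, g

print(f"best codevector: {best_j}, gain: {best_g:.3f}, weighted error: {best_E:.3f}")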
Perceptual Filter on the Input
Filtering is a linear operation, so W(z) can be moved to both branches: to the input and to the signal after the cascade 1/B(z) – 1/A(z). The cascade 1/A(z) · A(z)/A⋆(z) simplifies to 1/A⋆(z), with A⋆(z) = A(z/γ) – this is the new filter that has to be used after 1/B(z).
[Block diagram: s(n) → W(z) → p(n); excitation codebook → gain G → e(n) → 1/B(z) → x(n) → 1/A⋆(z) → p̂(n); the energy of p(n) − p̂(n) is computed and the minimum is searched.]

Null Input Response Reduction
The filter 1/A⋆(z) responds not only to the excitation, it also has memory (delayed samples). Denote its impulse response h(i):

    p̂(n) = Σ_{i=0}^{n−1} h(i) e(n − i)  +  Σ_{i=n}^{∞} h(i) e(n − i)                 (2)
            [this frame]                    [previous frames]  = p̂0(n)

The response to the previous samples does not depend on the current input – it can be computed only once and then subtracted from the examined signal (the input filtered by W(z)), which results in:

    p̂(n) − p̂0(n) = Σ_{i=0}^{n−1} h(i) e(n − i)                                        (3)

Further, consider the long-term predictor:

    1/B(z) = 1 / (1 − b z^{−M}),                                                       (4)

where M is the optimal lag. e(n) can then be written as

    e(n) = x(n) + b e(n − M),                                                          (5)

so in the filter equation, instead of e(n − i) we get:

    p̂(n) − p̂0(n) = Σ_{i=0}^{n−1} h(i) [x(n − i) + b e(n − M − i)],                     (6)

which we can expand, because filtering is a linear operation:

    p̂(n) − p̂0(n) = Σ_{i=0}^{n−1} h(i) x(n − i) + b Σ_{i=0}^{n−1} h(i) e(n − M − i).     (7)

The second term represents past excitation filtered by 1/A⋆(z) and multiplied by the LTP gain b. The LTP in a CELP coder can therefore be interpreted as another codebook containing past excitation. Here it is treated as an actual codebook with rows e(n − M) (or e^{(M)} in vector notation); in real applications it is just the delayed excitation signal.

CELP Equations

    p̂(n) − p̂0(n) = Σ_{i=0}^{n−1} h(i) e(n − i)
                  = Σ_{i=0}^{n−1} h(i) [g u^{(j)}(n − i) + b e(n − i − M)] = p̂2(n) + p̂1(n).   (8)

The task is to find:
• the gain of the stochastic codebook g,
• the best vector from the stochastic codebook u^{(j)},
• the gain of the adaptive codebook b,
• the best vector from the adaptive codebook e^{(M)}.

Theoretically, all combinations would have to be tested ⇒ no! A suboptimal procedure is used:
• First find p̂1(n) by searching the adaptive codebook. The result is the lag M and the gain b.
• Subtract p̂1(n) from the target p1(n) (the weighted input with p̂0 removed) to get the "error signal of the second generation", which should be provided by the stochastic codebook.
• Find p̂2(n) in the stochastic codebook. The result is the index j and the gain g.
• As the last step, the gains (contributions of both codebooks) are usually jointly re-optimized.

Final Structure of CELP:
[Block diagram: s(n) → W(z), p̂0 is subtracted → target p1; adaptive codebook (rows e(n − M)) with gain b → 1/A⋆(z) → p̂1, minimum search against p1; p2 = p1 − p̂1; stochastic codebook (vectors u(n, j)) with gain g → 1/A⋆(z) → p̂2, minimum search against p2.]
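The null input response reduction used above (the subtraction of p̂0 in the structure) is easy to reproduce numerically. The sketch below is my own illustration – the toy A⋆(z), the random excitation and the use of scipy.signal.lfilter with an explicit filter state are assumptions. It computes p̂0(n) by letting the filter ring with zero input from its memory and verifies that the full response is the sum of p̂0(n) and the response to the current sub-frame alone (Eq. 2).

import numpy as np
from scipy.signal import lfilter

a_star = np.array([1.0, -0.765, 0.289])   # toy A*(z) = A(z/gamma) for a = [1, -0.9, 0.4], gamma = 0.85
N = 40                                     # sub-frame length

rng = np.random.default_rng(1)
e_past = rng.standard_normal(200)          # excitation of the previous frames
e_now  = rng.standard_normal(N)            # excitation of the current sub-frame

# Reference: filter everything in one go and keep the last N samples
ref = lfilter([1.0], a_star, np.concatenate([e_past, e_now]))[-N:]

# 1) zero-input response p0_hat: run the filter over the past to obtain its state,
#    then let it ring with zero input over the current sub-frame
zi0 = np.zeros(len(a_star) - 1)
_, state = lfilter([1.0], a_star, e_past, zi=zi0)
p0_hat, _ = lfilter([1.0], a_star, np.zeros(N), zi=state)

# 2) response to the current excitation only (zero initial state)
p_frame = lfilter([1.0], a_star, e_now)

# The two contributions add up to the full response: p_hat = p_frame + p0_hat
print("max |ref - (p_frame + p0_hat)| =", np.max(np.abs(ref - (p_frame + p0_hat))))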
CELP equations – optional...
Searching the adaptive codebook
We want to minimize the error E1, i.e. (for one sub-frame):

    E1 = Σ_{n=0}^{N−1} (p1(n) − p̂1(n))².                                              (9)

We can rewrite it in a more convenient vector notation and decompose p̂1 into the gain b and the signal q(n), where q(n) is the filtered version of e(n − M):

    q(n) = Σ_{i=0}^{n−1} h(i) e(n − i − M).                                            (10)

When we multiply this signal by the gain b, we get

    p̂1(n) = b q(n).                                                                    (11)

The goal is to find the minimum energy

    E1 = ||p1 − p̂1||² = ||p1 − b q^{(M)}||²,                                           (12)

which we can rewrite using inner products of the involved vectors:

    E1 = <p1 − b q^{(M)}, p1 − b q^{(M)}>
       = <p1, p1> − 2b <p1, q^{(M)}> + b² <q^{(M)}, q^{(M)}>.                           (13)

We begin by finding the optimal gain for each M by taking the derivative with respect to b (the derivative must be zero at the minimum):

    ∂/∂b [ <p1, p1> − 2b <p1, q^{(M)}> + b² <q^{(M)}, q^{(M)}> ] = 0,                   (14)

from which we can directly write the result for lag M:

    b^{(M)} = <p1, q^{(M)}> / <q^{(M)}, q^{(M)}>.                                       (15)

Now we need to determine the optimal lag. Substituting b^{(M)} into Eq. 13, we obtain the minimization of

    min_M [ <p1, p1> − 2 <p1, q^{(M)}>² / <q^{(M)}, q^{(M)}>
            + <p1, q^{(M)}>² <q^{(M)}, q^{(M)}> / <q^{(M)}, q^{(M)}>² ].                (16)

We can drop the first term, as it does not influence the minimization; the remaining two terms combine to −<p1, q^{(M)}>² / <q^{(M)}, q^{(M)}>, so the minimization turns into the maximization of

    M = arg max_M  <p1, q^{(M)}>² / <q^{(M)}, q^{(M)}>.                                 (17)

Searching the stochastic codebook
First, all codebook vectors u^{(j)}(n) must be filtered by 1/A⋆(z), giving q^{(j)}. Then we search for p̂2(n) minimizing the error energy

    E2 = ||p2 − p̂2||² = ||p2 − g q^{(j)}||².

Solution:
• <p2 − g q^{(j)}, q^{(j)}> = 0   ⇒   g = <p2, q^{(j)}> / <q^{(j)}, q^{(j)}>,
• min E2 = min <p2 − g q^{(j)}, p2> = ||p2||² − <p2, q^{(j)}>² / <q^{(j)}, q^{(j)}>,
• j = arg max_j  <p2, q^{(j)}>² / <q^{(j)}, q^{(j)}>.
(A numerical sketch of both codebook searches is given after the EFR overview below.)

Example of a CELP encoder: ACELP – GSM EFR
• Classical CELP with an "intelligent codebook".
• Algebraic Codebook Excited Linear Prediction, GSM Enhanced Full Rate, ETSI 06.60.
[Block diagram of the EFR encoder. Per frame: preprocessing, windowing and autocorrelation R[], Levinson-Durbin (LPC analysis twice per frame) → A(z), LSP quantization and interpolation for the sub-frames (LSP indices), open-loop pitch search twice per frame on the weighted speech. Per sub-frame: compute the impulse response h(n) and the target x(n); adaptive-codebook search (find the best delay and gain, quantize the LTP gain, compute the adaptive-codebook contribution); innovative-codebook search (compute the target for the innovation, find the best innovation, fixed-codebook gain quantization); filter memory update for the next sub-frame. Transmitted indices: LSP indices, pitch index, LTP gain index, code index, fixed-codebook gain index.]

• Again frames of 20 ms (160 samples).
• Short-term predictor: 10 coefficients ai, computed twice per frame, converted to line spectral pairs and jointly quantized using split-matrix quantization (SMQ).
• Excitation: 4 sub-frames of 40 samples (5 ms).
• Lag estimation: first open-loop, then closed-loop refinement around the rough estimate; fractional pitch with a resolution of 1/6 of a sample.
• Stochastic codebook: an algebraic codebook – the codevector can contain only 10 nonzero pulses, each either +1 or −1 ⇒ fast search (fast correlation – only additions, no multiplications), etc.
• 244 bits per frame × 50 frames/s = 12.2 kbit/s.
• For more information see the standard ETSI 06.60, available for download at http://pda.etsi.org
• The decoder is shown in the following figure.
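Returning to the codebook-search equations: the sketch below is my own illustration (the lag range, toy filter, random codebook and random target are assumptions, not taken from the slides) of the adaptive-codebook search of Eqs. (15) and (17), followed by the analogous stochastic-codebook search on the second-generation error.

import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(2)

N = 40                                   # sub-frame length
a_star = np.array([1.0, -0.765, 0.289])  # toy weighted synthesis filter 1/A*(z)
e_past = rng.standard_normal(160)        # past excitation = adaptive codebook memory
p1 = rng.standard_normal(N)              # toy target (weighted input, p0_hat removed)

def filtered_codevector(e):
    """q = codevector filtered by 1/A*(z), zero initial state (Eq. 10)."""
    return lfilter([1.0], a_star, e)

# ----- adaptive codebook: try all integer lags (Eqs. 15 and 17) -----
best = (-np.inf, None, None)
for M in range(20, 144):                                  # illustrative lag range
    e_M = e_past[len(e_past) - M : len(e_past) - M + N]   # past excitation delayed by M
    if len(e_M) < N:                                      # for M < N the vector would have
        continue                                          # to be extended; skipped here
    q = filtered_codevector(e_M)
    crit = np.dot(p1, q) ** 2 / np.dot(q, q)              # Eq. 17
    if crit > best[0]:
        b = np.dot(p1, q) / np.dot(q, q)                  # Eq. 15
        best = (crit, M, b)
_, M_opt, b_opt = best
p1_hat = b_opt * filtered_codevector(e_past[len(e_past) - M_opt : len(e_past) - M_opt + N])

# ----- stochastic codebook: the same criterion on the second-generation error -----
p2 = p1 - p1_hat
codebook = rng.standard_normal((128, N))
Q = np.array([filtered_codevector(u) for u in codebook])
crit = (Q @ p2) ** 2 / np.sum(Q * Q, axis=1)
j_opt = int(np.argmax(crit))
g_opt = np.dot(p2, Q[j_opt]) / np.dot(Q[j_opt], Q[j_opt])

print(f"adaptive CB: lag {M_opt}, gain {b_opt:.3f}; stochastic CB: index {j_opt}, gain {g_opt:.3f}")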
[Block diagram of the EFR decoder. Per frame: the LSP indices are decoded and the LSPs interpolated for the sub-frames, giving Â(z). Per sub-frame: the pitch index addresses the adaptive codebook, the code index the innovative codebook, the gain indices are decoded, the excitation is constructed and passed through the synthesis filter 1/Â(z); the result ŝ(n) goes through the postfilter and postprocessing to the output. A minimal sketch of this signal flow follows the reading list below.]

Additional Reading
• Andreas Spanias (Arizona State University): http://www.eas.asu.edu/~spanias – the section Publications/Tutorial Papers provides a good overview, "Speech Coding: A Tutorial Review"; part of it was printed in the Proceedings of the IEEE, Oct. 1994.
• On the same page, the section Software/Tools/Demo – Matlab Speech Coding Simulations offers software for FS1015, FS1016, RPE-LTP and more. Nice for experimenting.
• The ETSI standards for cell phones are available for free download from http://pda.etsi.org/pda/queryform.asp – as keywords, enter for instance "gsm half rate speech". Many standards come with the source code of the coders in C.
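To close the section, a minimal sketch of the decoder-side signal flow from the diagram above – my own simplification, not the ETSI 06.60 reference code; the Â(z) coefficients, lag, gains and pulse positions are invented illustrative values, and the postfilter is omitted. The excitation is built as the sum of the adaptive- and fixed-codebook contributions and passed through the synthesis filter 1/Â(z).

import numpy as np
from scipy.signal import lfilter

N = 40                                    # sub-frame length
a_hat = np.array([1.0, -0.9, 0.4])        # decoded/interpolated A_hat(z) (toy values)

# "Decoded" parameters for one sub-frame (illustrative values, not from a real bitstream)
M, b = 58, 0.8                            # adaptive codebook: lag and gain
pulse_pos, pulse_sign, g = [4, 13, 27, 36], [+1, -1, +1, -1], 1.5   # a few pulses (EFR uses 10)

exc_mem = np.zeros(200)                   # past excitation (adaptive codebook memory)
synth_state = np.zeros(len(a_hat) - 1)    # synthesis filter memory

# adaptive codebook contribution: past excitation delayed by M
v = exc_mem[len(exc_mem) - M : len(exc_mem) - M + N]

# fixed (innovative) codebook contribution: signed unit pulses
c = np.zeros(N)
c[pulse_pos] = pulse_sign

# construct the excitation and run the synthesis filter 1/A_hat(z), keeping its memory
exc = b * v + g * c
s_hat, synth_state = lfilter([1.0], a_hat, exc, zi=synth_state)

# update the adaptive codebook memory for the next sub-frame
exc_mem = np.concatenate([exc_mem[N:], exc])
print("decoded sub-frame energy:", float(np.dot(s_hat, s_hat)))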