ITU-T G.729
EE8873
Rungsun Munkong
March 22, 2004

Outline
• Introduction to ITU-T G.729
• Overall encoder and decoder
• Key components
  - Decoder
  - Perceptual Weighting
  - LP analysis
  - Adaptive codebook
  - Fixed codebook
• Speech Demo

Introduction
• Proposed 03/96 by ITU: low bit rate, low complexity, toll quality (MOS 3.9), using CS-ACELP (Conjugate-Structure Algebraic-Code-Excited Linear Prediction).
• Input is band-limited speech sampled at 8 kHz, 16-bit PCM (128 kbit/s). Since the output rate is 8 kbit/s, a compression ratio of 16:1 is achieved.
• Short-term synthesis filter is a 10th-order Linear Prediction (LP) filter, updated every 10 ms frame.
• Long-term pitch synthesis filter is implemented using the adaptive codebook.
• Low algorithmic delay (10 ms current frame + 5 ms lookahead).
• Preferred over G.711 (64 kbit/s) and 32 kbit/s ADPCM coders due to superior bandwidth utilization. Used in VoIP, wireless communications, digital satellite systems.

Coding bit distribution
Parameter    Subframe 1   Subframe 2   Frame total
LSP          -            -            18
Pitch        9            5            14
Excitation   17           17           34
Gains        7            7            14
Total: 80 bits per 10 ms frame (8 kbit/s).

Encoder (analysis-by-synthesis)
[Block diagram. Blocks: LP analysis; local decoder (excitation generator, synthesis filter); perceptual weighting; excitation, pitch & gain calculator; parameter encoding. Signals: x[n] (input), x~[n], d[n], d'[n], LP info, transmitted bitstream.]

Synthesis Filter
• 10th-order polynomial.
• Synthesis filter:
$$H(z) = \frac{1}{\hat{A}(z)} = \frac{1}{1 + \sum_{i=1}^{10} \hat{a}_i z^{-i}}$$
where $\hat{a}_i$ are the quantized LP coefficients obtained from the quantized LSPs.
1st subframe: interpolated between current and previous frame values.
2nd subframe: current frame values.

Decoder
[Block diagram: bitstream -> parameter decoder -> excitation generator and LP info -> synthesis filter -> postprocessing -> output speech]
• Decode codebook parameters by table lookup.
• LSP coefficients are interpolated and converted to LP coefficients for the 2 subframes.
• Excitation = sum of adaptive and fixed codebook vectors multiplied by their respective gains in each subframe.
• Speech = excitation through vocal tract filter.
• Perceived quality is enhanced by adaptive postfiltering:
$$H_p(z) = K \,\bigl(1 + g_l z^{-T}\bigr)\, \frac{1 + \sum_{i=1}^{10} \gamma_n^i \hat{a}_i z^{-i}}{1 + \sum_{i=1}^{10} \gamma_d^i \hat{a}_i z^{-i}} \,\bigl(1 + g_t z^{-1}\bigr)$$
  Long-term postfilter: $(1 + g_l z^{-T})$. Formant sharpness: the ratio term. Spectral tilt: $(1 + g_t z^{-1})$.

LP Analysis
[Block diagram: preprocessed input -> windowing, autocorrelation, Levinson-Durbin -> A(z) -> LSP -> LSP quantization -> LSP indices L0, L1, L2, L3]

Conversion of A(z) -> LSP
• LSPs are the roots of two polynomials:
$$F_1(z) = \frac{A(z) + z^{-11} A(z^{-1})}{1 + z^{-1}}, \qquad F_2(z) = \frac{A(z) - z^{-11} A(z^{-1})}{1 - z^{-1}}$$
• The 5 unique roots of each polynomial are found by evaluating it at 60 equally spaced frequencies between 0 and $\pi$, then refining within each interval where the sign changes.
• The difference between these roots and their 4th-order MA prediction is quantized.
• 2-stage VQ: (1) 7-bit codebook L1; (2) 10-bit split VQ using 5-bit codebook L2 and 5-bit codebook L3. A 1-bit index L0 selects the better set of MA predictor coefficients.

Perceptual Weighting
$$W(z) = \frac{A(z/\gamma_1)}{A(z/\gamma_2)} = \frac{1 + \sum_{i=1}^{10} \gamma_1^i a_i z^{-i}}{1 + \sum_{i=1}^{10} \gamma_2^i a_i z^{-i}}$$
[Figure: pole positions, original vs. weighted, relative to the unit circle]
[Figure: frequency response for /ey/. Curves: LP (vocal tract) filter and perceptual weighting function. Axes: magnitude in dB (-20 to 20) vs. frequency in Hz (0 to 4000).]
• Flat: $\gamma_1 = 0.94$, $\gamma_2 = 0.6$
• Tilted: $\gamma_1 = 0.98$, $0.4 \le \gamma_2 \le 0.7$

Adaptive Codebook
• Determines the pitch delay and pitch gain (the periodic portion of the excitation).
• A candidate delay T_op is selected as the delay giving the highest correlation in a perceptually weighted speech frame.
• 1st subframe: T1 is found by searching within 3 samples of T_op (range 20-85 at resolution 1/3, range 85-143 at resolution 1), then encoded into 8 bits.
• 2nd subframe: T2 is found by searching within [int(T1) - 5 2/3, int(T1) + 4 2/3] at resolution 1/3, then encoded into 5 bits.
• A parity bit P0 is computed in the 1st subframe as the XOR of the 6 most significant bits of the pitch index P1, for bit-error protection.
• Gain = normalized correlation of the reconstructed signal and the pitch-shifted reconstructed signal.

Fixed Codebook
• Algebraic codebook structure. Each vector contains 4 nonzero pulses.
The possible values are:
Pulse   Sign          Positions
i0      S0 = -1, +1   m0 = 0, 5, 10, 15, 20, 25, 30, 35
i1      S1 = -1, +1   m1 = 1, 6, 11, 16, 21, 26, 31, 36
i2      S2 = -1, +1   m2 = 2, 7, 12, 17, 22, 27, 32, 37
i3      S3 = -1, +1   m3 = 3, 8, 13, 18, 23, 28, 33, 38, 4, 9, 14, 19, 24, 29, 34, 39
• The codebook vector c(n) is 40-dimensional, with four unit pulses at the selected positions and with the corresponding signs.

Fixed Codebook
• Encodes the random portion of the excitation signal.
• The periodic portion of the weighted residual is removed first. Only the random portion remains to be coded by the fixed codebook.
• The codebook is searched by minimizing the error between the perceptually weighted input speech and the reconstructed speech:
$$E = \sum_{n=0}^{39} \Bigl( \bigl( x[n] - c[n] * h[n] \bigr) * w[n] \Bigr)^2$$
where $*$ denotes convolution.
• For each subframe: the signs and positions of the 4 nonzero pulses are computed and encoded into 17 bits.

Speech Demo
• Female speech: original, then G.729 decoded
• Male speech: original, then G.729 decoded

G.729 Additions
• Annex A (11/96) uses about half the CPU power with a minimal reduction in perceived quality (MOS 3.7).
• Annex B (10/96) adds discontinuous transmission (DTX), voice activity detection (VAD), background noise modeling, comfort noise generation (CNG), and silence frame insertion.
• Annexes D, E (09/98): 6.4 kbit/s and 11.8 kbit/s CS-ACELP speech coding algorithms.
• Annexes F, G, H, I (1998-2001) enhance the capabilities of previous annexes (e.g. DTX/VAD/CNG) and integrate the different bit-rate codecs.

References
1. ITU-T G.729 official recommendation, available at: http://www.itu.int/rec/recommendation.asp?type=folders&lang=e&parent=T-REC-G.729
2. Andreas S. Spanias, "Speech Coding: A Tutorial Review", Proc. of IEEE, Vol. 82, No. 10, pp. 1541-1582, October 1994.
3. GAO Research Inc., G.729 description and demo: http://www.gaoresearch.com/products/speechsoftware/other/g729.php
4. Jade Clayton, Privateline, writing about G.729 and other standards for VoIP: http://www.privateline.com/clayton/clayton2.htm
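
Worked example: fixed codebook vector
The algebraic codebook structure from the Fixed Codebook slides can be sketched in a few lines of Python. This is a minimal illustration only, not the reference implementation: the names TRACKS, build_codevector and codebook_bits are hypothetical, and the bit count simply assumes one sign bit per pulse plus enough bits to index each pulse's allowed positions.

```python
# Hypothetical sketch of the algebraic fixed-codebook structure.
# The track layout is copied from the pulse table in the slides;
# the function and variable names are illustrative assumptions.

SUBFRAME_LEN = 40

TRACKS = [
    list(range(0, 40, 5)),                          # i0: 0, 5, ..., 35
    list(range(1, 40, 5)),                          # i1: 1, 6, ..., 36
    list(range(2, 40, 5)),                          # i2: 2, 7, ..., 37
    list(range(3, 40, 5)) + list(range(4, 40, 5)),  # i3: 3, 8, ..., 38, 4, 9, ..., 39
]


def build_codevector(positions, signs):
    """Return the 40-sample vector c(n) with four signed unit pulses."""
    if len(positions) != len(TRACKS) or len(signs) != len(TRACKS):
        raise ValueError("exactly four positions and four signs are required")
    c = [0] * SUBFRAME_LEN
    for track, m, s in zip(TRACKS, positions, signs):
        if m not in track:
            raise ValueError(f"position {m} is not allowed for this pulse")
        if s not in (-1, 1):
            raise ValueError("each sign must be -1 or +1")
        c[m] = s  # tracks are disjoint, so pulses never collide
    return c


def codebook_bits():
    """Bits needed to index every pulse position, plus one sign bit per pulse."""
    position_bits = sum((len(track) - 1).bit_length() for track in TRACKS)
    sign_bits = len(TRACKS)
    return position_bits + sign_bits


if __name__ == "__main__":
    c = build_codevector([5, 11, 22, 39], [1, -1, 1, -1])
    print(c)
    print(codebook_bits())
```

The position bits come out as 3 + 3 + 3 + 4 = 13, and with 4 sign bits the total matches the 17 bits per subframe quoted for the excitation. How the standard actually packs these bits into the transmitted index is not shown here.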