ITU-T G.729
EE8873
Rungsun Munkong
March 22, 2004
Outline
• Introduction to ITU-T G.729
• Overall encoder and decoder
• Key components
- Decoder
- Perceptual Weighting
- LP analysis
- Adaptive codebook
- Fixed codebook
• Speech Demo
Introduction
• Approved by the ITU in 03/96: low bit rate, low complexity,
toll quality (MOS 3.9), using CS-ACELP (Conjugate-Structure
Algebraic-Code-Excited Linear Prediction).
• Input is band-limited 16-bit PCM speech sampled at 8 kHz
(128 kbit/s). With an output rate of 8 kbit/s, a compression
ratio of 16:1 is achieved.
• The short-term synthesis filter is a 10th-order Linear
Prediction (LP) filter, updated every 10 ms frame.
• The long-term (pitch) synthesis filter is implemented using
the adaptive codebook.
• Low algorithmic delay: 15 ms (10 ms current frame + 5 ms
lookahead).
• Preferred over 64 kbit/s G.711 and 32 kbit/s ADPCM coders
because of its better bandwidth utilization. Used in VoIP,
wireless communications, and digital satellite systems.
Coding bit distribution
Bits per 10 ms frame:

Parameter     Subframe 1   Subframe 2   Frame total
LSP               -            -            18
Pitch             9            5            14
Excitation       17           17            34
Gains             7            7            14
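As a quick consistency check, the bit allocation can be summed and compared with the 8 kbit/s rate and 16:1 ratio quoted in the introduction. This is a minimal sketch; the dictionary layout and variable names are illustrative only.

```python
# Bit allocation per 10 ms frame, copied from the distribution above.
BITS = {
    "LSP":        {"frame": 18},
    "pitch":      {"subframe1": 9, "subframe2": 5},
    "excitation": {"subframe1": 17, "subframe2": 17},
    "gains":      {"subframe1": 7, "subframe2": 7},
}

FRAME_MS = 10           # frame length in milliseconds
SAMPLE_RATE_HZ = 8000   # input sampling rate
BITS_PER_SAMPLE = 16    # linear PCM input

# Total bits per frame, then the coded and uncoded bit rates in bit/s.
bits_per_frame = sum(sum(parts.values()) for parts in BITS.values())
coded_rate = bits_per_frame * 1000 // FRAME_MS
pcm_rate = SAMPLE_RATE_HZ * BITS_PER_SAMPLE
ratio = pcm_rate / coded_rate

print(bits_per_frame, coded_rate, pcm_rate, ratio)
```

The four parameter groups add up to 80 bits per frame, which at 100 frames per second gives the 8 kbit/s output rate.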
Encoder (analysis-by-synthesis)
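[Block diagram: the input x[n] feeds LP analysis, which supplies LP info to a local decoder (excitation generator followed by synthesis filter) producing the reconstruction x~[n]. The error d[n] between x[n] and x~[n] passes through perceptual weighting to give d'[n], which drives the excitation, pitch and gain calculator. Parameter encoding produces the transmitted bitstream.]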
Synthesis Filter
• 10th order polynomial.
• Synthesis filter:
$$H(z) = \frac{1}{\hat{A}(z)} = \frac{1}{1 + \sum_{i=1}^{10} \hat{a}_i z^{-i}}$$
where $\hat{a}_i$ are the quantized LP coefficients obtained from the
quantized LSPs.
1st subframe: interpolated between current and
previous frames values.
2nd subframe: current frame values.
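The all-pole synthesis filter and the subframe interpolation can be sketched as follows. This is an illustrative sketch, not the reference implementation: the function names are hypothetical, and the equal 0.5/0.5 interpolation weights are an assumption (the slide only says the first subframe is interpolated between the previous and current frames).

```python
def interpolate_lsp(prev_lsp, cur_lsp):
    """1st-subframe LSPs as the midpoint of previous and current frame values.

    The equal weighting is an assumption made for this sketch.
    """
    return [0.5 * p + 0.5 * c for p, c in zip(prev_lsp, cur_lsp)]


def synthesize(excitation, a_hat, state=None):
    """Filter the excitation through 1/A(z), A(z) = 1 + sum_i a_hat[i-1] z^-i.

    state holds the previous outputs, most recent first, so the filter
    memory can be carried across subframes.
    """
    order = len(a_hat)
    mem = list(state) if state is not None else [0.0] * order
    out = []
    for e in excitation:
        # s[n] = e[n] - sum_i a_i * s[n-i]
        s = e - sum(a * m for a, m in zip(a_hat, mem))
        out.append(s)
        mem = [s] + mem[:-1]
    return out, mem


# Impulse response of a first-order example, A(z) = 1 - 0.5 z^-1.
speech, memory = synthesize([1.0, 0.0, 0.0], [-0.5])
print(speech)
```

Returning the filter memory lets the caller process the two subframes of a frame back to back with different coefficient sets.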
Decoder
[Block diagram: bitstream -> parameter decoder -> excitation generator and LP info -> synthesis filter -> postprocessing -> output speech.]
Decoder
• Codebook parameters are decoded by table lookup.
• LSP coefficients are interpolated and converted to LP
coefficients for the 2 subframes.
• Excitation = sum of the adaptive and fixed codebook
vectors, each multiplied by its respective gain, in each
subframe.
• Speech = excitation passed through the vocal tract filter.
• Perceived quality is enhanced by adaptive postfiltering.
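The excitation construction described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: only integer pitch delays are handled (the codec also uses 1/3-sample resolution), and the function names are hypothetical.

```python
def adaptive_vector(past_excitation, delay, n=40):
    """Adaptive codebook vector: past excitation repeated at the pitch delay.

    Integer delays only. For delay < n, already generated samples are
    reused so the vector stays periodic.
    """
    buf = list(past_excitation)
    out = []
    for _ in range(n):
        sample = buf[-delay]   # sample located `delay` positions back
        out.append(sample)
        buf.append(sample)
    return out


def excitation(adaptive, fixed, pitch_gain, code_gain):
    """u[n] = g_p * v[n] + g_c * c[n] for one subframe."""
    return [pitch_gain * v + code_gain * c for v, c in zip(adaptive, fixed)]


v = adaptive_vector([1.0, 2.0, 3.0], delay=2, n=4)
u = excitation(v, [1.0, 0.0, 0.0, -1.0], pitch_gain=0.5, code_gain=2.0)
print(v, u)
```

The resulting excitation would then be passed through the synthesis filter to produce the subframe's speech samples.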
$$H_p(z) = K \,(1 + g_l z^{-T})\; \frac{1 + \sum_{i=1}^{10} \gamma_n^i \hat{a}_i z^{-i}}{1 + \sum_{i=1}^{10} \gamma_d^i \hat{a}_i z^{-i}}\; (1 + g_t z^{-1})$$
The three factors are, in order: the long-term postfilter (delay T), formant sharpness, and spectral tilt.
LP Analysis
[Block diagram: preprocessed input -> windowing, autocorrelation, Levinson-Durbin -> A(z) -> LSP conversion -> LSP quantization (L0, L1, L2, L3) -> LSP index.]
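The autocorrelation and Levinson-Durbin stages of this chain can be sketched as below. This is a generic textbook recursion under the sign convention A(z) = 1 + sum a_i z^-i; windowing and any refinements used by the actual codec are omitted, and the function names are hypothetical.

```python
def autocorrelation(x, order):
    """r[k] = sum_n x[n] * x[n-k] for k = 0..order."""
    n = len(x)
    return [sum(x[i] * x[i - k] for i in range(k, n)) for k in range(order + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for A(z) = 1 + sum_i a_i z^-i.

    Returns ([a_1, ..., a_order], prediction_error).
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k                  # updated prediction error
    return a[1:], err


# Autocorrelation of a first-order process with correlation 0.5 per lag.
coeffs, error = levinson_durbin([1.0, 0.5, 0.25], order=2)
print(coeffs, error)
```

For the codec itself the order would be 10, giving the coefficients of the 10th-order A(z).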
Conversion of A(z) -> LSP
The LSPs are the roots of two polynomials:
$$F_1(z) = \frac{A(z) + z^{-11} A(z^{-1})}{1 + z^{-1}}$$
$$F_2(z) = \frac{A(z) - z^{-11} A(z^{-1})}{1 - z^{-1}}$$
• The 5 unique roots of each polynomial are found by
evaluating it at 60 equally spaced frequencies between 0
and π, then refining the estimate inside each interval where
the sign changes.
• The differences between these roots and their 4th-order
MA prediction are quantized.
• 2-stage VQ: (1) a 7-bit codebook L1; (2) a 10-bit split VQ
made of a 5-bit codebook L2 and a 5-bit codebook L3.
A 1-bit index L0 selects which set of MA predictor
coefficients is used.
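The root search can be sketched as follows: build the coefficients of F1(z) and F2(z) from A(z), evaluate each on the unit circle over a 60-interval grid, and bisect wherever the sign changes. This is an illustrative sketch; the plain bisection and its iteration count are assumptions, and the function names are hypothetical.

```python
import math


def lsp_polynomials(a):
    """Coefficients 0..5 of F1(z) and F2(z) for a = [a_1, ..., a_10]."""
    c = [1.0] + list(a)                 # c[k] = a_k, with a_0 = 1
    f1, f2 = [1.0], [1.0]
    for i in range(5):
        f1.append(c[i + 1] + c[10 - i] - f1[i])   # division by (1 + z^-1)
        f2.append(c[i + 1] - c[10 - i] + f2[i])   # division by (1 - z^-1)
    return f1, f2


def eval_on_circle(f, w):
    """Real function whose zeros in w are the unit-circle roots of F(z)."""
    return sum(f[k] * math.cos((5 - k) * w) for k in range(5)) + f[5] / 2.0


def find_roots(f, grid=60, iterations=30):
    """Scan [0, pi] on a uniform grid and bisect each sign-change interval."""
    roots = []
    prev_w, prev_v = 0.0, eval_on_circle(f, 0.0)
    for j in range(1, grid + 1):
        w = math.pi * j / grid
        v = eval_on_circle(f, w)
        if prev_v * v < 0.0:
            lo, hi, v_lo = prev_w, w, prev_v
            for _ in range(iterations):
                mid = 0.5 * (lo + hi)
                v_mid = eval_on_circle(f, mid)
                if v_lo * v_mid <= 0.0:
                    hi = mid
                else:
                    lo, v_lo = mid, v_mid
            roots.append(0.5 * (lo + hi))
        prev_w, prev_v = w, v
    return roots


def lp_to_lsf(a):
    """Line spectral frequencies (radians, ascending) of a 10th-order A(z)."""
    f1, f2 = lsp_polynomials(a)
    return sorted(find_roots(f1) + find_roots(f2))


# With A(z) = 1 the ten frequencies are evenly spaced at k*pi/11.
lsf = lp_to_lsf([0.0] * 10)
print([round(w, 4) for w in lsf])
```

The roots of F1 and F2 interlace, so sorting the combined list gives the ordered set of ten line spectral frequencies.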
Perceptual Weighting
$$W(z) = \frac{A(z/\gamma_1)}{A(z/\gamma_2)} = \frac{1 + \sum_{i=1}^{10} \gamma_1^i a_i z^{-i}}{1 + \sum_{i=1}^{10} \gamma_2^i a_i z^{-i}}$$
(numerator: LP filter; denominator: vocal tract)
[Figure: original and weighted roots plotted against the unit circle.]
[Figure: "Frequency Response for /ey/", magnitude in dB (-20 to 20) versus frequency (0 to 4000 Hz); curves: LP filter, vocal tract filter, perceptual weighting function.]
Flat: $\gamma_1 = 0.94$, $\gamma_2 = 0.6$. Tilted: $\gamma_1 = 0.94$, $\gamma_2 = 0.4$ to $0.7$.
Adaptive Codebook
• Determines the pitch delay and pitch gain (the periodic
portion of the excitation).
• An open-loop candidate delay T_op is selected as the delay
giving the highest correlation over a frame of perceptually
weighted speech.
• 1st subframe: T1 is found by searching within 3 samples of
T_op (resolution 1/3 in the range 20-85, resolution 1 in the
range 85-143), then encoded into 8 bits.
• 2nd subframe: T2 is found by searching the interval
[int(T1) - 5 2/3, int(T1) + 4 2/3] with resolution 1/3, then
encoded into 5 bits.
• A parity bit P0 is computed in the 1st subframe as the XOR
of the 6 MSBs of P1 (the 8-bit index of T1), providing
bit-error protection.
• Gain = normalized correlation between the reconstructed
signal and the pitch-shifted reconstructed signal.
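The open-loop search, the gain computation and the parity bit can be sketched as below. This is a simplified illustration: it searches integer delays only over a single range, picks the lowest delay on ties, and follows the slide's description of the parity bit literally. All function names are hypothetical.

```python
def open_loop_pitch(sw, frame_len=80, t_min=20, t_max=143):
    """Integer delay maximizing the correlation over the last frame of sw.

    sw is the weighted speech and must contain at least t_max samples of
    history before the frame.
    """
    if len(sw) < frame_len + t_max:
        raise ValueError("not enough history for the requested delay range")
    start = len(sw) - frame_len

    def correlation(delay):
        return sum(sw[n] * sw[n - delay] for n in range(start, len(sw)))

    # max() returns the first maximum, so ties go to the lowest delay.
    return max(range(t_min, t_max + 1), key=correlation)


def pitch_gain(target, delayed):
    """Normalized correlation between the target and the delayed signal."""
    energy = sum(y * y for y in delayed)
    if energy == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(target, delayed)) / energy


def parity_bit(p1):
    """XOR of the 6 most significant bits of the 8-bit index p1."""
    msb6 = (p1 >> 2) & 0x3F
    parity = 0
    while msb6:
        parity ^= msb6 & 1
        msb6 >>= 1
    return parity


# Impulse train with a period of 50 samples.
signal = [1.0 if n % 50 == 0 else 0.0 for n in range(223)]
print(open_loop_pitch(signal), pitch_gain([1.0, 2.0], [2.0, 4.0]), parity_bit(0b10000000))
```

In the codec, the closed-loop search for T1 and T2 would then refine this open-loop estimate at fractional resolution.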
Fixed Codebook
• Algebraic codebook structure. Each vector contains 4
nonzero pulses. The possible values are:

Pulse   Sign           Positions
i0      s0 = -1, +1    m0 = 0, 5, 10, 15, 20, 25, 30, 35
i1      s1 = -1, +1    m1 = 1, 6, 11, 16, 21, 26, 31, 36
i2      s2 = -1, +1    m2 = 2, 7, 12, 17, 22, 27, 32, 37
i3      s3 = -1, +1    m3 = 3, 8, 13, 18, 23, 28, 33, 38,
                            4, 9, 14, 19, 24, 29, 34, 39

• The codebook vector c(n) is 40-dimensional, with four unit
pulses at the selected positions, each carrying its
corresponding sign.
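Building a codebook vector from the four pulse positions and signs can be sketched as follows. The track lists mirror the position table above; the function name and the validation behaviour are illustrative choices.

```python
SUBFRAME_LEN = 40

# Allowed positions for pulses i0..i3, as in the table above.
TRACKS = [
    list(range(0, 40, 5)),                          # i0: 0, 5, ..., 35
    list(range(1, 40, 5)),                          # i1: 1, 6, ..., 36
    list(range(2, 40, 5)),                          # i2: 2, 7, ..., 37
    list(range(3, 40, 5)) + list(range(4, 40, 5)),  # i3: 3, ..., 38, 4, ..., 39
]


def build_codevector(positions, signs):
    """Return the 40-sample vector c(n) with four signed unit pulses."""
    c = [0] * SUBFRAME_LEN
    for track, m, s in zip(TRACKS, positions, signs):
        if m not in track:
            raise ValueError(f"position {m} is not allowed on this track")
        if s not in (-1, 1):
            raise ValueError("signs must be -1 or +1")
        c[m] = s
    return c


c = build_codevector([0, 6, 12, 39], [1, -1, 1, -1])
print(c)
```

Because the four tracks are disjoint, two pulses can never land on the same sample, so each vector always has exactly four nonzero entries.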
Fixed Codebook
• Encodes the random (non-periodic) portion of the excitation
signal.
• The periodic portion of the weighted residual is removed
first, so only the random portion remains to be coded by the
fixed codebook.
• The codebook is searched by minimizing the error between
the perceptually weighted input speech and the reconstructed
speech.
$$E = \sum_{n=0}^{39} \Big( \big( x[n] - c[n] * h[n] \big) * w[n] \Big)^2$$
• For each subframe, the signs and positions of the 4 nonzero
pulses are encoded into 17 bits.
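The 17 bits follow from the track sizes: three tracks of 8 positions need 3 bits each, the fourth track of 16 positions needs 4 bits, and each of the four signs needs 1 bit. The sketch below shows one possible packing into a 13-bit position index and a 4-bit sign index. The exact bit layout is an assumption made for illustration and is not taken from the slides.

```python
def encode_pulses(positions, signs):
    """Pack 4 pulse positions and signs into (13-bit index, 4-bit index).

    Assumed layout: 3 bits each for m0, m1, m2 (position // 5), 4 bits for
    m3 (position // 5 plus one bit selecting the 3,8,... or 4,9,... half),
    and one sign bit per pulse (1 = positive).
    """
    m0, m1, m2, m3 = positions
    jx = 1 if m3 % 5 == 4 else 0   # which half of the fourth track
    position_index = (m0 // 5) + 8 * (m1 // 5) + 64 * (m2 // 5) \
        + 512 * (2 * (m3 // 5) + jx)
    sign_index = sum((1 if s > 0 else 0) << k for k, s in enumerate(signs))
    return position_index, sign_index


POSITION_BITS = 3 + 3 + 3 + 4
SIGN_BITS = 4

low = encode_pulses((0, 1, 2, 3), (1, 1, 1, 1))
high = encode_pulses((35, 36, 37, 39), (-1, -1, -1, -1))
print(POSITION_BITS + SIGN_BITS, low, high)
```

The largest position index is 8191, which fits in 13 bits, so positions and signs together occupy 17 bits per subframe.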
Speech Demo
female speech:
G.729 decoded:
male speech:
G.729 decoded:
G.729 Additions
• Annex A (11/96) uses about half the CPU power with a
minimal reduction in perceived quality (MOS 3.7).
• Annex B (10/96) adds discontinuous transmission
(DTX), voice activity detection (VAD),
background noise modeling, comfort noise
generation (CNG), and silence frame insertion.
• Annexes D and E (09/98) add 6.4 kbit/s and 11.8 kbit/s
CS-ACELP speech coding algorithms.
• Annexes F, G, H, and I (1998-2001) enhance the capabilities
of previous annexes (e.g. DTX/VAD/CNG) and also
integrate codecs of different bit rates.