Speech Coding Techniques

潘奕誠
4/7/2003
Introduction

Efficient speech-coding techniques





Advantages for VoIP
Digital streams of ones and zeros
The lower the bandwidth, the lower the quality
RTP payload types
Processing power


Better quality (for a given bandwidth) requires a more complex algorithm
A balance between quality and cost
Voice Quality

Bandwidth is easily quantified


Voice quality is subjective
MOS, Mean Opinion Score

ITU-T Recommendation P.800







Excellent – 5
Good – 4
Fair – 3
Poor – 2
Bad – 1
A minimum of 30 people
Listeners rate voice samples or take part in conversations

P.800 recommendations





The selection of participants
The test environment
Explanations to listeners
Analysis of results
Toll quality

An MOS of 4.0 or higher
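The scoring described on these slides can be sketched in a few lines. This is only an illustration: the vote list is made up, and the names `mean_opinion_score` and `is_toll_quality` are hypothetical helpers, not part of P.800.

```python
from statistics import mean

# Rating scale from the slide (5 = Excellent ... 1 = Bad).
SCALE = {"Excellent": 5, "Good": 4, "Fair": 3, "Poor": 2, "Bad": 1}
MIN_LISTENERS = 30   # "A minimum of 30 people"
TOLL_QUALITY = 4.0   # toll quality threshold from the slide

def mean_opinion_score(ratings):
    """Average the listeners' category ratings into a MOS."""
    if len(ratings) < MIN_LISTENERS:
        raise ValueError("need at least %d listeners" % MIN_LISTENERS)
    return mean(SCALE[r] for r in ratings)

def is_toll_quality(mos):
    """True when the MOS reaches the toll-quality threshold."""
    return mos >= TOLL_QUALITY

# Hypothetical panel of 30 listeners (made-up votes, for illustration only).
votes = ["Excellent"] * 10 + ["Good"] * 15 + ["Fair"] * 5
mos = mean_opinion_score(votes)
print(round(mos, 2), is_toll_quality(mos))
```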
About Speech

Speech




Model the vocal tract as a filter


Air pushed from the lungs past the vocal cords and along the vocal tract
The basic vibrations – vocal cords
The sound is altered by the disposition of the vocal tract (tongue and mouth)
The shape changes relatively slowly
The vibrations at the vocal cords

The excitation signal
Speech sounds

Voiced sounds
The vocal cords vibrate open and closed
Quasi-periodic pulses of air
The rate of the opening and closing – the pitch

Unvoiced sounds
Forcing air at high velocities through a constriction
Noise-like turbulence
Show little long-term periodicity
Short-term correlations still present
Plosive sounds


A complete closure in the vocal tract
Air pressure is built up and released suddenly
Voice Sampling

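Discrete-Time LTI Systems: The Convolution Sum
$$x[n] = \sum_{k=-\infty}^{\infty} x[k]\,\delta[n-k]$$
$$y[n] = \sum_{k=-\infty}^{\infty} x[k]\,h[n-k]$$
[Figure: example signals h[n] (height 1 at n = 0, 1, 2), x[n] (0.5 at n = 0, 2 at n = 1), and the output y[n] for n = 0 to 3 (values 0.5, 2.5 and 2 marked)]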
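The convolution sum can be evaluated directly for finite sequences. The sketch below assumes the example values that can be read off the figure's axis ticks (x[0] = 0.5, x[1] = 2, and h[n] = 1 for n = 0, 1, 2); the function name `convolve` is just an illustrative choice.

```python
def convolve(x, h):
    """Direct evaluation of y[n] = sum_k x[k] * h[n - k] for finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for k, xk in enumerate(x):
        for m, hm in enumerate(h):
            y[k + m] += xk * hm   # contribution of x[k] * h[n - k] with n = k + m
    return y

x = [0.5, 2.0]        # assumed from the figure: x[0] = 0.5, x[1] = 2
h = [1.0, 1.0, 1.0]   # assumed from the figure: h[n] = 1 for n = 0, 1, 2
y = convolve(x, h)
print(y)
```

Each input sample launches a scaled, shifted copy of h[n], and the output is the sum of those copies.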
 Nyquist sampling theorem
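$$s(t) = \sum_{n=-\infty}^{\infty} \delta(t - nT)$$
$$x_s(t) = x_c(t)\,s(t) = x_c(t) \sum_{n=-\infty}^{\infty} \delta(t - nT)$$
$$S(j\Omega) = \frac{2\pi}{T} \sum_{k=-\infty}^{\infty} \delta(\Omega - k\Omega_s)$$
[Figure: the band-limited spectrum $X_c(j\Omega)$, occupying $-\Omega_N$ to $\Omega_N$, and its copies repeated at multiples of $\Omega_s$; the first copy begins at $\Omega_s - \Omega_N$]
No aliasing occurs when $\Omega_s - \Omega_N > \Omega_N$, that is, $\Omega_s > 2\Omega_N$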
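A small numerical check makes the sampling condition concrete. This sketch assumes the 8 kHz sampling rate used elsewhere in these slides and two test tones chosen only for illustration: 3 kHz (below half the sampling rate) and 5 kHz (above it).

```python
import math

FS = 8000  # sampling rate in Hz (assumed: the 8 kHz telephony rate)

def sample_tone(freq_hz, count, fs=FS):
    """Sample cos(2*pi*f*t) at t = n / fs for n = 0 .. count-1."""
    return [math.cos(2 * math.pi * freq_hz * n / fs) for n in range(count)]

# 3 kHz is below fs/2 = 4 kHz; 5 kHz is above it.
ok = sample_tone(3000, 16)
aliased = sample_tone(5000, 16)

# The 5 kHz tone produces the same samples as the 3 kHz tone: it has aliased.
same = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(ok, aliased))
print(same)
```

After sampling, the two tones cannot be told apart, which is why the input must be band-limited below half the sampling rate.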
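Quantization (Scalar Quantization)
[Figure: the range from $m_0 = -A$ to $m_L = A$ divided by the boundaries $m_1, m_2, \ldots, m_k, m_{k+1}, \ldots, m_{L-1}$ into intervals $J_k$, each with a representative value $v_k$]
Assume $|x[n]| \le A$
Divide the range $[-A, A]$ into L quantization levels $\{J_1, J_2, \ldots, J_k, \ldots, J_L\}$
$J_k : [m_{k-1}, m_k]$
$L = 2^R$
Each quantization level $J_k$ is represented by a value $v_k$
$S = \bigcup_k J_k$, $V = \{v_1, v_2, \ldots, v_k, \ldots, v_L\}$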
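A uniform version of this quantizer can be sketched as follows. Two assumptions are made here that the slide does not state: R is read as the number of bits per sample, and each representative value v_k is taken to be the midpoint of its interval.

```python
def uniform_quantizer(x, A, R):
    """Quantize x in [-A, A] with L = 2**R equal-width intervals J_k.

    Returns (k, v_k): the interval index (1..L) and its representative
    value, taken here to be the interval midpoint (an assumption).
    """
    L = 2 ** R
    width = 2 * A / L                      # m_k - m_{k-1}
    x = max(-A, min(A, x))                 # enforce |x| <= A
    k = min(int((x + A) / width) + 1, L)   # x = A falls in the last interval
    v_k = -A + (k - 0.5) * width
    return k, v_k

print(uniform_quantizer(0.3, A=1.0, R=2))
```

With midpoint representatives, the quantization error never exceeds half an interval width.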
Non-Uniform Quantization
[Figure: the range from $m_0 = -A$ to $m_L = A$ with the boundaries $m_1, m_2, \ldots$ placed around 0]
Concept: small quantization levels for small x, large quantization levels for large x
Goal: constant $SNR_Q$ for all x
Companding
[Block diagram: x[n] → Compressor F(x) → Uniform Quantization → …1101… → Uniform Decoder → Expandor $F^{-1}(x)$ → $\hat{x}[n]$]
Compressor + Expandor → Compandor
F(x) specifies the non-uniform quantization characteristics
Non-Uniform Quantization
μ-law
$$F(x) = \frac{\log(1 + \mu x)}{\log(1 + \mu)}, \quad 0 \le x \le 1$$
A-law
$$F(x) = \begin{cases} \dfrac{A x}{1 + \ln A}, & 0 \le x \le \dfrac{1}{A} \\[4pt] \dfrac{1 + \ln(A x)}{1 + \ln A}, & \dfrac{1}{A} \le x \le 1 \end{cases}$$
Typical values in practice: μ = 255, A = 87.6
Types of Speech Codecs

Waveform codecs, source codecs (also known as vocoders), and hybrid codecs.
Speech Source Model and Source Coding
[Block diagram: a periodic pulse train generator (voiced) or a random sequence generator (unvoiced), selected by the v/u switch and scaled by the gain G, produces the excitation signal u[n]; u[n] drives the vocal tract model G(z), G(ω), g[n] to give x[n]]
Excitation parameters
v/u : voiced/unvoiced
N : pitch for voiced
G : signal gain
→ excitation signal u[n]
Vocal Tract Model
$$G(z) = \frac{1}{1 - \sum_{k=1}^{P} a_k z^{-k}}$$
Vocal Tract parameters
{ak} : LPC coefficients
formant structure of speech signals
A good approximation, though not precise enough
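The source model can be sketched as an excitation generator followed by an all-pole filter. Assumptions made for this example: N is treated as the pitch period in samples, the unvoiced source is Gaussian noise, and the filter starts from a zero state. The function names are illustrative only.

```python
import random

def excitation(voiced, N, G, length, seed=0):
    """Excitation u[n]: gain-scaled pulse train (voiced) or random sequence (unvoiced)."""
    if voiced:
        return [G if n % N == 0 else 0.0 for n in range(length)]
    rng = random.Random(seed)
    return [G * rng.gauss(0.0, 1.0) for _ in range(length)]

def vocal_tract(u, a):
    """All-pole filter G(z) = 1 / (1 - sum_k a_k z^-k):
    x[n] = u[n] + sum_{k=1..P} a_k * x[n - k], starting from a zero state."""
    x = []
    for n, un in enumerate(u):
        acc = un
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * x[n - k]
        x.append(acc)
    return x

# A voiced segment with a pitch period of 4 samples and one LPC coefficient.
u = excitation(voiced=True, N=4, G=1.0, length=8)
print(vocal_tract(u, a=[0.5]))
```

Only v/u, N, G and the coefficients {a_k} need to be sent; the decoder regenerates the waveform from them.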
LPC Vocoder (Voice Coder)
[Block diagram, encoder: x[n] → LPC Analysis → { ak }, N, G, v/u → Encoder → …11011…]
N by pitch detection
v/u by voicing detection
[Block diagram, receiver: …11011… → Decoder → { ak }, N, G, v/u → excitation (Ex) → g[n], G(z) → x[n]]
{ak} can be non-uniformly or vector quantized to reduce the bit rate further
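The slides do not say how the LPC analysis block computes {a_k}. One common choice is the autocorrelation method solved with the Levinson-Durbin recursion; the sketch below assumes that method and uses the predictor convention x[n] ≈ Σ a_k x[n−k].

```python
def autocorrelation(x, max_lag):
    """r[k] = sum_n x[n] * x[n + k] for k = 0 .. max_lag."""
    return [sum(x[n] * x[n + k] for n in range(len(x) - k))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the LPC normal equations from autocorrelation values r[0..order].

    Returns (a, err): predictor coefficients a_1..a_order and the final
    prediction error energy.
    """
    a = [0.0] * (order + 1)   # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                      # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err

# Autocorrelation of an idealized first-order process: r[k] = 0.5 ** k.
coeffs, err = levinson_durbin([1.0, 0.5, 0.25], order=2)
print(coeffs, err)
```

In a vocoder this would run once per analysis frame, typically on a windowed block of samples.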
G.711



The most commonplace codec
 Used in the circuit-switched telephone network
 PCM, Pulse-Code Modulation
If uniform quantization
 12 bits * 8 k samples/sec = 96 kbps
Non-uniform quantization
 8 bits * 8 k samples/sec = 64 kbps, the DS0 rate
μ-law
 North America
A-law
 Other countries; a little friendlier to lower signal levels
An MOS of about 4.3
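The bit-rate arithmetic is simply bits per sample times samples per second. A minimal sketch, assuming 8000 samples per second and 8 bits per companded G.711 sample:

```python
SAMPLE_RATE_HZ = 8000  # 8 k samples per second

def pcm_bit_rate(bits_per_sample, sample_rate_hz=SAMPLE_RATE_HZ):
    """Bit rate of a PCM stream in bits per second."""
    return bits_per_sample * sample_rate_hz

print(pcm_bit_rate(12))  # uniform quantization
print(pcm_bit_rate(8))   # companded G.711 (the DS0 rate)
```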



ADPCM (Adaptive Differential PCM)

DPCM and ADPCM.

ADPCM : Adaptive Prediction in DPCM
Adaptive Quantization
Adaptive Quantization





Quantization level $\Delta$ varies with local signal level
$\Delta[n] = a\,\sigma_x[n]$
$\sigma_x[n]$ : locally estimated standard deviation of x[n]
G.721: ADPCM-coded speech at 32 kbps
G.726 (A-law or μ-law)
16, 24, 32, 40 kbps
MOS 4.0 at 32 kbps
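The idea of a step size that follows the local signal level can be sketched with a block-based (feed-forward) adaptive quantizer. This shows only the adaptive quantization part, not the prediction loop of G.721 or G.726; the block size, the constant a, and the 4-bit code width are illustrative assumptions.

```python
from statistics import pstdev

def adaptive_quantize(x, a=0.5, block=16, bits=4):
    """Feed-forward adaptive quantization: per block, step = a * sigma.

    Returns a list of (step, codes) pairs; the step is side information
    that the decoder needs in order to rebuild the samples.
    """
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    out = []
    for start in range(0, len(x), block):
        chunk = x[start:start + block]
        sigma = pstdev(chunk) or 1e-9          # avoid a zero step on silence
        step = a * sigma
        codes = [max(lo, min(hi, round(s / step))) for s in chunk]
        out.append((step, codes))
    return out

def adaptive_dequantize(frames):
    """Rebuild samples from (step, codes) pairs."""
    return [c * step for step, codes in frames for c in codes]

# A quiet block followed by a loud block (made-up test signal).
quiet = [0.01 * ((i % 5) - 2) for i in range(16)]
loud = [1.0 * ((i % 5) - 2) for i in range(16)]
frames = adaptive_quantize(quiet + loud)
print([step for step, _ in frames])
```

The quiet block gets a much finer step than the loud block, so both are coded with a similar relative accuracy using the same number of bits.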
Analysis-by-Synthesis (AbS) Codecs

Hybrid codec
 Fill the gap between waveform and source codecs
 The most successful and commonly used
 Time-domain AbS codecs
 Not a simple two-state, voiced/unvoiced
 Different excitation signals are attempted
 Closest to the original waveform is selected
 MPE, Multi-Pulse Excited
 RPE, Regular-Pulse Excited
 CELP, Code-Excited Linear Predictive
G.728 LD-CELP

CELP codecs




A filter; its characteristics change over time
A codebook of acoustic vectors
 A vector = a set of elements representing various characteristics of the excitation
Transmit
 Filter coefficients, gain, a pointer to the vector chosen
Low Delay CELP

Backward-adaptive coder
 Use previous samples to determine filter coefficients
 Operates on five samples at a time
 Delay < 1 ms
 Only the pointer is transmitted




1024 vectors in the code book
10-bit pointer (index)
16 kbps
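The 16 kbps figure follows from sending one 10-bit index per five-sample vector. A minimal sketch of that arithmetic, assuming the 8 kHz sampling rate used for the other codecs in these slides:

```python
import math

def ld_celp_bit_rate(codebook_size=1024, samples_per_vector=5, sample_rate_hz=8000):
    """Bits per second when only a codebook index is sent per vector."""
    index_bits = math.ceil(math.log2(codebook_size))       # 10-bit pointer
    vectors_per_second = sample_rate_hz / samples_per_vector
    return index_bits * vectors_per_second

def vector_duration_ms(samples_per_vector=5, sample_rate_hz=8000):
    """Time span covered by one vector of samples."""
    return 1000.0 * samples_per_vector / sample_rate_hz

print(ld_celp_bit_rate(), vector_duration_ms())
```

Five samples at 8 kHz span 0.625 ms, which is consistent with the sub-millisecond delay listed above.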
LD-CELP encoder

Minimize a frequency-weighted mean-square error
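The analysis-by-synthesis search can be sketched as a loop over the codebook. This is a simplified illustration, not the G.728 algorithm: it uses a plain (unweighted) squared error, a zero initial filter state, and a per-vector least-squares gain, all of which are assumptions made to keep the example short.

```python
def synthesize(vector, a):
    """All-pole synthesis filter: y[n] = e[n] + sum_k a[k-1] * y[n-k] (zero state)."""
    y = []
    for n, e in enumerate(vector):
        acc = e
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * y[n - k]
        y.append(acc)
    return y

def search_codebook(target, codebook, a):
    """Try every codebook vector; return (index, gain, error) of the best match."""
    best = None
    for index, vector in enumerate(codebook):
        synth = synthesize(vector, a)
        energy = sum(s * s for s in synth)
        if energy == 0.0:
            continue
        gain = sum(t * s for t, s in zip(target, synth)) / energy  # least-squares gain
        error = sum((t - gain * s) ** 2 for t, s in zip(target, synth))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best

# Toy codebook of three five-sample vectors (made up for illustration).
codebook = [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [1, -1, 1, -1, 1]]
a = [0.5]
target = [2.0 * s for s in synthesize(codebook[1], a)]
print(search_codebook(target, codebook, a))
```

Only the winning index needs to be transmitted; the decoder holds the same codebook and filter.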

LD-CELP decoder


An MOS of about 3.9
One-quarter of G.711 bandwidth
G.723.1 ACELP

6.3 or 5.3 kbps



Both mandatory
Can change from one to another during a conversation
The coder






A band-limited input speech signal
Sampled at 8 kHz, 16-bit uniform PCM quantization
Operate on blocks of 240 samples at a time
A look-ahead of 7.5 ms
A total algorithmic delay of 37.5 ms + other delays
A high-pass filter to remove any DC component

G.723.1 Annex A

Silence Insertion Descriptor (SID) frames of size four octets
The two lsbs of the first octet indicate the frame type and size

Bits  Frame type  Octets/frame
00    6.3 kbps    24
01    5.3 kbps    20
10    SID frame   4
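A receiver can use the two least significant bits of the first octet to find each frame's size. The sketch below assumes the frames are simply concatenated in a byte string; `frame_info` and `split_frames` are hypothetical helper names.

```python
# Frame type selected by the two least significant bits of the first octet.
FRAME_TYPES = {
    0b00: ("6.3 kbps", 24),
    0b01: ("5.3 kbps", 20),
    0b10: ("SID", 4),
}

def frame_info(first_octet):
    """Return (frame type, frame size in octets) for a G.723.1 frame."""
    bits = first_octet & 0b11
    if bits not in FRAME_TYPES:
        raise ValueError("frame type bits 11 are not listed on the slide")
    return FRAME_TYPES[bits]

def split_frames(payload):
    """Split a byte string holding consecutive frames into individual frames."""
    frames, pos = [], 0
    while pos < len(payload):
        _, size = frame_info(payload[pos])
        frames.append(payload[pos:pos + size])
        pos += size
    return frames

# Made-up payload: one 6.3 kbps frame, one SID frame, one 5.3 kbps frame.
payload = bytes([0b00] + [0] * 23) + bytes([0b10, 0, 0, 0]) + bytes([0b01] + [0] * 19)
print([len(f) for f in split_frames(payload)])
```

Because the size is carried in every frame, the rate can change from one frame to the next without extra signaling.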
An MOS of about 3.8

At least 37.5 ms delay
G.729



8 kbps
Input frames of 10 ms, 80 samples at an 8 kHz sampling rate
5 ms look-ahead



Algorithmic delay of 15 ms
An 80-bit frame for 10 ms of speech
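The delay and bit-rate figures follow from the frame size and the look-ahead. A small sketch of the arithmetic, with helper names chosen only for illustration:

```python
def frame_duration_ms(samples_per_frame, sample_rate_hz=8000):
    """Duration of one frame in milliseconds."""
    return 1000.0 * samples_per_frame / sample_rate_hz

def algorithmic_delay_ms(samples_per_frame, look_ahead_ms, sample_rate_hz=8000):
    """One full frame must be buffered, plus the look-ahead."""
    return frame_duration_ms(samples_per_frame, sample_rate_hz) + look_ahead_ms

def bit_rate_bps(bits_per_frame, samples_per_frame, sample_rate_hz=8000):
    """Bits per frame divided by the frame duration."""
    return bits_per_frame * sample_rate_hz / samples_per_frame

print(algorithmic_delay_ms(80, 5.0))    # G.729: 10 ms frame + 5 ms look-ahead
print(bit_rate_bps(80, 80))             # G.729: 80 bits per 10 ms frame
print(algorithmic_delay_ms(240, 7.5))   # G.723.1: 240-sample frame + 7.5 ms look-ahead
```

The same calculation reproduces the 37.5 ms figure given earlier for G.723.1.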
A complex codec




G.729A (Annex A), a number of simplifications
Same frame structure
Encoder/decoder interoperable between G.729 and G.729A
Slightly lower quality

G.729B

VAD, Voice Activity Detection
Based on analysis of several parameters of the input
The current frame plus two preceding frames
DTX, Discontinuous Transmission
Send nothing or send an SID frame
The SID frame contains information to generate comfort noise
CNG, Comfort Noise Generation
G.729, an MOS of about 4.0
G.729A, an MOS of about 3.7
Other Codecs

CDMA QCELP defined in IS-733


Variable-rate coder
Two most common rates




The high rate, 13.3 kbps
A lower rate, 6.2 kbps
Silence suppression
For use with RTP, RFC 2658

GSM Enhanced Full-Rate (EFR)




GSM 06.60
An enhanced version of GSM Full-Rate
ACELP-based codec
The same bit rate and the same overall packing structure



12.2 kbps
Support discontinuous transmission
For use with RTP, RFC 1890

GSM Adaptive Multi-Rate (AMR) codec








GSM 06.90
Eight different modes
4.75 kbps to 12.2 kbps
12.2 kbps, GSM EFR
7.4 kbps, IS-641 (TDMA cellular systems)
Change the mode at any time
Offer discontinuous transmission
The coding choice of many 3G wireless networks

The MOS values are for laboratory conditions

G.711 does not deal with lost packets
G.729 can accommodate a lost frame by interpolating from previous frames
But this causes errors in subsequent speech frames
Processing Power


G.728 or G.729, 40 MIPS
G.726, 10 MIPS

Cascaded Codecs



E.g., G.711 stream -> G.729 encoder/decoder
Might not even come close to G.729 quality
Each coder only generates an approximation of the incoming signal
Tones, Signals, and DTMF Digits

The hybrid codecs are optimized for human speech






Other data may need to be transmitted
Tones: fax tones, dialing tone, busy tone
DTMF digits for two-stage dialing or voice-mail
G.711 is OK
G.723.1 and G.729 can be unintelligible
The ingress gateway needs to intercept


The tones and DTMF digits
Use an external signaling system



Easy at the start of a call
Difficult in the middle of a call
Encode the tones differently from the speech




Send them along the same media path
An RTP packet provides the name of the tone and the duration
Or, a dynamic RTP profile; an RTP packet containing the frequency, volume and the duration
RFC 2198
 An RTP payload format for redundant audio data
 Sending both types of RTP payload

RTP Payload Format for DTMF Digits



An Internet Draft
Both methods described before
A large number of tones and events


DTMF digits, a busy tone, a congestion tone, a ringing tone, etc.
The named events

E: the end of the tone, R: reserved

Payload format
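The payload-format figure did not survive in this transcript. As an illustration only, the sketch below assumes the named-event layout later published as RFC 2833: an 8-bit event code, the E bit, the R bit, a 6-bit volume, and a 16-bit duration in timestamp units. The helper names are hypothetical.

```python
import struct

def pack_event(event, end, volume, duration):
    """Pack a named-event payload: event (8 bits), E (1), R (1, zero),
    volume (6 bits), duration (16 bits), in network byte order."""
    if not (0 <= event <= 255 and 0 <= volume <= 63 and 0 <= duration <= 0xFFFF):
        raise ValueError("field out of range")
    flags = (0x80 if end else 0) | volume      # R bit (0x40) stays 0
    return struct.pack("!BBH", event, flags, duration)

def unpack_event(payload):
    """Decode a 4-octet named-event payload into its fields."""
    event, flags, duration = struct.unpack("!BBH", payload)
    return {
        "event": event,
        "end": bool(flags & 0x80),
        "volume": flags & 0x3F,
        "duration": duration,
    }

# DTMF digit 5 (event code 5), end of tone, volume 10, 800 timestamp units.
packet = pack_event(5, True, 10, 800)
print(packet.hex(), unpack_event(packet))
```

At an 8 kHz RTP clock, 800 timestamp units correspond to 100 ms of tone.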
Finis
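Frequency-Domain Representation of Sampling
$$s(t) = \sum_{n=-\infty}^{\infty} \delta(t - nT)$$
$$x_s(t) = x_c(t)\,s(t) = x_c(t) \sum_{n=-\infty}^{\infty} \delta(t - nT)$$
$$S(j\Omega) = \frac{2\pi}{T} \sum_{k=-\infty}^{\infty} \delta(\Omega - k\Omega_s)$$
[Figure: the band-limited spectrum $X_c(j\Omega)$, occupying $-\Omega_N$ to $\Omega_N$, and its copies repeated at multiples of $\Omega_s$; the first copy begins at $\Omega_s - \Omega_N$]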
Speech Source Model and Source Coding

Vocal Tract Model
$$u[n] + \sum_{k=1}^{p} a_k\,x[n-k] = x[n]$$
$$G(z) = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}} = \frac{X(z)}{U(z)}$$