Course Outline

Speech-Coding Techniques
Chapter 3
Introduction

Efficient speech-coding techniques





Advantages for VoIP
Digital streams of ones and zeros
The lower the bandwidth, the lower the quality
RTP payload types
Processing power


Better quality (for a given bandwidth) requires a more complex algorithm
A balance between quality and cost
Internet Telephony
3-2
Voice Quality

Bandwidth is easily quantified


Voice quality is subjective
MOS, Mean Opinion Score

ITU-T Recommendation P.800







Excellent – 5
Good – 4
Fair – 3
Poor – 2
Bad – 1
A minimum of 30 people
They listen to voice samples or take part in conversations

P.800 recommendations





The selection of participants
The test environment
Explanations to listeners
Analysis of results
Toll quality

A MOS of 4.0 or higher
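As a minimal sketch of how a MOS could be computed from a listening panel: the score is the arithmetic mean of the individual ratings on the 1 to 5 scale. The function names and the 30 ratings below are hypothetical, made up for illustration only.

```python
from statistics import mean

def mean_opinion_score(ratings):
    """Average listener ratings on the 1 (Bad) to 5 (Excellent) scale."""
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return mean(ratings)

def is_toll_quality(mos, threshold=4.0):
    """Toll quality is taken here as a MOS of 4.0 or higher."""
    return mos >= threshold

# Hypothetical panel of 30 listeners (invented scores, not measured data)
ratings = [5] * 8 + [4] * 16 + [3] * 6
mos = mean_opinion_score(ratings)
print(round(mos, 2), is_toll_quality(mos))
```

P.800 also governs participant selection, the test environment and the analysis of results; none of that is modelled here.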


Subjective and objective quality-testing techniques
PSQM – Perceptual Speech Quality Measurement
ITU-T P.861
Intended to faithfully represent human judgement and perception
Algorithmic comparison between the output signal and a known input
Factors: type of speaker, loudness, delay, active/silence frames, clipping, environmental noise
A Little About Speech

Speech




Air pushed from the lungs past the vocal cords and along the vocal tract
The basic vibrations – vocal cords
The sound is altered by the disposition of the vocal tract (tongue and mouth)
Model the vocal tract as a filter
The shape changes relatively slowly
The vibrations at the vocal cords
The excitation signal
Speech sounds

Voiced sound





The vocal cords vibrate open and close
Interrupt the air flow
Quasi-periodic pulses of air
The rate of the opening and closing – the pitch
A high degree of periodicity at the pitch period
2-20 ms

Voiced speech
[Figure: power spectrum density of voiced speech]

Unvoiced sounds





Forcing air at high velocities through a constriction
The glottis is held open
Noise-like turbulence
Show little long-term periodicity
Short-term correlations still present

Unvoiced speech
[Figure: power spectrum density of unvoiced speech]

Plosive sounds



A complete closure in the vocal tract
Air pressure is built up and released suddenly
A vast array of sounds


The speech signal is relatively predictable over time
The reduction of transmission bandwidth can be significant
Voice Sampling

A-to-D



Take discrete samples of the waveform and represent each sample by some number of bits
A signal can be reconstructed if it is sampled at a minimum of twice its maximum frequency
Human speech


300-3800 Hz
8000 samples per second
Each sample is encoded into an 8-bit PCM code word (e.g. 01100101)
=> 8000 x 8 = 64,000 bit/s
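The sampling arithmetic above can be checked with a short sketch. The helper names are hypothetical; the figures (3800 Hz upper speech frequency, 8000 samples per second, 8 bits per sample) are the ones given on the slide.

```python
def nyquist_rate_hz(max_freq_hz):
    """Minimum sampling rate: twice the highest frequency in the signal."""
    return 2 * max_freq_hz

def pcm_bit_rate(sample_rate_hz, bits_per_sample):
    """Bit rate of a PCM stream in bit/s."""
    return sample_rate_hz * bits_per_sample

min_rate = nyquist_rate_hz(3800)   # 7600 Hz, so 8000 samples/s is sufficient
rate = pcm_bit_rate(8000, 8)       # 64000 bit/s
print(min_rate, rate)
```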
Quantization


How many bits are used to represent each sample
Quantization noise
The difference between the actual level of the input analog signal and its quantized value
More bits to reduce quantization noise
Diminishing returns
Uniform quantization levels
Louder talkers sound better
11.2/11 vs. 2.2/2
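One way to read the 11.2/11 vs. 2.2/2 example: with a uniform step of 1, both samples suffer the same absolute error (0.2), but the error is a much larger fraction of the quiet sample. A minimal sketch, with hypothetical function names and a step size of 1 assumed:

```python
def quantize_uniform(x, step=1.0):
    """Round x to the nearest multiple of step."""
    return step * round(x / step)

def relative_error(x, step=1.0):
    """Quantization error as a fraction of the true sample value."""
    return abs(x - quantize_uniform(x, step)) / abs(x)

loud = relative_error(11.2)    # about 1.8 % of the signal
quiet = relative_error(2.2)    # about 9.1 % of the signal
print(f"loud: {loud:.3f}, quiet: {quiet:.3f}")
```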

Non-uniform quantization


Smaller quantization steps at smaller signal levels
Spread signal-to-noise ratio more evenly
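A rough sketch of non-uniform quantization using the continuous mu-law companding curve with mu = 255. This is an assumption for illustration: G.711 itself uses a segmented approximation of this curve, and the function names here are hypothetical. Samples are assumed to be normalized to the range -1..1.

```python
import math

MU = 255  # assumed companding constant (continuous mu-law formula)

def mu_compress(x):
    """Compress a sample in [-1, 1]; small values are expanded."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_expand(y):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def quantize(v, bits=8):
    """Uniform quantizer on [-1, 1] with a sign bit and 2**(bits-1) - 1 levels."""
    levels = 2 ** (bits - 1) - 1
    return round(v * levels) / levels

def companded(x, bits=8):
    """Compress, quantize uniformly, then expand: non-uniform steps overall."""
    return mu_expand(quantize(mu_compress(x), bits))

quiet = 0.01
uniform_error = abs(quantize(quiet) - quiet)
companded_error = abs(companded(quiet) - quiet)
print(uniform_error, companded_error)
```

For the quiet sample, the companded error is far smaller than the uniform one, which is the point of taking smaller steps at smaller signal levels.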
DTX and Comfort Noise



DTX is Discontinuous Transmission
A voice activity detector (VAD) detects whether there is active speech
When there is no active speech, different DTX procedures can be used:
No transmission at all
Comfort Noise (CN) using RFC 3389
Codec built-in CN, such as the AMR SID (Silence Descriptor)
The frequency of comfort noise packets varies but is usually some fraction of the normal packet rate
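The DTX behaviour above can be sketched as a per-frame decision. This is a toy model under stated assumptions: the VAD is a plain energy threshold (real detectors analyse several parameters), and the threshold value, the SID interval of 8 frames, and all names are hypothetical.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)

def dtx_decisions(frames, threshold=1000.0, sid_interval=8):
    """Return 'SPEECH', 'SID' or 'NONE' for each frame.

    During silence, a comfort-noise (SID) update is sent for the first
    silent frame and then once every sid_interval frames.
    """
    decisions = []
    silent_run = 0
    for frame in frames:
        if frame_energy(frame) >= threshold:
            silent_run = 0
            decisions.append("SPEECH")
        else:
            decisions.append("SID" if silent_run % sid_interval == 0 else "NONE")
            silent_run += 1
    return decisions

speech = [200, -200] * 80   # toy "active" frame
silence = [1, -1] * 80      # toy "silent" frame
decisions = dtx_decisions([speech] + [silence] * 9 + [speech])
print(decisions)
```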
Types of Speech Coders

Waveform codecs




Sample and code
High-quality and not complex
Large amount of bandwidth
Source codecs (vocoders)






Match the incoming signal to a math model
Linear-predictive filter model of the vocal tract
A voiced/unvoiced flag for the excitation
The information is sent rather than the signal
Low bit rates, but the speech sounds synthetic
Higher bit rates do not improve quality much

Hybrid codecs




Attempt to provide the best of both
Perform a degree of waveform matching
Utilize the sound production model
Quite good quality at low bit rate
G.711

The most commonplace codec
Used in circuit-switched telephone network
PCM, Pulse-Code Modulation
If uniform quantization
12 bits * 8 k/sec = 96 kbps
Non-uniform quantization
64 kbps DS0 rate
mu-law
North America
A-law
Other countries, a little friendlier to lower signal levels
An MOS of about 4.3
DPCM

DPCM, Differential PCM





Only transmit the difference between the predicted value and the actual value
Voice changes relatively slowly
It is possible to predict the value of a sample based on the values of previous samples
The receiver performs the same prediction
The simplest form


No prediction
No algorithmic delay
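A minimal sketch of the simplest form of DPCM, read here as sending the difference between successive samples (the previous sample serves as the reference, with no further prediction). The function names are hypothetical, and the differences are left unquantized, so this toy version is lossless; a real coder would quantize them to save bits.

```python
def dpcm_encode(samples):
    """Replace each sample by its difference from the previous sample."""
    previous = 0
    diffs = []
    for s in samples:
        diffs.append(s - previous)
        previous = s
    return diffs

def dpcm_decode(diffs):
    """Rebuild samples by accumulating the received differences."""
    previous = 0
    samples = []
    for d in diffs:
        previous += d
        samples.append(previous)
    return samples

samples = [100, 102, 105, 104, 101]
diffs = dpcm_encode(samples)
print(diffs)
print(dpcm_decode(diffs))
```

Because voice changes relatively slowly, the differences after the first sample are small and need fewer bits than the samples themselves.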
ADPCM

ADPCM, Adaptive DPCM

Predicts sample values based on
Past samples
Factoring in some knowledge of how speech varies over time
The error is quantized and transmitted
Fewer bits required
G.721
32 kbps
G.726
A-law/mu-law PCM -> 16, 24, 32, 40 kbps
An MOS of about 4.0 at 32 kbps
Analysis-by-Synthesis (AbS) Codecs

Hybrid codec


Fill the gap between waveform and source codecs
The most successful and commonly used







Time-domain AbS codecs
Not a simple two-state, voiced/unvoiced
Different excitation signals are attempted
The one that yields output closest to the original waveform is selected
MPE, Multi-Pulse Excited
RPE, Regular-Pulse Excited
CELP, Code-Excited Linear Predictive
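The analysis-by-synthesis idea can be sketched as a search loop: each candidate excitation is passed through the synthesis filter, and the candidate whose output is closest to the original is kept. Everything below is a toy under stated assumptions: a one-coefficient all-pole filter, a three-entry codebook and a plain squared error, all hypothetical. Real codecs use perceptually weighted error measures and far larger codebooks.

```python
def synthesize(excitation, coeffs):
    """All-pole filter: out[n] = excitation[n] + sum(coeffs[k] * out[n-1-k])."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(coeffs):
            if n - 1 - k >= 0:
                acc += a * out[n - 1 - k]
        out.append(acc)
    return out

def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def abs_search(target, codebook, coeffs):
    """Return the index of the excitation whose synthesis best matches target."""
    best_index, best_error = None, float("inf")
    for index, excitation in enumerate(codebook):
        err = squared_error(target, synthesize(excitation, coeffs))
        if err < best_error:
            best_index, best_error = index, err
    return best_index

codebook = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, -1, 0]]   # toy excitations
coeffs = [0.5]                                           # toy filter
target = synthesize(codebook[2], coeffs)
print(abs_search(target, codebook, coeffs))
```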
G.728 LD-CELP

CELP codecs


A filter; its characteristics change over time
A codebook of acoustic vectors
A vector = a set of elements representing various characteristics of the excitation
Transmit
Filter coefficients, gain, a pointer to the vector chosen
Low Delay CELP
Backward-adaptive coder
Use previous samples to determine filter coefficients
Operates on five samples at a time
Delay < 1 ms
Only the pointer is transmitted




1024 vectors in the code book
10-bit pointer (index)
16 kbps
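The 16 kbps figure follows from the numbers above, assuming the usual 8 kHz sampling rate (the rate is not stated on this slide): one 10-bit index is sent for every block of five samples. A short check, with hypothetical variable names:

```python
SAMPLE_RATE_HZ = 8000      # assumed telephony sampling rate
SAMPLES_PER_VECTOR = 5
CODEBOOK_SIZE = 1024

index_bits = (CODEBOOK_SIZE - 1).bit_length()               # 10 bits address 1024 vectors
vectors_per_second = SAMPLE_RATE_HZ // SAMPLES_PER_VECTOR   # 1600
bit_rate = vectors_per_second * index_bits                  # 16000 bit/s
block_ms = 1000 * SAMPLES_PER_VECTOR / SAMPLE_RATE_HZ       # 0.625 ms per block

print(index_bits, bit_rate, block_ms)
```

The 0.625 ms block length is consistent with the stated delay of under 1 ms.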
LD-CELP encoder

Minimize a frequency-weighted mean-square error

LD-CELP decoder


An MOS of about 3.9
One-quarter of G.711 bandwidth
G.723.1 ACELP

6.3 or 5.3 kbps



Both mandatory
Can change from one to the other during a conversation
The coder






A band-limited input speech signal
Sampled at 8 kHz, 16-bit uniform PCM quantization
Operates on blocks of 240 samples at a time
A look-ahead of 7.5 ms
A total algorithmic delay of 37.5 ms + other delays
A high-pass filter to remove any DC component




Various operations to determine the appropriate filter coefficients
5.3 kbps, Algebraic Code-Excited Linear Prediction
6.3 kbps, Multi-pulse Maximum Likelihood Quantization
The transmission
Linear prediction coefficients
Gain parameters
Excitation codebook index
24-octet frames at 6.3 kbps, 20-octet frames at 5.3 kbps
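The G.723.1 figures can be tied together with a little arithmetic: 240 samples at 8 kHz is a 30 ms frame, and adding the 7.5 ms look-ahead gives the 37.5 ms algorithmic delay. The helper names below are hypothetical. The octet-aligned frame sizes work out slightly above the nominal 6.3 and 5.3 kbps.

```python
def frame_duration_ms(samples, sample_rate_hz=8000):
    return 1000 * samples / sample_rate_hz

def bit_rate_bps(frame_octets, frame_ms):
    return frame_octets * 8 * 1000 / frame_ms

frame_ms = frame_duration_ms(240)          # 30.0 ms
algorithmic_delay_ms = frame_ms + 7.5      # 37.5 ms
high_rate = bit_rate_bps(24, frame_ms)     # 6400 bit/s on the wire
low_rate = bit_rate_bps(20, frame_ms)      # about 5333 bit/s on the wire

print(frame_ms, algorithmic_delay_ms, high_rate, round(low_rate))
```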

G.723.1 Annex A


Silence Insertion Descriptor (SID) frames of size four octets
The two lsbs of the first octet indicate the frame type

Bits   Frame type   Frame size
00     6.3 kbps     24 octets
01     5.3 kbps     20 octets
10     SID frame    4 octets
An MOS of about 3.8

At least 37.5 ms delay
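A receiver could use the two least significant bits of the first octet to decide how many octets the frame occupies. The sketch below uses only the three codes listed on this slide; the function name is hypothetical, and the value 11, which the slide does not describe, is rejected.

```python
# Frame types keyed by the two lsbs of the first octet (from the slide)
FRAME_TYPES = {
    0b00: ("6.3 kbps", 24),
    0b01: ("5.3 kbps", 20),
    0b10: ("SID", 4),
}

def classify_frame(first_octet):
    """Return (frame type, frame size in octets) for a G.723.1 frame."""
    code = first_octet & 0b11
    if code not in FRAME_TYPES:
        raise ValueError(f"frame type bits {code:02b} not covered here")
    return FRAME_TYPES[code]

print(classify_frame(0b10110100))   # low bits 00
print(classify_frame(0x01))         # low bits 01
print(classify_frame(0xFE))         # low bits 10
```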
G.729



8 kbps
Input frames of 10 ms, 80 samples at an 8 kHz sampling rate
5 ms look-ahead



Algorithmic delay of 15 ms
An 80-bit frame for 10 ms of speech
A complex codec




G.729.A (Annex A), a number of simplifications
Same frame structure
The encoder and decoder can be mixed: G.729 with G.729.A
Slightly lower quality

G.729.B

VAD, Voice Activity Detection
Based on analysis of several parameters of the input
The current frame plus two preceding frames
DTX, Discontinuous Transmission
Send nothing or send an SID frame
SID frame contains information to generate comfort noise
CNG, Comfort Noise Generation
G.729, an MOS of about 4.0
G.729A, an MOS of about 3.7

G.729 Annex D




A lower-rate extension
6.4 kbps; 10 ms speech frames, 64 bits/frame
MOS ≈ that of 6.3 kbps G.723.1
G.729 Annex E






A higher bit rate enhancement
The linear prediction filter of G.729 has 10 coefficients
That of G.729 Annex E has 30 coefficients
The codebook of G.729 has 35 bits
That of G.729 Annex E has 44 bits
118 bits/frame; 11.8 kbps
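The bit rates of the G.729 family follow directly from the frame sizes, since every variant listed here codes 10 ms of speech per frame. A small sketch, with hypothetical names; the frame sizes and the 5 ms look-ahead are the figures given on the slides.

```python
FRAME_MS = 10   # all variants below use 10 ms frames

FRAME_BITS = {
    "G.729": 80,
    "G.729 Annex D": 64,
    "G.729 Annex E": 118,
}

def bit_rate_kbps(frame_bits, frame_ms=FRAME_MS):
    """Bits per millisecond is numerically the rate in kbit/s."""
    return frame_bits / frame_ms

rates = {name: bit_rate_kbps(bits) for name, bits in FRAME_BITS.items()}
algorithmic_delay_ms = FRAME_MS + 5   # frame plus 5 ms look-ahead

for name, rate in rates.items():
    print(f"{name}: {rate} kbps")
```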
Other Codecs

CDMA QCELP defined in IS-733


Variable-rate coder
Two most common rates




The high rate, 13.3 kbps
A lower rate, 6.2 kbps
Silence suppression
For use with RTP, RFC 2658

GSM Enhanced Full-Rate (EFR)




GSM 06.60
An enhanced version of GSM Full-Rate
ACELP-based codec
The same bit rate and the same overall packing structure



12.2 kbps
Supports discontinuous transmission
For use with RTP, RFC 1890

GSM Adaptive Multi-Rate (AMR) codec







20 ms coding delay
Eight different modes
4.75 kbps to 12.2 kbps
12.2 kbps, GSM EFR
7.4 kbps, IS-641 (TDMA cellular systems)
Can change the mode at any time
Offers discontinuous transmission
The SID (Silence Descriptor) is sent in every 8th frame and is 5 bytes in size
The codec of choice for many 3G wireless networks

The MOS values are for laboratory conditions


G.711 does not deal with lost packets
G.729 can accommodate a lost frame by interpolating from previous frames
But this causes errors in subsequent speech frames
Processing Power


G.728 or G.729, 40 MIPS
G.726 10 MIPS
iLBC



A free codec for robust VoIP
13.33 kbps with an encoding frame length of 30 ms, and 15.20 kbps with a frame length of 20 ms
Computational complexity in the range of G.729A
Speex



Open-source patent-free speech codec
CELP (code-excited linear prediction) codec
Operating modes:
narrowband (8 kHz sampling rate)
2.15 – 24.6 kb/s
delay of 30 ms
wideband (16 kHz sampling rate)
4 – 44.2 kb/s
delay of 34 ms
ultra-wideband (32 kHz sampling rate)
intensity stereo encoding
variable bit rate (VBR) possible
voice activity detection (VAD)

Cascaded Codecs




E.g., G.711 stream -> G.729 encoder/decoder
The resulting quality might not even come close to that of G.729
Each coder only generates an approximation of the incoming signal
Audio samples

http://www.cs.columbia.edu/~hgs/audio/codecs.html
Effects of packetization
Tones, Signal, and DTMF Digits

The hybrid codecs are optimized for human speech
Other data may need to be transmitted
Tones: fax tones, dial tone, busy tone
DTMF digits for two-stage dialing or voice-mail
G.711 is OK
With G.723.1 and G.729, tones can become unintelligible
The ingress gateway needs to intercept the tones and DTMF digits
Use an external signaling system



Easy at the start of a call
Difficult in the middle of a call
Encode the tones differently from the speech
Send them along the same media path
An RTP packet provides the name of the tone and the duration
Or, a dynamic RTP profile; an RTP packet containing the frequency, volume and the duration
RFC 2198
An RTP payload format for redundant audio data
Sending both types of RTP payload

RTP Payload Format for DTMF Digits



An Internet Draft
Both methods described before
A large number of tones and events


DTMF digits, a busy tone, a congestion tone, a ringing tone, etc.
The named events

E: the end of the tone, R: reserved

Payload format
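The payload figure itself is not reproduced here. As a sketch, the parser below assumes the named-event layout later published as RFC 2833: an 8-bit event code, the E (end) bit, the R (reserved) bit, a 6-bit volume and a 16-bit duration in timestamp units. The function name is hypothetical, and the draft this slide refers to may have differed in detail.

```python
import struct

def parse_telephone_event(payload):
    """Decode a 4-octet named-event payload (layout assumed from RFC 2833)."""
    event, flags_volume, duration = struct.unpack("!BBH", payload[:4])
    return {
        "event": event,                           # e.g. 0-9 for DTMF digits 0-9
        "end": bool(flags_volume & 0x80),         # E: the end of the tone
        "reserved": bool(flags_volume & 0x40),    # R: reserved
        "volume": flags_volume & 0x3F,
        "duration": duration,                     # in RTP timestamp units
    }

# Hypothetical packet: digit 5, end bit set, volume 10, duration 800 units
parsed = parse_telephone_event(bytes([5, 0x8A, 0x03, 0x20]))
print(parsed)
```

At an 8 kHz RTP clock, a duration of 800 timestamp units corresponds to 100 ms of tone.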