lec2 slides - ECE - University of Maryland

ENEE408G Spring 2004
Lecture-2
Digital Speech Processing and Coding
Fall’05 Instructor: Carol Espy-Wilson
Electrical & Computer Engineering
University of Maryland, College Park
http://www.ece.umd.edu/class/enee408g/
http://umd.blackbloard.com/
 minwu@umd.edu
ENEE408G Capstone -- Multimedia Signal Processing (F'05)
UMCP ENEE408G Slides (created by M.Wu & R.Liu © 2002)
Last Lecture

Course overview and logistics

Bring multimedia to digital world: sampling & quantization

Introduction to speech processing
– Different aspects of speech

Friday Lab Session
– Speech Processing, Coding, Recognition, & HCI

Today: speech processing, coding, synthesis
Lec2 – Introduction 2/4/04 [2]
Speech Production
Source-Filter View of Speech Production (Stevens 1999)
[Figure: source spectrum, vocal tract transfer function, and radiation characteristics combine to give the power spectrum of the speech signal]
“Sprouted grains and seeds are used in salads and dishes such as chop suey”
[Figure: spectrogram of the utterance; Time (sec) from 0.0 to 4.0, Frequency (Hz) up to 8000, with F2 marked]
UMCP ENEE408G Slides (created by Carol Espy-Wilson © 2004)
[Figure: expanded spectrogram, time 0.1 to 0.5 s, Frequency (Hz) up to 8000; segments labeled fricative, stop, glide, vowel, and consonant]
Phonetic Features (Chomsky & Halle, 1968)

There are three kinds of phonetic features
– Source features determine the kind of excitation signal
– Manner of articulation features determine how open or closed the vocal tract is
– Place of articulation features determine the location of the primary constriction
Source feature “voiced”
+voiced: /z/
-voiced: /s/
Source Feature voiced
[Figure: spectrogram of “Sprouted”, Time (sec) 0.1 to 0.5, Frequency (Hz) up to 8000. Turbulence marks the -voiced segment; vertical striations mark the +voiced segment.]
Glottal Source
(Klatt & Klatt 1990)
Voice Quality-APP Detector
Modal Voice
Breathy Voice
Creaky Voice
Manner feature “sonorant”
+sonorant: primary source at glottis (vowel)
-sonorant: primary source above the glottis, at alveolar ridge (/z/)
Source Feature sonorant
[Figure: spectrogram of “Sprouted”, Time (sec) 0.1 to 0.5, Frequency (Hz) up to 8000. High-frequency energy marks the -sonorant segments; low-frequency energy marks the +sonorant segments.]
Place feature for stop consonants
+labial: /p/
+alveolar: /t/
Place Feature Labial vs. Alveolar
[Figure: spectrum of labial /b/, dB vs. Frequency (Hz); spectral prominence falling]
Place Feature Labial vs. Alveolar
[Figure: spectra of labial /p/ and alveolar /t/, dB vs. Frequency (Hz); spectral prominence falling for the labial, rising for the alveolar]
UMCP ENEE408G Slides (created by M.Wu © 2003)
Source-Filter Theory

Figure 1 of SPM May’98
Speech Survey
First “speaking machine” at the 1930s NY World’s Fair
– 14 keys, 1 wristband, 1 pedal

Modeling speech production as a linear system
– Sound sources: either voiced or unvoiced
– Voiced sound: modeled by a generator of pulses
– Unvoiced sound: modeled by a white noise generator
– Articulation: modeled by a cascade of single-resonance (pole) digital filters
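The linear model above can be sketched in a few lines. This is only an illustration: the 8 kHz sampling rate, the 100 Hz pitch, the formant frequencies and bandwidths, and all function names are assumptions chosen for the example, not values from the slides.

```python
import math
import random


def resonator_coeffs(freq_hz, bandwidth_hz, fs):
    """Feedback coefficients of a two-pole resonator with poles at radius r, angle theta."""
    r = math.exp(-math.pi * bandwidth_hz / fs)
    theta = 2.0 * math.pi * freq_hz / fs
    return 2.0 * r * math.cos(theta), -r * r


def resonate(x, b1, b2):
    """Run y[n] = x[n] + b1*y[n-1] + b2*y[n-2] over the input list."""
    y1 = y2 = 0.0
    out = []
    for v in x:
        y = v + b1 * y1 + b2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def excitation(n, fs, voiced, pitch_hz=100.0, seed=0):
    """Voiced: unit pulse train at the pitch period. Unvoiced: white Gaussian noise."""
    if voiced:
        period = int(round(fs / pitch_hz))
        return [1.0 if i % period == 0 else 0.0 for i in range(n)]
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def synthesize(n, fs, voiced, formants):
    """Pass the excitation through a cascade of single-resonance filters."""
    x = excitation(n, fs, voiced)
    for freq_hz, bandwidth_hz in formants:
        b1, b2 = resonator_coeffs(freq_hz, bandwidth_hz, fs)
        x = resonate(x, b1, b2)
    return x


FS = 8000  # assumed sampling rate in Hz
FORMANTS = [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)]  # illustrative (Hz, Hz) pairs
voiced = synthesize(800, FS, True, FORMANTS)
unvoiced = synthesize(800, FS, False, FORMANTS)
```

Switching the `voiced` flag changes only the source; the filter cascade (the articulation part of the model) stays the same.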
UMCP ENEE408G Slides (created by R. Liu & M.Wu © 2002)
Linear Separable Model for Speech Production

Vocal tract is modeled as a linear time-varying system
– Parameters of the linear system are slowly varying
– Excited by time-varying source (voiced or unvoiced)

Practical models
– Model each speech frame as Linear Time-Invariant
– Excited by either voiced or unvoiced source
– Allow overlaps in neighbouring frames
Figure 3.2 of Furui’s book
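The frame-based model above starts by cutting the signal into overlapping frames, each treated as the output of one LTI system. A minimal sketch follows; the 30 ms frame length, the 50% overlap, and the function name are assumptions for illustration only.

```python
def split_into_frames(signal, frame_len, hop):
    """Return overlapping frames; consecutive frames share frame_len - hop samples."""
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    last_start = len(signal) - frame_len
    return [signal[start:start + frame_len] for start in range(0, last_start + 1, hop)]


FS = 8000                    # assumed sampling rate in Hz
FRAME_LEN = 240              # 30 ms at 8 kHz (assumed frame length)
HOP = 120                    # 15 ms hop, i.e. 50% overlap (assumed)
signal = list(range(800))    # placeholder samples standing in for speech
frames = split_into_frames(signal, FRAME_LEN, HOP)
```

Each frame would then get its own set of model parameters and its own voiced/unvoiced decision.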
Speech Coding
Statistical Properties of Speech
Digital Speech Processing by Rabiner and Schafer
[Figure panels: lowpass filtered (0-3400 Hz); bandpass filtered (200-3400 Hz)]
Digital Coding of Speech
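Information capacity: I = B·fs (bits per sample × sampling rate)
[Figure: bit-rate scale in kbps (200, 64, 16, 9.6, 7.2, 4.8, 0.05), with waveform coding at the high-rate end and source coding at the low-rate end; quality regions labeled broadcast, toll, communications, and synthetic]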
Waveform coders: quantize speech samples directly at high bit rates.
Source coders (vocoders): use knowledge of speech production to parameterize the signal (model based)
Hybrid coders: partly waveform based and partly model based (2.4-16 kbps)
PCM coding
[Block diagram: input signal I(x,y) → Sampler → Quantizer → Encoder → transmit (digitize/capture device)]

How to encode a signal into bits?
– Sample and perform uniform quantization (2 parameters: Δ, the equal quantization step size, and B, the # of bits)
– “Pulse Code Modulation” (PCM)
– 8 bits per sample ~ good for speech
– 16 bits ~ needed for high-quality music

Tradeoff between fidelity and file size

How to “squeeze” out redundancy?
Discussion on Improving PCM (1)

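2 parameters: step size $\Delta$, # of bits B

Peak-to-peak range is $2X_{\max}$, so $\Delta = \frac{2X_{\max}}{2^B}$

Assume $e[n] = \hat{x}[n] - x[n]$
– where e[n] is uncorrelated with x[n], and it is uniformly distributed: $p_e(e) = \frac{1}{\Delta}$ for $-\frac{\Delta}{2} \le e < \frac{\Delta}{2}$
– so $\sigma_e^2 = \frac{\Delta^2}{12}$

$\mathrm{SNR} = \frac{\sigma_x^2}{\sigma_e^2} = \frac{3 \cdot 2^{2B}}{[X_{\max}/\sigma_x]^2}$

$\mathrm{SNR(dB)} = 6B + 4.77 - 20\log_{10}\left[\frac{X_{\max}}{\sigma_x}\right]$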
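The SNR formula above can be checked numerically. The sketch below assumes a mid-rise quantizer and a test signal drawn uniformly from [-Xmax, Xmax]; both choices, and all function names, are illustrative assumptions. It uses 6.02 dB per bit, which the slide rounds to 6.

```python
import math
import random


def uniform_quantize(x, x_max, bits):
    """Mid-rise uniform quantizer with 2**bits levels covering [-x_max, x_max)."""
    half = 2 ** (bits - 1)
    delta = 2.0 * x_max / (2 ** bits)
    idx = max(-half, min(half - 1, math.floor(x / delta)))  # clamp overload
    return (idx + 0.5) * delta


def snr_db(signal, quantized):
    """Measured SNR: signal power over quantization-error power, in dB."""
    sig = sum(v * v for v in signal) / len(signal)
    err = sum((q - v) ** 2 for v, q in zip(signal, quantized)) / len(signal)
    return 10.0 * math.log10(sig / err)


def predicted_snr_db(bits, x_max, sigma_x):
    """SNR(dB) = 6.02 B + 4.77 - 20 log10(Xmax / sigma_x)."""
    return 6.02 * bits + 4.77 - 20.0 * math.log10(x_max / sigma_x)


rng = random.Random(0)
X_MAX = 1.0
BITS = 8
signal = [rng.uniform(-X_MAX, X_MAX) for _ in range(20000)]
quantized = [uniform_quantize(v, X_MAX, BITS) for v in signal]
sigma_x = math.sqrt(sum(v * v for v in signal) / len(signal))
measured = snr_db(signal, quantized)
predicted = predicted_snr_db(BITS, X_MAX, sigma_x)
print(f"measured {measured:.2f} dB, predicted {predicted:.2f} dB")
```

Lowering the signal level relative to Xmax reduces the SNR through the last term, which is the motivation for the non-uniform quantizers discussed next.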
Uniform quantization
Digital Speech Processing by Rabiner and Schafer
Discussion on Improving PCM (1)

Uniform quantization may give an inconsistent range of relative errors
– E.g., an error of +/- 2 is 20% at amplitude 10 vs. 2% at amplitude 100

Non-uniform quantization
– Assign smaller quantization step size at small amplitude to maintain a consistent range of relative quantization errors over the entire dynamic range
– Can apply non-linear transform before uniform quantization via “companding” (compression-expansion)
– μ-law companding: international standard for 64kbps speech
Discussion on Improving PCM (1)
$y[n] = \ln|x[n]|$
$x[n] = e^{y[n]}\,\mathrm{sign}(x[n])$, where $\mathrm{sign}(x[n]) = +1$ for $x[n] \ge 0$ and $-1$ for $x[n] < 0$
$\hat{y}[n] = \ln|x[n]| + \varepsilon[n]$
$\hat{x}[n] = e^{\hat{y}[n]}\,\mathrm{sign}(x[n])$
$\hat{x}[n] = |x[n]|\,\mathrm{sign}(x[n])\,e^{\varepsilon[n]} = x[n]\,e^{\varepsilon[n]}$
$\hat{x}[n] \approx x[n](1 + \varepsilon[n]) = x[n] + x[n]\varepsilon[n]$
$\hat{x}[n] = x[n] + e[n]$, with $e[n] = x[n]\varepsilon[n]$
$\mathrm{SNR} = \frac{\sigma_x^2}{\sigma_e^2} = \frac{\sigma_x^2}{\sigma_x^2\sigma_\varepsilon^2} = \frac{1}{\sigma_\varepsilon^2}$
Discussion on Improving PCM (1)
Digital Speech Processing by Rabiner and Schafer
But $\ln[0] = -\infty$, so pure log companding is not practical. Use instead:
$y[n] = X_{\max}\,\frac{\log\left[1 + \mu\frac{|x[n]|}{X_{\max}}\right]}{\log[1 + \mu]}\,\mathrm{sign}(x[n])$
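The compression formula above and its inverse can be sketched as follows. The value μ = 255, the 8-bit quantizer, and the function names are assumptions made for the example; the slide itself does not fix them.

```python
import math


def mu_law_compress(x, x_max=1.0, mu=255.0):
    """y = Xmax * log(1 + mu*|x|/Xmax) / log(1 + mu) * sign(x)."""
    magnitude = x_max * math.log(1.0 + mu * abs(x) / x_max) / math.log(1.0 + mu)
    return math.copysign(magnitude, x)


def mu_law_expand(y, x_max=1.0, mu=255.0):
    """Inverse of mu_law_compress: |x| = (Xmax/mu) * ((1 + mu)**(|y|/Xmax) - 1)."""
    magnitude = x_max / mu * ((1.0 + mu) ** (abs(y) / x_max) - 1.0)
    return math.copysign(magnitude, y)


def quantize_uniform(v, x_max, bits):
    """Mid-rise uniform quantizer over [-x_max, x_max)."""
    half = 2 ** (bits - 1)
    delta = x_max / half
    idx = max(-half, min(half - 1, math.floor(v / delta)))
    return (idx + 0.5) * delta


def relative_error(x, x_hat):
    return abs(x_hat - x) / abs(x)


x = 0.01  # a small amplitude, where uniform quantization is relatively coarse
plain = quantize_uniform(x, 1.0, 8)
companded = mu_law_expand(quantize_uniform(mu_law_compress(x), 1.0, 8))
err_uniform = relative_error(x, plain)
err_companded = relative_error(x, companded)
```

For this small input the companded path gives a much smaller relative error than plain 8-bit uniform quantization, which is the effect the derivation predicts.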
Discussion on Improving PCM (1)
Log Companding
Digital Speech Processing by Rabiner and Schafer
Discussion on Improving PCM (2)

Quantized PCM values may not be equally likely
– Can we do better than encoding each value with the same # of bits?

Example
– P(“0”) = 0.5, P(“1”) = 0.25, P(“2”) = 0.125, P(“3”) = 0.125
– If we use the same # of bits for all values: need 2 bits to represent the four possibilities
– If we use fewer bits for likely values such as “0” ~ Variable Length Codes (VLC)
  “0” => [0], “1” => [10], “2” => [110], “3” => [111]
  Use 1.75 bits on average ~ saves 0.25 bit per sample!

Bring probability into the picture
– Use probability distribution to reduce average # bits per quantized sample
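The arithmetic in the example above can be reproduced directly. The sketch below uses the probabilities and codewords from the slide; the function names are hypothetical, and the entropy calculation is added only as a reference point for how short the average length can get.

```python
import math

PROBS = {"0": 0.5, "1": 0.25, "2": 0.125, "3": 0.125}
CODE = {"0": "0", "1": "10", "2": "110", "3": "111"}


def average_length(probs, code):
    """Expected number of bits per symbol under the given code."""
    return sum(p * len(code[s]) for s, p in probs.items())


def entropy_bits(probs):
    """Entropy of the source in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


def encode(symbols, code):
    return "".join(code[s] for s in symbols)


def decode(bits, code):
    """Decode a prefix code by growing a codeword until it matches."""
    inverse = {word: sym for sym, word in code.items()}
    out, word = [], ""
    for b in bits:
        word += b
        if word in inverse:
            out.append(inverse[word])
            word = ""
    if word:
        raise ValueError("bit string ends in the middle of a codeword")
    return out


avg = average_length(PROBS, CODE)   # 0.5*1 + 0.25*2 + 0.125*3 + 0.125*3
h = entropy_bits(PROBS)
message = ["0", "1", "2", "3", "0"]
bits = encode(message, CODE)
```

Because no codeword is a prefix of another, the decoder needs no separators between codewords.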
How to Encode Correlated Sequence?

Consider: high correlation between successive samples

Predictive coding
– Basic principle: remove redundancy between successive samples and only encode the residual between actual and predicted values
– Residue usually has much smaller dynamic range: allows fewer quantization levels for the same MSE => get compression
– Compression efficiency depends on intersample redundancy

First try
[Block diagram, encoder: e(n) = u(n) − u’P(n), with predictor u’P(n) = u(n-1); e(n) → Quantizer → eQ(n)]
[Block diagram, decoder: uQ(n) = eQ(n) + uP(n), with predictor uP(n) = uQ(n-1)]
Predictive Coding (cont’d)

Problem with 1st try
– Inputs to the predictor are different at encoder and decoder: decoder doesn’t know u(n)!
– Mismatch error could propagate to future reconstructed samples

Solution: Differential PCM (DPCM)
– Use quantized sequence uQ(n) for prediction at both encoder and decoder
– Prediction error e(n)
– Quantized prediction error eQ(n)
– Distortion d(n) = e(n) – eQ(n)

[Block diagram, encoder: e(n) = u(n) − uP(n) → Quantizer → eQ(n); uQ(n) = eQ(n) + uP(n) feeds the predictor, uP(n) = uQ(n-1)]
[Block diagram, decoder: uQ(n) = eQ(n) + uP(n), with predictor uP(n) = uQ(n-1)]

Think: what predictor to use?
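The difference between the first try and DPCM can be seen on a slow ramp. This is a minimal sketch with the previous-sample predictor uP(n) = uQ(n-1); the rounding quantizer, the step size, and the test signal are assumptions chosen to make the error propagation visible.

```python
def quantize(e, step):
    """Uniform quantizer for the prediction error (round to the nearest multiple of step)."""
    return step * round(e / step)


def dpcm_encode(u, step):
    """DPCM: predict from the quantized reconstruction uQ(n-1)."""
    eq_seq = []
    u_q_prev = 0.0
    for sample in u:
        u_p = u_q_prev                    # uP(n) = uQ(n-1)
        e_q = quantize(sample - u_p, step)
        eq_seq.append(e_q)
        u_q_prev = u_p + e_q              # same reconstruction the decoder will form
    return eq_seq


def open_loop_encode(u, step):
    """First try: predict from the unquantized u(n-1), which the decoder never sees."""
    eq_seq = []
    prev = 0.0
    for sample in u:
        eq_seq.append(quantize(sample - prev, step))
        prev = sample
    return eq_seq


def decode(eq_seq):
    """Decoder shared by both schemes: uQ(n) = eQ(n) + uQ(n-1)."""
    out = []
    u_q_prev = 0.0
    for e_q in eq_seq:
        u_q_prev = u_q_prev + e_q
        out.append(u_q_prev)
    return out


def max_abs_error(u, u_hat):
    return max(abs(a - b) for a, b in zip(u, u_hat))


STEP = 1.0
u = [0.3 * n for n in range(50)]          # slow ramp: each increment is below STEP/2
err_dpcm = max_abs_error(u, decode(dpcm_encode(u, STEP)))
err_open = max_abs_error(u, decode(open_loop_encode(u, STEP)))
```

With the open-loop encoder every increment of 0.3 quantizes to zero, so the decoder output never moves and the error keeps growing; with DPCM the reconstruction error stays within half a step.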
Linear Prediction Analysis of Speech
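$\{a_i\}$ are called Linear Prediction Coefficients (LPC)
[Block diagram, Analysis: s[n] feeds a tapped delay line ($z^{-1}$ delays weighted by $a_1, a_2, \ldots, a_P$) whose output is subtracted from s[n] to give e[n]]
[Block diagram, Synthesis: e[n] is added to the output of the same tapped delay line, driven by the output, to give s[n]]
Error minimization => normal equations
$\min_{\{a_k\}} E = \sum_n E(e^2[n]) = \sum_n E\big((s[n] - \hat{s}[n])^2\big)$
$S\hat{a} = s$
=> Can be solved using the famous Levinson Recursion, which leads to lattice formulation of the linear prediction solution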
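A compact sketch of the Levinson recursion follows, assuming the autocorrelation form of the normal equations, sum over k of a_k r[|i-k|] = r[i] for i = 1..P. The function names and the example autocorrelation values are illustrative assumptions. The intermediate values k are the reflection coefficients that appear in the lattice formulation.

```python
def autocorrelation(s, max_lag):
    """r[k] = sum_n s[n] * s[n-k] for k = 0..max_lag."""
    n = len(s)
    return [sum(s[i] * s[i - k] for i in range(k, n)) for k in range(max_lag + 1)]


def levinson(r, order):
    """Solve the Toeplitz normal equations for a_1..a_P.

    Returns (a, reflection, err): predictor coefficients with
    s_hat[n] = sum_k a[k-1] * s[n-k], the reflection coefficients,
    and the final prediction-error energy.
    """
    a = [0.0] * (order + 1)      # a[1..order]; a[0] is unused
    err = r[0]
    reflection = []
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        reflection.append(k)
        err *= (1.0 - k * k)
    return a[1:], reflection, err


# Autocorrelation of a first-order process with correlation 0.9 (assumed example):
# the order-2 predictor should reduce to a_1 = 0.9, a_2 = 0.
coeffs, refl, final_err = levinson([1.0, 0.9, 0.81], 2)

# A generic positive-definite example, used to check the normal equations.
r_example = [1.0, 0.5, 0.2, 0.1]
a_example, _, _ = levinson(r_example, 3)
```

The recursion needs on the order of P squared operations, compared with P cubed for a general linear solver.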
Source-Filter View of Speech Production
[Block diagram: e(t), E(ω) → v(t), V(ω) → r(t), R(ω) → s(t), S(ω)]
s(t) = e(t)*v(t)*r(t)
S(ω) = E(ω)V(ω)R(ω)
All-Pole Modeling of Speech

Auto-regressive (AR) model: all-pole filter
$H(z) = \sigma\,G(z)V(z)R(z) = \frac{\sigma}{1 - \sum_{k=1}^{P} a_k z^{-k}} = \frac{\sigma}{A(z)}$
– H(z) is the overall transfer function
– Glottal Flow G(z), Vocal Tract V(z), Radiation R(z), Gain σ

Synthesis process:
u[n]: the vocal tract input, s[n]: speech output
[Block diagram: u[n] → H(z) = σ/A(z) → s[n]]
All-Pole Model and Linear Prediction
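$\frac{S(z)}{U(z)} = \frac{\sigma}{A(z)} = \frac{\sigma}{1 - \sum_{k=1}^{P} a_k z^{-k}}$

$\Rightarrow S(z) - \sum_{k=1}^{P} a_k S(z) z^{-k} = \sigma U(z)$

$\Rightarrow s[n] = \sum_{k=1}^{P} a_k s[n-k] + \sigma u[n]$

Here $\hat{s}[n] = \sum_{k=1}^{P} a_k s[n-k]$ is a linear prediction of order P for s[n]

[Block diagram: s[n] → P(z) → $\hat{s}[n]$, which is subtracted from s[n] to give e[n]; conversely $\hat{s}[n] + e[n]$ gives s[n]]

where $e[n] = s[n] - \hat{s}[n]$ is the prediction error sequence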
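The analysis and synthesis relations above are exact inverses of each other, which a short sketch can confirm. The coefficient values, the test signal, and the function names are assumptions for illustration; samples before n = 0 are taken as zero.

```python
import math


def prediction_error(s, a):
    """Analysis filter A(z): e[n] = s[n] - sum_k a[k-1] * s[n-k]."""
    p = len(a)
    return [
        s[n] - sum(a[k - 1] * s[n - k] for k in range(1, p + 1) if n - k >= 0)
        for n in range(len(s))
    ]


def synthesize(e, a):
    """Synthesis filter 1/A(z): s[n] = e[n] + sum_k a[k-1] * s[n-k]."""
    p = len(a)
    s = []
    for n in range(len(e)):
        pred = sum(a[k - 1] * s[n - k] for k in range(1, p + 1) if n - k >= 0)
        s.append(e[n] + pred)
    return s


A_COEFFS = [1.2, -0.5]                              # assumed stable order-2 predictor
speech = [math.sin(0.3 * n) for n in range(50)]     # stand-in for a speech frame
residual = prediction_error(speech, A_COEFFS)
rebuilt = synthesize(residual, A_COEFFS)
```

When the all-pole model fits the signal, the residual e[n] plays the role of the scaled excitation σu[n].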
Model-based Coding

Linear Prediction Coder (LPC)
– LPC Vocoder (voice coder)
  Divide speech into frames (several tens of milliseconds) and encode the LPC coefficients of each frame
  Additional parameters to facilitate synthesis: voiced/unvoiced flag, gain, pitch (for voiced)
– Line Spectrum Pair (LSP) Coding

Hybrid Coding: LPC Residual Coding
– Between LPC and waveform coding
Lec2 – Introduction 2/4/04 [47]
UMCP ENEE408G Slides (created by R. Liu & M.Wu © 2002)
Line Spectrum Pair (LSP) Coding

Pros and Cons of LPC method
– Good performance at coding rate down to 2.4kbps
– Synthesized voice becomes unnatural when below 2.4kbps
– When the poles are near the unit circle, quantization in LPC coefficients may result in instability.

LSP parameters
– LSP are frequencies extracted from polynomials constructed from LPC coefficients
– Frequency domain features (similar to formant) => produce less distortion due to quantization
[See details in Design Project on Speech]
Hybrid Coding

“Hybrid” – between LPC and waveform coding
– LPC Residual Coding: encode and slowly update LPC coefficients, and send the LPC residual (e.g. encoded using Vector Quantization)

Advantages:
– Free from quality degradation due to source modeling
– Low-frequency waveform is exactly reproduced
– Spectral information of the entire frequency range is preserved
– No need for pitch period estimation or voiced/unvoiced decision
Lec2 – Introduction 2/4/04 [49]
UMCP ENEE408G Slides (created by R. Liu & M.Wu © 2002)
Code-Excited Linear Predictive Coding (CELP)

Multipulse-Excited Linear Predictive Coding (MPC)
– Do not distinguish voiced/unvoiced sound explicitly

Code-Excited Linear Predictive Coding (CELP)
– Replace the multi-pulses of MPC with vector-quantized sequences based on long-term prediction of periodicity and short-term prediction
Figure 6.32 of Furui’s book
Table 6.1 of Furui’s book
Speech Coding Methods
– Waveform coding; Hybrid coding; Analysis-synthesis coding
Speech Quality vs. Transmission Rate
Figure 6.2 of Furui’s book
Comparison of Different Speech Coding Tech.
Table 6.2 of Furui’s book
Put Together: A Digital Telephone System
– 8kHz and 8-bit per sample for telephone speech => 64kbps
– Anti-aliasing filter before sampling
– Non-uniform quantization (e.g., through μ-law or A-law companding ~ signal compression-expansion)
Speech Synthesis
Figure 7.2 of Furui’s book
Speech Synthesis

Speech synthesis: a process that artificially produces speech
– Articulatory synthesis, Formant synthesis, and LPC synthesis
– Issues other than synthesizer structure: text analysis, etc.
Table 7.1 of Furui’s book
UMCP ENEE408G Slides (created by R.Liu © 2002)
Comparison of Synthesis Methods
Figure 7.8 of Furui’s book
Text-to-Speech Conversion System
=> See more in Design Project and try it out
Analysis/Synthesis
Naturally spoken utterance
Synthesized utterance
Human Computer Interface/Interaction (HCI)

Multi-modal multimedia communications and interactions
– Info. & interface through speech/audio, image/video, graphics, etc.

Building blocks for speech based HCI
– Speech recognition and speaker identification
– Natural language understanding
– (Speech synthesis)
– Examples
  Voice command, dictation
  Question-and-Answer: for intelligent customer service, voice-based info. retrieval, call routing, ...
  Enhance speech-based HCI with graphics: “talking head”
=> See more in Design Project and try it out
Summary

Speech production and analysis
– Spectrogram; Pitch, Formant
– Linear prediction model

Speech coding
– Basic compression tools

Speech Synthesis

This week’s Lab session:
– Design project#1 on Speech

Next lecture: speech recognition
Assignments

“The Past, Present, and Future of Speech Processing”

“Talk to the Machine”