Lecture 5: Speech modeling

EE E6820: Speech & Audio Processing & Recognition
Lecture 5: Speech modeling
Dan Ellis <dpwe@ee.columbia.edu>
Michael Mandel <mim@ee.columbia.edu>
Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/~dpwe/e6820
February 19, 2009
Outline

1. Modeling speech signals
2. Spectral and cepstral models
3. Linear predictive models (LPC)
4. Other signal models
5. Speech synthesis
The speech signal
[Spectrogram with phonetic annotation of the phrase ". . . has a watch thin as a dime"]

Elements of the speech signal
- spectral resonances (formants, moving)
- periodic excitation (voicing, pitched) + pitch contour
- noise excitation
- transients (stop-release bursts)
- amplitude modulation (nasals, approximants)
- timing!
The source-filter model
Notional separation of
- source: excitation, fine time-frequency structure
- filter: resonance, broad spectral structure

[Block diagram: a glottal pulse train (with pitch contour) and frication noise, selected by a voiced/unvoiced switch, drive the vocal tract resonances (formants) and a radiation characteristic to produce speech]
More a modeling approach than a single model
Signal modeling
Signal models are a kind of representation
- to make some aspect explicit
- for efficiency
- for flexibility

Nature of model depends on goal
- classification: remove irrelevant details
- coding/transmission: remove perceptual irrelevance
- modification: isolate control parameters

But commonalities emerge
- perceptually irrelevant detail (coding) will also be irrelevant for classification
- modification domain will usually reflect 'independent' perceptual attributes
- getting at the abstract information in the signal
Different influences for signal models
Receiver
- see how the signal is treated by listeners
  → cochlea-style filterbank models . . .

Transmitter (source)
- physical vocal apparatus can generate only a limited range of signals . . .
  → LPC models of vocal tract resonances

Making explicit particular aspects
- compact, separable correlates of resonances → cepstrum
- modeling prominent features of the NB spectrogram → sinusoid models
- addressing unnaturalness in synthesis → harmonic+noise model
Application of (speech) signal models
Classification / matching — goal: highlight important information
- speech recognition (lexical content)
- speaker recognition (identity or class)
- other signal classification
- content-based retrieval

Coding / transmission / storage — goal: represent just enough information
- real-time transmission, e.g. mobile phones
- archive storage, e.g. voicemail

Modification / synthesis — goal: change certain parts independently
- speech synthesis / text-to-speech (change the words)
- speech transformation / disguise (change the speaker)
Spectral and cepstral models
Spectrogram seems like a good representation
- long history
- satisfying in use
- experts can 'read' the speech

What is the information?
- intensity in time-frequency cells
- typically 5 ms × 200 Hz × 50 dB

→ Discarded detail:
- phase
- fine-scale timing

The starting point for other representations
Short-time Fourier transform (STFT) as filterbank
View spectrogram rows as coming from separate bandpass filters applied to the sound.

Mathematically:

$$X[k, n_0] = \sum_n x[n]\, w[n - n_0]\, \exp\left(-j\,\frac{2\pi k (n - n_0)}{N}\right) = \sum_n x[n]\, h_k[n_0 - n]$$

where $h_k[n] = w[-n]\, \exp\left(j\,\frac{2\pi k n}{N}\right)$, so bin $k$ has frequency response $H_k(e^{j\omega}) = W(e^{j(\omega - 2\pi k/N)})$: the window spectrum shifted to center frequency $2\pi k / N$.
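As a sanity check (a sketch of ours, not from the slides), a minimal numpy demo that computes one STFT bin both ways, assuming a hop of one sample; the two views produce identical numbers:

```python
import numpy as np

def stft_bin_two_ways(x, w, k, N):
    """One STFT bin as (1) a hopped windowed DFT and (2) the output of
    the complex bandpass filter h_k[n] = w[-n] exp(j 2 pi k n / N)."""
    L = len(w)
    m = np.arange(L)
    # (1) windowed-DFT view: X[k, n0] = sum_m x[n0+m] w[m] e^{-j 2 pi k m / N}
    dft_view = np.array([
        np.dot(x[n0:n0 + L] * w, np.exp(-2j * np.pi * k * m / N))
        for n0 in range(len(x) - L + 1)
    ])
    # (2) filterbank view: h_k has support n = -(L-1)..0, so shift it
    #     to start at 0 before calling np.convolve
    h_k = w[::-1] * np.exp(2j * np.pi * k * np.arange(-(L - 1), 1) / N)
    filt_view = np.convolve(x, h_k, mode='valid')
    return dft_view, filt_view

x = np.random.randn(1000)
w = np.hanning(128)
a, b = stft_bin_two_ways(x, w, k=10, N=128)
print(np.allclose(a, b))   # True: same numbers, two interpretations
```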
Spectral models: which bandpass filters?
Constant bandwidth? (analog / FFT)
But: cochlea physiology suggests critical bandwidths
→ implement ear models with bandpass filters and choose bandwidths by e.g. critical-band estimates

Auditory frequency scales (common formulas sketched below)
- constant 'Q' (center freq / bandwidth), mel, Bark, . . .
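For concreteness, two standard closed-form approximations to these scales (O'Shaughnessy's mel formula and Traunmüller's Bark approximation; the helper names are ours):

```python
import numpy as np

def hz_to_mel(f):
    """O'Shaughnessy's mel formula (one common variant)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def hz_to_bark(f):
    """Traunmueller's approximation to the Bark (critical-band) scale."""
    f = np.asarray(f, dtype=float)
    return 26.81 * f / (1960.0 + f) - 0.53

print(hz_to_mel([100, 1000, 4000]))   # mel grows ~logarithmically above ~1 kHz
```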
Gammatone filterbank
Given bandwidths, which filter shapes?
- match inferred temporal integration window
- match inferred spectral shape (sharp high-freq slope)
- keep it simple (since it's only approximate)

→ Gammatone filters
- 2N poles, 2 zeros, low complexity
- reasonable linear match to cochlea

$$h[n] = n^{N-1}\, e^{-bn} \cos(\omega_i n)$$
[Figure: gammatone impulse response, pole-zero plot, and filterbank magnitude responses (dB) over 50 Hz-5 kHz]
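A minimal sketch (ours, not the slides') of a sampled gammatone impulse response, assuming the common choice of bandwidth b = 1.019 × ERB(fc) with the Glasberg & Moore ERB formula; the function name and defaults are illustrative:

```python
import numpy as np

def gammatone_ir(fc, fs, order=4, dur=0.025):
    """Sampled gammatone IR: t^(N-1) e^{-2 pi b t} cos(2 pi fc t)."""
    t = np.arange(int(dur * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)   # equivalent rectangular BW, Hz
    b = 1.019 * erb                           # usual gammatone scaling
    h = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return h / np.sqrt(np.sum(h ** 2))        # unit-energy normalization
```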
Constant-BW vs. cochlea model
Frequency responses and spectrograms compared:

[Figure: frequency responses (gain in dB, 0-8 kHz) of the effective FFT filterbank vs. a gammatone filterbank; spectrograms of the same utterance from an FFT-based wideband analysis (N = 128) vs. a Q = 4, 4-pole 2-zero cochlea model downsampled at 64; magnitude smoothed over a 5-20 ms time window]
Limitations of spectral models
Not much data thrown away
- just fine phase / time structure (smoothing)
- little actual 'modeling'
- still a large representation

Little separation of features
- e.g. formants and pitch

Highly correlated features
- modifications affect multiple parameters

But, quite easy to reconstruct
- iterative reconstruction of lost phase
The cepstrum
Original motivation: assume a source-filter model:

[Block diagram: excitation source $g[n]$ convolved with resonance filter $H(e^{j\omega})$ produces the speech waveform]

Define 'homomorphic deconvolution':
- source-filter convolution: $g[n] * h[n]$
- FT → product: $G(e^{j\omega})\, H(e^{j\omega})$
- log → sum: $\log G(e^{j\omega}) + \log H(e^{j\omega})$
- IFT → separate fine structure: $c_g[n] + c_h[n]$ = deconvolution

Definition: the real cepstrum is $c_n = \mathrm{idft}(\log |\mathrm{dft}(x[n])|)$
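The definition translates directly into a few lines of numpy; the small floor inside the log is our addition to avoid log(0):

```python
import numpy as np

def real_cepstrum(x, n_fft=None):
    """c[n] = idft(log |dft(x)|) -- the real cepstrum of a frame."""
    X = np.fft.fft(x, n_fft)
    return np.fft.ifft(np.log(np.abs(X) + 1e-12)).real
```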
Stages in cepstral deconvolution
- Original waveform has excitation fine structure convolved with resonances
- DFT shows harmonics modulated by resonances
- Log DFT is the sum of a harmonic 'comb' and resonant bumps
- IDFT separates out resonant bumps (low quefrency) from the regular fine structure (the 'pitch pulse')
- Selecting the low-n cepstrum isolates the resonance information (deconvolution / 'liftering')

[Figure: waveform and minimum-phase IR; abs(DFT) and liftered version; log(abs(DFT)) and liftered version; real cepstrum with lifter window and pitch pulse marked, vs. quefrency]
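A sketch of the liftering step under these assumptions: keep n_keep low-quefrency coefficients (plus the symmetric negative quefrencies, since the cepstrum of a real frame is even) and transform back to get the smoothed log-magnitude envelope:

```python
import numpy as np

def liftered_envelope(x, n_keep=30, n_fft=1024):
    """Zero out high-quefrency cepstral coefficients, then return to the
    frequency domain: a smoothed log-magnitude spectrum (the resonances)."""
    log_mag = np.log(np.abs(np.fft.fft(x, n_fft)) + 1e-12)
    ceps = np.fft.ifft(log_mag).real
    lifter = np.zeros(n_fft)
    lifter[:n_keep] = 1.0
    lifter[-(n_keep - 1):] = 1.0    # keep symmetric negative quefrencies
    return np.fft.fft(ceps * lifter).real
```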
Properties of the cepstrum
Separate source (fine structure) from filter (broad structure)
- smooth the log magnitude spectrum to get resonances

Smoothing the spectrum is filtering along frequency
- i.e. convolution applied in the Fourier domain
- → multiplication in the IFT ('liftering')

Periodicity in time → harmonics in spectrum → 'pitch pulse' in the high-n cepstrum

Low-n cepstral coefficients are the DCT of the broad filter / resonance shape:

$$c_n = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log X(e^{j\omega})\, (\cos n\omega + j \sin n\omega)\, d\omega$$

(for a real frame the log magnitude is even in $\omega$, so only the cosine terms survive)

[Figure: cepstral coefficients 0..5 and the corresponding 5th-order cepstral reconstruction of the spectral envelope, 0-7 kHz]
Aside: correlation of elements
Cepstrum is popular in speech recognition
- feature vector elements are decorrelated

[Figure: auditory-spectrum vs. cepstral-coefficient features over ~150 frames, their covariance matrices, and an example joint distribution of elements (10, 15)]

- $c_0$ 'normalizes out' average log energy

Decorrelated pdfs fit diagonal Gaussians
- simple correlation is a waste of parameters
- DCT is close to PCA for (mel) spectra?
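A hedged illustration of the decorrelating transform: a DCT along the frequency axis of log-(mel-)spectral frames, as in MFCC front ends; the function name and array shapes are assumptions:

```python
import numpy as np
from scipy.fftpack import dct

def cepstra_from_log_spectra(log_spec, n_ceps=13):
    """DCT along frequency approximately decorrelates log-spectral
    features; keep the first n_ceps coefficients.
    log_spec: array of shape (n_frames, n_bands)."""
    return dct(log_spec, type=2, norm='ortho', axis=-1)[:, :n_ceps]
```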
Linear predictive modeling (LPC)
LPC is a very successful speech model
- it is mathematically efficient (IIR filters)
- it is remarkably accurate for voice (fits the source-filter distinction)
- it has a satisfying physical interpretation (resonances)

Basic math: model the output as a linear function of prior outputs:

$$s[n] = \sum_{k=1}^{p} a_k\, s[n-k] + e[n]$$

- . . . hence "linear prediction" ($p$th order)
- $e[n]$ is the excitation (input), AKA prediction error

$$\Rightarrow \quad \frac{S(z)}{E(z)} = \frac{1}{A(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}}$$

- . . . all-pole modeling, 'autoregression' (AR) model
Vocal tract motivation for LPC
Direct expression of the source-filter model:

$$s[n] = \sum_{k=1}^{p} a_k\, s[n-k] + e[n]$$

[Block diagram: pulse/noise excitation $e[n]$ → vocal tract $H(z) = 1/A(z)$ → $s[n]$; pole positions in the z-plane shape $|H(e^{j\omega})|$]

Acoustic tube models suggest an all-pole model for the vocal tract

Relatively slowly-changing
- update $A(z)$ every 10-20 ms

Not perfect: nasals introduce zeros
Estimating LPC parameters
Minimize the short-time squared prediction error

$$E = \sum_{n=1}^{m} e^2[n] = \sum_n \left( s[n] - \sum_{k=1}^{p} a_k\, s[n-k] \right)^2$$

Differentiate w.r.t. $a_k$ to get an equation for each $k$:

$$0 = \sum_n 2 \left( s[n] - \sum_{j=1}^{p} a_j\, s[n-j] \right) (-s[n-k])$$

$$\sum_n s[n]\, s[n-k] = \sum_j a_j \sum_n s[n-j]\, s[n-k]$$

$$\phi(0, k) = \sum_j a_j\, \phi(j, k)$$

where $\phi(j, k) = \sum_{n=1}^{m} s[n-j]\, s[n-k]$ are correlation coefficients
- $p$ linear equations to solve for all the $a_j$s . . .
Evaluating parameters
Linear equations $\phi(0, k) = \sum_{j=1}^{p} a_j\, \phi(j, k)$

If $s[n]$ is assumed to be zero outside of some window:

$$\phi(j, k) = \sum_n s[n-j]\, s[n-k] = r_{ss}(|j - k|)$$

- $r_{ss}(\tau)$ is the autocorrelation

Hence the equations become:

$$\begin{bmatrix} r(1) \\ r(2) \\ \vdots \\ r(p) \end{bmatrix} = \begin{bmatrix} r(0) & r(1) & \cdots & r(p-1) \\ r(1) & r(0) & \cdots & r(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ r(p-1) & r(p-2) & \cdots & r(0) \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix}$$

Toeplitz matrix (equal entries along each diagonal)
→ can use the Durbin recursion to solve
(Solve the full $\phi(j, k)$ system via Cholesky)
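A compact Levinson-Durbin sketch of the autocorrelation method (our implementation, a sketch rather than reference code):

```python
import numpy as np

def lpc_autocorr(s, p):
    """Autocorrelation-method LPC: solve the Toeplitz normal equations
    with the Levinson-Durbin recursion. Returns a[1..p] such that
    s[n] ~ sum_k a[k] s[n-k], plus the final prediction-error energy."""
    r = np.correlate(s, s, mode='full')[len(s) - 1 : len(s) + p]  # r(0)..r(p)
    a = np.zeros(p + 1)          # prediction-error filter A(z); a[0] = 1
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err   # reflection coeff
        a[1:i] += k * a[i - 1:0:-1]                         # order update
        a[i] = k
        err *= 1.0 - k * k
    # predictor coefficients are the negated error-filter taps
    return -a[1:], err
```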
LPC illustration
[Figure: windowed original waveform and LPC residual (time / samples); original spectrum, LPC spectrum, and residual spectrum (dB vs. frequency, 0-7 kHz); actual pole positions in the z-plane]
Interpreting LPC
Picking out resonances
- if the signal really was a source + all-pole resonances, LPC should find the resonances

Least-squares fit to the spectrum
- minimizing $e^2[n]$ in the time domain is the same as minimizing $|E(e^{j\omega})|^2$, by Parseval
- → close fit to spectral peaks; valleys don't matter

Removing smooth variation in the spectrum
- $\frac{1}{A(z)}$ is a low-order approximation to $S(z)$
- hence the residual $E(z) = A(z)\,S(z)$ is a 'flat' version of $S$

Signal whitening
- white noise (independent $x[n]$s) has a flat spectrum
- → whitening removes temporal correlation
Alternative LPC representations
Many alternate p-dimensional representations
- coefficients $\{a_j\}$
- roots $\{\lambda_j\}$: $\prod_j (1 - \lambda_j z^{-1}) = 1 - \sum_j a_j z^{-j}$
- line spectrum frequencies . . .
- reflection coefficients $\{k_j\}$ from the lattice form
- tube-model log area ratios $g_j = \log \frac{1 - k_j}{1 + k_j}$

Choice depends on:
- mathematical convenience / complexity
- quantization sensitivity
- ease of guaranteeing stability
- what is made explicit
- distributions as statistics
LPC applications
Analysis-synthesis (coding, transmission)
- $S(z) = \frac{E(z)}{A(z)}$, hence can reconstruct by filtering $e[n]$ with the $\{a_j\}$s
- whitened, decorrelated, minimized $e[n]$s are easy to quantize
- . . . or can model $e[n]$ e.g. as a simple pulse train (sketched below)

Recognition / classification
- the LPC fit responds to spectral peaks (formants)
- can use for recognition (convert to cepstra?)

Modification
- separating source and filter supports cross-synthesis
- pole / resonance model supports 'warping', e.g. male → female
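A self-contained sketch of the analysis-synthesis loop under these assumptions: scipy's solve_toeplitz handles the normal equations from the previous slide, and the pulse train is a deliberately crude stand-in for a real glottal source (gain left unnormalized for brevity):

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_resynth(s, p=12, fs=8000, f0=100.0):
    """Fit an all-pole model by the autocorrelation method, then drive
    the synthesis filter 1/A(z) with a pulse-train excitation."""
    r = np.correlate(s, s, mode='full')[len(s) - 1 : len(s) + p]
    a = solve_toeplitz(r[:p], r[1:])            # Toeplitz system R a = r
    excitation = np.zeros(len(s))
    excitation[:: max(1, int(fs / f0))] = 1.0   # one pulse per pitch period
    # A(z) = 1 - sum_k a_k z^{-k}, so the IIR denominator is [1, -a_1, ...]
    return lfilter([1.0], np.concatenate(([1.0], -a)), excitation)
```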
Aside: Formant tracking
Formants carry (most?) linguistic information
- so why not classify them directly for speech recognition?
- e.g. local maxima in the cepstrally-liftered spectrum, or pole frequencies in the LPC fit

But: recognition needs to work in all circumstances
- formants can be obscured or undefined

[Figure: spectrograms (0-4 kHz) of the original (mpgr1_sx419) and a noise-excited LPC resynthesis with pole frequencies overlaid]

→ need more graceful, robust parameters
Sinusoid modeling
Early signal models required low complexity, e.g. LPC

Advances in hardware open new possibilities . . .

The NB spectrogram suggests a harmonics model

[Figure: narrowband spectrogram (0-4 kHz) showing near-horizontal harmonic tracks]

- 'important' info in the 2D surface is a set of tracks?
- harmonic tracks have ~smooth properties
- straightforward resynthesis
Sine wave models
Model sound as a sum of AM/FM sinusoids:

$$s[n] = \sum_{k=1}^{N[n]} A_k[n] \cos\left(n\,\omega_k[n] + \phi_k[n]\right)$$

- $A_k$, $\omega_k$, $\phi_k$ piecewise linear or constant
- can enforce harmonicity: $\omega_k = k\,\omega_0$
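A minimal resynthesis sketch (ours) for one such track, assuming per-frame amplitude and frequency values that are linearly interpolated to the sample rate, with the running phase obtained by integrating frequency:

```python
import numpy as np

def synth_track(amps, freqs_hz, fs, hop):
    """Resynthesize one AM/FM sinusoid track from per-frame amplitude
    and frequency (Hz) at a frame hop of `hop` samples."""
    n = (len(amps) - 1) * hop + 1
    frame_times = np.arange(len(amps)) * hop
    amp = np.interp(np.arange(n), frame_times, amps)
    freq = np.interp(np.arange(n), frame_times, freqs_hz)
    phase = 2 * np.pi * np.cumsum(freq) / fs   # integrate frequency
    return amp * np.cos(phase)

# A harmonic set would sum synth_track(A_k, k * f0_track, fs, hop) over k.
```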
Extract parameters directly from STFT frames:

[Figure: sinusoid tracks in time-frequency, with births and deaths as peaks appear and vanish]

- find local maxima of $|S[k, n]|$ along frequency
- track birth/death and correspondence
Finding sinusoid peaks
Look for local maxima along the DFT frame
- i.e. $|S[k-1, n]| < |S[k, n]| > |S[k+1, n]|$

Want the exact frequency of the implied sinusoid
- the DFT is normally quantized quite coarsely, e.g. 4000 Hz / 256 bands = 15.6 Hz/band
- quadratic fit to 3 points gives interpolated frequency and magnitude

[Figure: parabola fit through three spectral samples around a peak]

- may also need interpolated, unwrapped phase

Or, use the differential of phase along time (pvoc):

$$\omega = \frac{a\dot{b} - b\dot{a}}{a^2 + b^2} \quad \text{where } S[k, n] = a + jb$$
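Both steps fit in a few lines of numpy (the parabolic-interpolation formula is the standard log-magnitude fit; the epsilon floor is our addition):

```python
import numpy as np

def find_peaks(mag):
    """Bins where |S[k-1]| < |S[k]| > |S[k+1]| within one DFT frame."""
    return [k for k in range(1, len(mag) - 1)
            if mag[k - 1] < mag[k] > mag[k + 1]]

def interp_peak(mag, k):
    """Quadratic fit through log-magnitudes at bins k-1, k, k+1;
    returns the fractional-bin peak location and peak magnitude."""
    a, b, c = np.log(mag[k - 1:k + 2] + 1e-12)
    delta = 0.5 * (a - c) / (a - 2 * b + c)   # offset in (-0.5, 0.5)
    return k + delta, np.exp(b - 0.25 * (a - c) * delta)
```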
Sinewave modeling applications
Modification (interpolation) and synthesis
- connecting arbitrary $\omega$ and $\phi$ requires cubic phase interpolation (because $\omega = \dot{\phi}$)

Types of modification
- time and frequency scale modification, with or without changing the formant envelope
- concatenation / smoothing boundaries
- phase realignment (for crest reduction)

Non-harmonic signals? OK-ish

[Figure: spectrogram (0-4 kHz) of a non-harmonic sound represented with sinusoid tracks]
Harmonics + noise model
Motivation to improve the sinusoid model:
- problems with analysis of real (noisy) signals
- problems with synthesis quality (esp. noise)
- perceptual suspicions

Model:

$$s[n] = \underbrace{\sum_{k=1}^{N[n]} A_k[n] \cos\left(n k\, \omega_0[n]\right)}_{\text{Harmonics}} + \underbrace{e[n]\left(h_n[n] * b[n]\right)}_{\text{Noise}}$$

- sinusoids are forced to be harmonic
- the remainder is filtered and time-shaped noise

'Break frequency' $F_m[n]$ between H and N

[Figure: spectrum (dB vs. frequency, 0-3 kHz) split at the harmonicity limit $F_m[n]$: harmonics below, noise above]
HNM analysis and synthesis
Dynamically adjust $F_m[n]$ based on a 'harmonic test':

[Figure: spectrogram (0-4 kHz) with the estimated $F_m[n]$ contour overlaid]

Noise has envelopes in time $e[n]$ and frequency $H_n[k]$

[Figure: noise spectral envelope $H_n[k]$ (0-3 kHz) and temporal envelope $e[n]$ over ~30 ms]

- reconstruct bursts / synchronize to pitch pulses
Speech synthesis
One thing you can do with models

Synthesis easier than recognition?
- listeners do the work
- . . . but listeners are very critical

Overview of synthesis:

text → text normalization → phoneme generation → prosody generation → synthesis algorithm → speech

- normalization disambiguates text (abbreviations)
- phonetic realization from a pronunciation dictionary
- prosodic synthesis by rule (timing, pitch contour)
- . . . all control waveform generation
Source-filter synthesis
Flexibility of the source-filter model is ideal for speech synthesis

[Block diagram: pitch info and voiced/unvoiced decisions drive a glottal pulse source and a noise source; phoneme info ("th ax k ae t") drives the vocal tract filter; their combination produces speech]

Excitation source issues
- voiced / unvoiced / mixture ([th] etc.)
- pitch cycles of voiced segments
- glottal pulse shape → voice quality?
Vocal tract modeling
Simplest idea: store a single VT model for each phoneme

[Figure: fixed spectral templates vs. time for "th ax k ae t"]

- but discontinuities are very unnatural

Improve by smoothing between templates

[Figure: the same templates with smooth interpolation across boundaries]

- the trick is finding the right domain
Cepstrum-based synthesis
The low-n cepstrum is a compact model of the target spectrum

Can invert to get actual VT IR waveforms:

$$c_n = \mathrm{idft}(\log |\mathrm{dft}(x[n])|) \;\Rightarrow\; h[n] = \mathrm{idft}(\exp(\mathrm{dft}(c_n)))$$

All-zero (FIR) VT response
→ can pre-convolve with glottal pulses

[Figure: glottal pulses from an inventory, placed at pitch pulse times (from the pitch contour) and convolved with per-phone VT responses for "ee", "ae", "ah"]

- cross-fading between templates OK
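A direct transcription of the slide's inversion formula (function name and FFT size are illustrative); note that for a strict minimum-phase reconstruction one would double the positive-quefrency coefficients before the forward transform:

```python
import numpy as np

def vt_ir_from_cepstrum(ceps, n_fft=512):
    """Invert a low-n (zero-padded) cepstrum to an impulse response
    via h[n] = idft(exp(dft(c_n))), per the formula above."""
    log_spectrum = np.fft.fft(ceps, n_fft)
    return np.fft.ifft(np.exp(log_spectrum)).real

# e.g. keep c_0..c_29 of a frame's real cepstrum, then
# h = vt_ir_from_cepstrum(ceps[:30]); convolve glottal pulses with h.
```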
LPC-based synthesis
Very compact representation of target spectra
- 3 or 4 pole pairs per template

Low-order IIR filter → very efficient synthesis

How to interpolate?
- cannot just interpolate the $a_j$ in a running filter
- but the lattice filter has better-behaved interpolation

[Figure: direct-form all-pole filter (coefficients $a_1 \ldots a_p$) vs. lattice filter (reflection coefficients $k_0 \ldots k_{p-1}$)]

What to use for excitation?
- residual from the original analysis
- reconstructed periodic pulse train
- parametrized residual resynthesis
Diphone synthesis
Problems in phone-concatenation synthesis
- phonemes are context-dependent
- coarticulation is complex
- transitions are critical to perception
→ store transitions instead of just phonemes

[Figure: phone sequence vs. overlapping diphone segments for ". . . has a watch thin as a dime"]

- ~40 phones ⇒ ~800 diphones
- or even more context if we have a larger database

How to splice diphones together?
- TD-PSOLA: align pitch pulses and cross-fade
- MBROLA: normalized multiband
HNM synthesis
High-quality resynthesis of real diphone units, plus a parametric representation for modification
- pitch, timing modifications
- removal of discontinuities at boundaries

Synthesis procedure
- linguistic processing gives phones, pitch, timing
- database search gives the best-matching units
- use HNM to fine-tune pitch and timing
- cross-fade $A_k$ and $\omega_0$ parameters at boundaries

Careful preparation of the database is key
- sine models allow phase alignment of all units
- a larger database improves unit match
Generating prosody
The real factor limiting speech synthesis?

Waveform synthesizers have inputs for
- intensity (stress)
- duration (phrasing)
- fundamental frequency (pitch)

Curves produced by superposition of (many) inferred linguistic rules
- phrase-final lengthening, unstressed shortening, . . .

Or learn rules from transcribed elements
Summary
Range of models
- spectral, cepstral
- LPC, sinusoid, HNM

Range of applications
- general spectral shape (filterbank) → ASR
- precise description (LPC + residual) → coding
- pitch, time modification (HNM) → synthesis

Issues
- performance vs. computational complexity
- generality vs. accuracy
- representation size vs. quality

Parting thought
- not all parameters are created equal . . .
References
Alan V. Oppenheim. Speech analysis-synthesis system based on homomorphic filtering. The Journal of the Acoustical Society of America, 45(1):309, 1969.

J. Makhoul. Linear prediction: A tutorial review. Proceedings of the IEEE, 63(4):561-580, 1975.

Bishnu S. Atal and Suzanne L. Hanauer. Speech analysis and synthesis by linear prediction of the speech wave. The Journal of the Acoustical Society of America, 50(2B):637-655, 1971.

J. E. Markel and A. H. Gray. Linear Prediction of Speech. Springer-Verlag, Secaucus, NJ, USA, 1982.

R. McAulay and T. Quatieri. Speech analysis/synthesis based on a sinusoidal representation. IEEE Transactions on Acoustics, Speech, and Signal Processing, 34(4):744-754, 1986.

Wael Hamza, Ellen Eide, Raimo Bakis, Michael Picheny, and John Pitrelli. The IBM expressive speech synthesis system. In INTERSPEECH, pages 2577-2580, October 2004.