Inharmonic Speech: A Tool for the Study of Speech Perception and Separation

Josh McDermott, NYU/MIT, jhm@cns.nyu.edu
Dan Ellis, Columbia Univ., dpwe@ee.columbia.edu
Hideki Kawahara, Wakayama Univ., kawahara@sys.wakayama-u.ac.jp
1. What is inharmonic speech?
2. Why make inharmonic speech?
3. How to make inharmonic speech
4. Psychoacoustic experiments
Inharmonic Speech - McDermott, Ellis, Kawahara
2012-09-08
1 /12
The Structure of Speech
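• Classic source/filter model
[Figure: source/filter block diagram. Source s(t): glottal pulse train (pitch) for voiced speech, frication noise for unvoiced speech, with a voiced/unvoiced switch. Filter (spectral envelope): vocal tract resonances (formants) plus radiation characteristic. Output: speech.]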
Harmonic Speech
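• Periodic source pulses as a Fourier series:
$\sum_{n} \delta(t - n T_0) = \frac{1}{T_0}\left(1 + 2\sum_{k=1}^{\infty} \cos\frac{2\pi k t}{T_0}\right)$, with $T_0 = 1/f_0$
[Figure: pulse train built up from its cosine components, and the resulting harmonic source; axes: freq / kHz, time / s]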
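The Fourier-series view of the periodic pulse train can be checked numerically. The sketch below sums the first few cosine terms of a unit-area impulse train with period 1/f0. The function name `pulse_train_series` and its parameters are hypothetical names chosen for illustration; they are not from the slides.

```python
import math

def pulse_train_series(t, f0, num_harmonics):
    """Truncated Fourier series of a unit-area impulse train with period 1/f0.

    Evaluates f0 * (1 + 2 * sum_{k=1..K} cos(2*pi*k*f0*t)).
    """
    total = 1.0
    for k in range(1, num_harmonics + 1):
        total += 2.0 * math.cos(2.0 * math.pi * k * f0 * t)
    return f0 * total

if __name__ == "__main__":
    f0 = 100.0  # Hz, assumed example value
    for t in (0.0, 0.0025, 0.005, 0.01):
        print(t, pulse_train_series(t, f0, 20))
```

At pulse times (t = 0, 1/f0, ...) all cosines add in phase, so the sum grows with the number of harmonics; between pulses the terms largely cancel.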
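Inharmonic Speech
• Source as a sum of sinusoids: $s(t) = \sum_{n=1}^{N} \cos(2\pi f_n t)$
• Harmonic: $f_n = n f_0$, so $f_{n+1} - f_n = f_0$
• Shifted by $0.3 f_0$: $f_n = n f_0 + a f_0$ (here $a = 0.3$), so $f_{n+1} - f_n = f_0$
• Stretched by $0.075\, n(n-1) f_0$: $f_n = n f_0 + b(n^2 - n) f_0$ (here $b = 0.075$), so $f_{n+1} - f_n = (1 + 2bn) f_0$
• Jittered by $0.3\,[-1..1]\, f_0$: $f_n = n f_0 + c\, r_n f_0$ with $r_n \in [-1, 1]$ (here $c = 0.3$), so $f_{n+1} - f_n = (1 + c(r_{n+1} - r_n)) f_0$
[Figure: spectrograms of the harmonic, shifted, stretched, and jittered sources; freq / kHz vs. time / s]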
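The four partial-frequency rules on this slide (harmonic, shifted, stretched, jittered) can be written out directly. The following is a minimal sketch, assuming a constant f0 and a uniform random draw for the jitter term; `partial_frequencies`, `excitation_sample`, and the `amount` and `seed` parameters are hypothetical names introduced here, not part of the authors' code.

```python
import math
import random

def partial_frequencies(f0, num_partials, kind="harmonic", amount=0.0, seed=0):
    """Return [f_1, ..., f_N] in Hz for one of the slide's four source types.

    amount plays the role of a (shifted), b (stretched), or c (jittered).
    """
    rng = random.Random(seed)  # seeded so the jitter pattern is repeatable
    freqs = []
    for n in range(1, num_partials + 1):
        if kind == "harmonic":
            offset = 0.0
        elif kind == "shifted":
            offset = amount                      # f_n = n f0 + a f0
        elif kind == "stretched":
            offset = amount * (n * n - n)        # f_n = n f0 + b (n^2 - n) f0
        elif kind == "jittered":
            offset = amount * rng.uniform(-1.0, 1.0)  # f_n = n f0 + c r_n f0
        else:
            raise ValueError("unknown kind: " + kind)
        freqs.append((n + offset) * f0)
    return freqs

def excitation_sample(freqs, t):
    """Evaluate s(t) = sum_n cos(2 pi f_n t) at a single time t (seconds)."""
    return sum(math.cos(2.0 * math.pi * f * t) for f in freqs)

if __name__ == "__main__":
    print(partial_frequencies(100.0, 5, "shifted", 0.3))
    print(partial_frequencies(100.0, 5, "stretched", 0.075))
    print(partial_frequencies(100.0, 5, "jittered", 0.3))
```

Only the harmonic case yields a waveform that repeats every 1/f0 seconds; the other three keep a sparse set of partials but break that common periodicity.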
Why Inharmonic Speech?
• Harmonicity is believed important
.. to the fusion of sounds in auditory organization
.. for pitch perception (prosody, speaker identity)
• Voiced speech has...
multiple (resolved) harmonics = “sparse” spectrum
.. with similar modulation properties
.. in a harmonic pattern
• How important is the “harmonic pattern”?
See how well people (& machines)
can organize and separate inharmonic speech
.. which is otherwise “natural”
maybe it’s enough to have a “sparse” spectrum?
Harmonicity for Separation
• Filtering of harmonics after f0 is found
Denbigh & Zhao 1992
Avery Wang 1995
• Labeling of regions by shared f0 candidate
Brown 1992
Hu & Wang 2004
[Figure: spectrograms (freq / kHz vs. time / s) and a block diagram: input mixture, front end, signal features (maps: onset, period, frq.mod), object formation, discrete objects, grouping rules, source groups]
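To make "filtering of harmonics after f0 is found" concrete, one simple picture is a binary mask that keeps only the frequency bins lying close to a multiple of f0. This is an illustrative sketch only, not the method of any of the cited papers; `harmonic_mask`, `bin_freqs`, and `tol_hz` are hypothetical names.

```python
def harmonic_mask(bin_freqs, f0, tol_hz):
    """Mark each frequency bin True if it lies within tol_hz of some n * f0, n >= 1."""
    mask = []
    for f in bin_freqs:
        n = max(1, round(f / f0))          # index of the nearest harmonic
        mask.append(abs(f - n * f0) <= tol_hz)
    return mask

if __name__ == "__main__":
    bins = [0.0, 50.0, 100.0, 150.0, 200.0, 205.0]  # Hz, assumed example bins
    print(harmonic_mask(bins, 100.0, 10.0))
```

A mask of this kind assumes f_n = n f0, so shifted, stretched, or jittered partials would fall outside it.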
Synthesizing Inharmonic Speech
• Based on STRAIGHT
Kawahara 1999, 2006 ...
decompose speech into:
- f0 (pitch track)
- periodic envelope (voiced speech)
- noise envelope (unvoiced speech component)
[Figure: analysis block diagram with blocks "f0, aperiodicity estimation", "pitch-adaptive envelope estimation", and "aperiodicity voiced/unvoiced allocation"; input: speech; outputs: f0 (f0 / Hz), periodic envelope and noise envelope (freq / kHz)]
STRAIGHT Synthesis
• STRAIGHT periodic source resynthesis
... as individual pitch pulses
... or as a set of Fourier components
- which can be made inharmonic
[Figure: synthesis block diagram. f0 and inharmonicity parameters drive the deterministic excitation, which passes through spectral envelope filtering with the periodic envelope; random excitation passes through spectral envelope filtering with the noise envelope; the two branches are summed (+). Spectrogram panels: freq / kHz vs. time / s]
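Synthesizing the periodic source "as a set of Fourier components" with a time-varying f0 amounts to running one oscillator per partial and accumulating its phase sample by sample. The sketch below covers only that excitation step, under the assumption of a per-sample f0 track; it does not reproduce STRAIGHT or the spectral envelope filtering. `synthesize_excitation` and `partial_freq` are hypothetical names.

```python
import math

def synthesize_excitation(f0_track, partial_freq, num_partials, sample_rate):
    """Sum of cosine partials whose frequencies follow a per-sample f0 track.

    f0_track     : list of f0 values in Hz, one per output sample
    partial_freq : callable (n, f0) -> frequency of partial n in Hz
    """
    phases = [0.0] * num_partials
    out = []
    for f0 in f0_track:
        sample = 0.0
        for i in range(num_partials):
            f = partial_freq(i + 1, f0)
            if 0.0 < f < sample_rate / 2.0:   # skip partials at or above Nyquist
                sample += math.cos(phases[i])
            # integrate instantaneous frequency to get phase
            phases[i] += 2.0 * math.pi * f / sample_rate
        out.append(sample)
    return out

if __name__ == "__main__":
    track = [100.0 + 0.01 * i for i in range(8000)]    # slowly rising f0, assumed
    harmonic = synthesize_excitation(track, lambda n, f0: n * f0, 10, 8000)
    shifted = synthesize_excitation(track, lambda n, f0: (n + 0.3) * f0, 10, 8000)
    print(len(harmonic), len(shifted))
```

Passing a different `partial_freq` rule is the only change needed to move from a harmonic to an inharmonic excitation, which is the appeal of the Fourier-component formulation.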
Inharmonic Speech Properties
• Periodicity index
calculated by Praat
histogram over 76 TIMIT utterances
[Figure: histograms of periodicity for Original, Harmonic Synth, Jittered Synth, Stretched Synth, Shifted Synth, and Whispered Synth; x-axis: Periodicity (Normalized Autocorrelation Peak Height), 0 to 1; y-axis: Probability of Occurrence]
• Pitch tracks
calculated by Praat
for an example utterance
[Figure: pitch tracks for Original, Harmonic Synth, Jittered Synth, and Shifted Synth; x-axis: Time (s); y-axis: f0 (Hz)]
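The periodicity axis on this slide is a normalized autocorrelation peak height. A simplified version of that quantity can be sketched as below; this is not Praat's algorithm (which applies its own windowing and corrections), and `periodicity`, `min_lag`, and `max_lag` are hypothetical names.

```python
import math
import random

def periodicity(frame, min_lag, max_lag):
    """Highest normalized correlation between a frame and its lagged copy.

    For each lag, correlates frame[0:N-lag] with frame[lag:N] and divides by
    the geometric mean of the two segment energies, so the result is <= 1.
    """
    best = 0.0
    n_total = len(frame)
    for lag in range(min_lag, max_lag + 1):
        n = n_total - lag
        if n <= 0:
            break
        ac = sum(frame[i] * frame[i + lag] for i in range(n))
        e1 = sum(frame[i] * frame[i] for i in range(n))
        e2 = sum(frame[i + lag] * frame[i + lag] for i in range(n))
        if e1 > 0.0 and e2 > 0.0:
            best = max(best, ac / math.sqrt(e1 * e2))
    return best

if __name__ == "__main__":
    tone = [math.cos(2.0 * math.pi * i / 80.0) for i in range(800)]
    rng = random.Random(0)
    noise = [rng.gauss(0.0, 1.0) for _ in range(800)]
    print(periodicity(tone, 40, 160), periodicity(noise, 40, 160))
```

A strictly periodic frame scores close to 1, while noise-like (whispered) frames score near 0; inharmonic sources would be expected to land in between.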
Psychological Experiment
• Idea: See impact of removing harmonicity
on ability to understand mixed words
• Listeners presented with one or two
simultaneous words or utterances
synthesized as harmonic, inharmonic, or whisper
measure accuracy at identifying all words
[Figure: four spectrogram panels titled "2spkr female whisp ex7", "2spkr female jit ex7", "2spkr female shift ex7", and "2spkr female harm ex7"; freq / Hz (0 to 2000) vs. time / s]
Results
• Harmonic tokens a little easier to understand
but inharmonic tokens much better than whispered
different types of inharmonicity seem equivalent
Spectral sparsity is a big contributor to separation?
[Figure: bar charts titled "Average Number of Correct Words - 6 Subjects"; y-axis: Mean # Correct Words (0 to 1); conditions: Orig, Harm, Inh-Shi, Inh-Jit, Whisp; panels: Single Word, Mixture of Two Words]
Conclusions
• Harmonicity of voice is thought to be
important for auditory scene analysis
but hard to separate harmonicity and sparsity
• Modified STRAIGHT framework produces
high-quality inharmonic tokens
excitation synthesized as sinusoids with arbitrary
frequency tracks
• Preliminary experiments show that
inharmonic tokens can still be separated
quantify contribution of harmonicity vs. sparsity