Inharmonic Speech: A Tool for the Study of Speech Perception and Separation Josh McDermott NYU/MIT jhm@cns.nyu.edu 1. 2. 3. 4. Dan Ellis Hideki Kawahara dpwe@ee.columbia.edu kawahara@sys.wakayama-u.ac.jp Columbia Univ. Wakayama Univ. What is inharmonic speech? Why make inharmonic speech? How to make inharmonic speech Psychoacoustic experiments Inharmonic Speech - McDermott, Ellis, Kawahara 2012-09-08 1 /12 The Structure of Speech • Classic source/filter model Formants Pitch t Glottal pulse train Voiced/ unvoiced Vocal tract resonances + Radiation characteristic Speech f Frication noise t Source s(t) Filter (spectral envelope) Inharmonic Speech - McDermott, Ellis, Kawahara 2012-09-08 2 /12 Harmonic Speech • Periodic source pulses as a Fourier series: (t n )= 1 n= 1+ 2 cos k 2 1 t 0.5 0 −0.5 k=1 0 0.5 1 1.5 2 2.5 1 0.5 0 −0.5 freq / kHz −1 2 1.5 1 0.5 0 0 0.05 0.1 0.15 Inharmonic Speech - McDermott, Ellis, Kawahara 0.2 0.25 time / s 0.3 2012-09-08 3 /12 freq / kHz Inharmonic Speech 2 Harmonic fn = nf0 fn+1 fn = f0 1 0 2 Shifted by 0.3 f0 fn = nf0 + af0 fn+1 fn = f0 1 0 2 Stretched by 0.075 n(n-1) f 0 1 0 2 Jittered by 0.3 [-1..1] f 0 1 0 0 0.5 source N s(t) = cos 2 fn t n=1 fn = nf0 + b(n2 n)f0 fn+1 fn = (1 + 2bn)f0 fn = nf0 + crn f0 rn [ 1 . . . 1] fn+1 fn = (1 + c rn )f0 1 time / s Inharmonic Speech - McDermott, Ellis, Kawahara 2012-09-08 4 /12 Why Inharmonic Speech? • Harmonicity is believed important .. to the fusion of sounds in auditory organization .. for pitch perception (prosody, speaker identity) • Voiced speech has... multiple (resolved) harmonics = “sparse” spectrum .. with similar modulation properties .. in a harmonic pattern • How important is the “harmonic pattern”? See how well people (& machines) can organize and separate inharmonic speech .. which is otherwise “natural” maybe it’s enough to have a “sparse” spectrum? Inharmonic Speech - McDermott, Ellis, Kawahara 2012-09-08 5 /12 freq / kHz • Harmonicity for Separation Filtering of harmonics 4 3 2 1 0 Denbigh & Zhao 1992 Avery Wang 1995 freq / kHz after f0 is found 4 3 2 1 0 • Labeling of input mixture regions by shared f0 candidate 0 1 Front end 2 signal features (maps) 3 4 Object formation 5 discrete objects 6 Grouping rules 7 time / s 8 Source groups freq Brown 1992 Hu & Wang 2004 time Inharmonic Speech - McDermott, Ellis, Kawahara onset period frq.mod 2012-09-08 6 /12 Synthesizing Inharmonic Speech • Based on STRAIGHT Kawahara 1999, 2006 ... pitch-adaptive envelope estimation 250 200 aperiodicity voiced/unvoiced allocation Inharmonic Speech - McDermott, Ellis, Kawahara 150 periodic envelope 100 freq / kHz f0, aperiodicity estimation 300 3 2 1 0 noise envelope freq / kHz Speech f0 f0 / Hz decompose speech into: - f0 (pitch track) - periodic envelope (voiced speech) - noise envelope (unvoiced speech component) 3 2 1 0 0.5 1 2012-09-08 1.5 7 /12 2 STRAIGHT Synthesis • STRAIGHT periodic source resynthesis inharmonicity parameters 4 3 2 1 0 freq / kHz spectral envelope filtering periodic envelope spectral envelope filtering noise envelope random excitation 4 3 2 1 0 + freq / kHz deterministic excitation 4 3 2 1 0 0 freq / kHz f0 freq / kHz ... as individual pitch pulses ... or as a set of Fourier components - which can be made inharmonic 0.5 1 1.5 2 2.5 time / s 3 2 1 0 Inharmonic Speech - McDermott, Ellis, Kawahara 2012-09-08 8 /12 Inharmonic Speech Properties • Periodicity index 0.14 Original Harmonic Synth Jittered Synth Stretched Synth Shifted Synth Whispered Synth calculated by Praat histogram over 76 TIMIT utterances Probabiilty of Occurence 0.12 0.1 0.08 0.06 0.04 0.02 0 0 300 Original Harmonic Synth Jittered Synth Shifted Synth f0 (Hz) 250 200 150 100 50 0 0.5 1 1.5 2 2.5 Time (s) Inharmonic Speech - McDermott, Ellis, Kawahara 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Periodicity (Normalized Autocorrelation Peak Height) • Pitch tracks calculated by Praat for an example utterance 2012-09-08 9 /12 Psychological Experiment • Idea: See impact of removing harmonicity on ability to understand mixed words • Listeners presented with one or two simultaneous words or utterances freq / Hz freq / Hz measure accuracy at identifying all words synthesized as harmonic, inharmonic, or whisper 2000 2spkr female whisp ex7 2spkr female jit ex7 2spkr female shift ex7 2spkr female harm ex7 1500 1000 500 0 2000 1500 1000 500 0 4.5 5 Inharmonic Speech - McDermott, Ellis, Kawahara 5.5 time / s 6 4.5 5 5.5 time / s 2012-09-08 10/12 6 Results • Harmonic tokens a little easier to understand but inharmonic tokens much better than whispered different types of inharmonicity seem equivalent Spectral sparsity is a big contributor to separation? Average Number of Correct Words − 6 Subjects 1 1 0.8 0.8 Mean # Correct Words Mean # Correct Words Average Number of Correct Words − 6 Subjects 0.6 0.4 0.2 0 Single Word Mixture of Two Words Orig Harm Inh−Shi Inh−Jit Whisp Inharmonic Speech - McDermott, Ellis, Kawahara 0.6 0.4 0.2 0 Orig Harm Inh−Jit Inh−Shi Whisp Single Word Mixture of Two Words 2012-09-08 11/12 Conclusions • Harmonicity of voice is thought to be important for auditory scene analysis but hard to separate harmonicity and sparsity • Modified STRAIGHT framework produces high-quality inharmonic tokens excitation synthesized as sinusoids with arbitrary frequency tracks • Preliminary experiments show that inharmonic tokens can still be separated quantify contribution of harmonicity vs. sparsity Inharmonic Speech - McDermott, Ellis, Kawahara 2012-09-08 12/12