EE E6820: Speech & Audio Processing & Recognition
Lecture 1: Introduction & DSP

Dan Ellis <dpwe@ee.columbia.edu>
Mike Mandel <mim@ee.columbia.edu>
Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/~dpwe/e6820
January 22, 2009

Outline
1 Sound and information
2 Course structure
3 DSP review: Timescale modification

Sound and information
Sound is air pressure variation:
- mechanical vibration → pressure waves in air → motion of a sensor → time-varying voltage v(t)
Transducers convert air pressure ↔ voltage.

What use is sound?
[Figure: waveforms of two footsteps examples, 0-5 s]
Hearing confers an evolutionary advantage:
- useful information, complements vision
- ... at a distance, in the dark, around corners
- listeners are highly adapted to 'natural sounds' (including speech)

The scope of audio processing
[figure]

The acoustic communication chain
[Figure: chain message → signal → channel → receiver → decoder, annotated with synthesis, audio processing, and recognition]
Sound is an information bearer.
Received sound reflects the source(s) plus the effect of the environment (the channel).

Levels of abstraction
Much processing concerns shifting between levels of abstraction:
[Figure: levels from 'information' (abstract) through intermediate representations (e.g. time-frequency energy) down to sound p(t) (concrete), linked by analysis (upward) and synthesis (downward)]
Different representations serve different tasks: separating aspects, making things explicit, ...

Course structure
Goals:
- survey topics in sound analysis & processing
- develop an intuition for sound signals
- learn some specific technologies
Course structure:
- weekly assignments (25%)
- midterm event (25%)
- final project (50%)
Text: Speech and Audio Signal Processing, Ben Gold & Nelson Morgan. Wiley, 2000. ISBN: 0-471-35154-7.

Web-based
Course website: http://www.ee.columbia.edu/~dpwe/e6820/
- for lecture notes, problem sets, examples, ...
- plus student web pages for homework, etc.

Course outline
Fundamentals: L1: DSP; L2: Acoustics; L3: Auditory perception; L4: Pattern recognition
Audio processing: L5: Signal models; L6: Music analysis/synthesis; L7: Audio compression; L8: Spatial sound & rendering
Applications: L9: Speech recognition; L10: Music retrieval; L11: Signal separation; L12: Multimedia indexing

Weekly assignments
Research papers:
- journal & conference publications
- summarize & discuss in class
- written summaries on web page + Courseworks discussion
Practical experiments:
- Matlab-based (+ Signal Processing Toolbox)
- direct experience of sound processing
- skills for the project
Book sections

Final project
Most significant part of the course (50% of grade).
Oral proposals mid-semester; presentations in the final class + website.
Scope:
- practical (Matlab recommended)
- identify a problem; try some solutions
- evaluation
Topic:
- few restrictions within the world of audio
- investigate other resources
- develop in discussion with me
Citation & plagiarism

Examples of past projects
- Automatic prosody classification
- Model-based note transcription

DSP review: digital signals
Discrete-time sampling limits bandwidth:
  x_d[n] = Q(x_c(nT))
Discrete-level quantization limits dynamic range.
- T = sampling interval; sampling frequency Ω_T = 2π/T
- quantizer Q(y) = ε·round(y/ε), i.e. rounding to the nearest multiple of the step size ε
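To make the two operations concrete, a minimal Matlab sketch (illustrative only, not from the course materials; the tone frequency, rate, and word length are arbitrary choices):

    % Sampling and uniform quantization of a test tone.
    fs = 8000;                        % sampling frequency (Hz); T = 1/fs
    T  = 1/fs;
    n  = 0:fs-1;                      % one second of sample indices
    xc = @(t) 0.9*sin(2*pi*440*t);    % stand-in for the continuous signal xc(t)
    xd = xc(n*T);                     % sampling: xd[n] = xc(nT)

    nbits = 8;                        % quantizer word length
    ep = 2/2^nbits;                   % step size for a full-scale range [-1,1)
    xq = ep * round(xd/ep);           % Q(y): round to nearest multiple of ep

    soundsc(xq, fs);                  % listen: bandlimited to fs/2 = 4 kHz,
                                      % quantization noise roughly 6 dB per bit down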
The speech signal: time domain
Speech is a sequence of different sound types:
- vowel: periodic ("has")
- fricative: aperiodic ("watch")
- glide: smooth transition ("watch")
- stop burst: transient ("dime")
[Figure: waveform of "has a watch thin as a dime" (1.4-2.6 s), with close-ups of each sound type]

Timescale modification (TSM)
Can we modify a sound to make it 'slower'?
- i.e. speech pronounced more slowly, e.g. to help comprehension or analysis
- or more quickly, for 'speed listening'?
Why not just slow it down?
  x_s(t) = x_o(t/r),  r = slowdown factor (> 1 → slower)
- equivalent to playback at a different sampling rate
- but this scales time and frequency together, so the pitch drops by the same factor
[Figure: original waveform vs. the r = 2 (2x slower) version]
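The problem is easy to hear; a sketch, assuming some speech recording in a file speech.wav (a hypothetical filename):

    % Naive 'slowdown' by playback-rate change: the duration doubles,
    % but every frequency (including the pitch) halves.
    r = 2;                            % slowdown factor (> 1 -> slower)
    [x, fs] = wavread('speech.wav');  % hypothetical input file
    sound(x, fs);                     % original
    sound(x, fs/r);                   % 2x slower, pitch down an octave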
Time-domain TSM
Problem: we want to preserve local time structure but alter global time structure.
Repeat segments:
- but: artifacts from abrupt edges
Cross-fade & overlap:
  y^{(m)}[mL+n] = y^{(m-1)}[mL+n] + w[n] · x[⌊m/r⌋L + n]
[Figure: input waveform divided into segments 1-6; output built by repeating segments (1 1 2 2 3 3 ...) with cross-fades, roughly doubling the duration]

Synchronous overlap-add (SOLA)
Idea: allow some leeway in placing each window to optimize the alignment of the waveforms.
[Figure: existing output (1) and new windowed segment (2); K_m maximizes the alignment of 1 and 2]
Hence
  y^{(m)}[mL+n] = y^{(m-1)}[mL+n] + w[n] · x[⌊m/r⌋L + n + K_m]
where K_m is chosen by (normalized) cross-correlation:
  K_m = argmax_{0 ≤ K ≤ K_u} ( Σ_{n=0}^{N_ov} y^{(m-1)}[mL+n] · x[⌊m/r⌋L+n+K] )
        / √( Σ_n (y^{(m-1)}[mL+n])² · Σ_n (x[⌊m/r⌋L+n+K])² )
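A minimal Matlab sketch of SOLA under the notation above; the function name, the Hann cross-fade window, and the use of floor(m·L/r) (effectively the ⌊m/r⌋L of the equation) are my assumptions, not the course's implementation:

    function y = sola(x, r, N, L, Ku)
    % SOLA timescale modification: y is roughly r times longer than x.
    % x: input signal; r: slowdown factor (>1 -> slower);
    % N: window length; L: output hop (L < N); Ku: max alignment offset.
    x = x(:);
    w = hanning(N);
    Nov = N - L;                              % nominal overlap region
    M = floor((length(x) - N - Ku) * r / L);  % number of frames that fit
    y = zeros(M*L + N, 1);
    y(1:N) = x(1:N) .* w;                     % first segment placed directly
    for m = 1:M
        p = floor(m*L/r);                     % nominal input position
        yov = y(m*L + (1:Nov));               % existing output in the overlap
        best = -Inf; Km = 0;
        for K = 0:Ku                          % search offsets by normalized xcorr
            xov = x(p + K + (1:Nov));
            c = (yov' * xov) / sqrt((yov'*yov) * (xov'*xov) + 1e-10);
            if c > best, best = c; Km = K; end
        end
        seg = x(p + Km + (1:N)) .* w;         % aligned, windowed segment
        y(m*L + (1:N)) = y(m*L + (1:N)) + seg;  % cross-fade & overlap
    end

For example, y = sola(x, 2, 512, 256, 64) roughly doubles the duration of 8 kHz speech; with L = N/2, the overlapped Hann windows sum to approximately constant gain.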
The Fourier domain
Fourier Series (periodic continuous x), with Ω_0 = 2π/T:
  x(t) = Σ_k c_k e^{jkΩ_0 t}
  c_k = (1/T) ∫_{-T/2}^{T/2} x(t) e^{-jkΩ_0 t} dt
[Figure: periodic waveform x(t) and its line spectrum |c_k|]
Fourier Transform (aperiodic continuous x):
  x(t) = (1/2π) ∫ X(jΩ) e^{jΩt} dΩ
  X(jΩ) = ∫ x(t) e^{-jΩt} dt
[Figure: waveform x(t) and its spectrum |X(jΩ)| (level in dB vs. frequency in Hz)]

Discrete-time Fourier
DT Fourier Transform (aperiodic sampled x):
  x[n] = (1/2π) ∫_{-π}^{π} X(e^{jω}) e^{jωn} dω
  X(e^{jω}) = Σ_n x[n] e^{-jωn}
[Figure: sequence x[n] and its spectrum |X(e^{jω})|, periodic in ω with period 2π]
Discrete Fourier Transform (N-point x):
  x[n] = (1/N) Σ_k X[k] e^{j2πkn/N}
  X[k] = Σ_n x[n] e^{-j2πkn/N}
[Figure: sequence x[n] and |X[k]|: samples of |X(e^{jω})| at k = 1...N]

Sampling and aliasing
Discrete-time signals equal the continuous-time signal at discrete sampling instants:
  x_d[n] = x_c(nT)
Sampling cannot represent rapid fluctuations:
  sin((Ω_M + 2π/T)·nT) = sin(Ω_M·nT)  ∀ n ∈ Z
[Figure: two sinusoids passing through the same sample points]
Nyquist limit (Ω_T/2) from the periodic spectrum:
[Figure: baseband spectrum G_a(jΩ) with band edge Ω_M, and its periodic images G_p(jΩ) around ±Ω_T (the "alias" of the "baseband" signal, reaching down to Ω_T - Ω_M)]

Speech sounds in the Fourier domain
[Figure: time-domain and frequency-domain (0-4000 Hz, energy in dB) views of the four sound types: vowel (periodic, "has"); fricative (aperiodic, "watch"); glide (transition, "watch"); stop (transient, "dime")]
dB = 20 log10(amplitude) = 10 log10(power)
The voiced spectrum has pitch + formants.

Short-time Fourier Transform
Want to localize energy in time and frequency:
- break the sound into short-time pieces
- calculate the DFT of each one
[Figure: short-time windows at hops L, 2L, 3L, each mapped by the DFT to a spectral slice m = 0, 1, 2, 3]
Mathematically,
  X[k, m] = Σ_{n=0}^{N-1} x[n] · w[n - mL] · exp(-j 2πk(n - mL)/N)

The Spectrogram
Plot the STFT X[k, m] as a gray-scale image.
[Figure: waveform and close-up spectrogram (2.35-2.6 s), plus the spectrogram of the full utterance (0-4000 Hz, intensity in dB)]

Time-frequency tradeoff
A longer window w[n] gains frequency resolution at the cost of time resolution.
[Figure: wideband (48-point window) vs. narrowband (256-point window) spectrograms of the same utterance]

Speech sounds on the Spectrogram
The most popular speech visualization.
[Figure: spectrogram of "has a watch thin as a dime" annotated with vowel (periodic), fricative (aperiodic), glide (transition), and stop (transient) regions]
Wideband (short window) is better than narrowband (long window) for seeing formants.

TSM with the Spectrogram
Just stretch out the spectrogram?
[Figure: a spectrogram and its time-stretched version]
But how to resynthesize? The spectrogram is only |Y[k, m]|.

The Phase Vocoder
Timescale modification in the STFT domain.
Magnitude from the 'stretched' spectrogram:
  |Y[k, m]| = |X[k, m/r]|
- e.g. by linear interpolation
But preserve the phase increment between slices:
  θ̇_Y[k, m] = θ̇_X[k, m/r]
- e.g. by a discrete differentiator
Does the right thing for a single sinusoid:
- keeps the overlapped parts of the sinusoid aligned
[Figure: phase advances ∆θ across one hop ∆T, i.e. θ̇ = ∆θ/∆T; for a 2x stretch the synthesis phase advances ∆θ' = θ̇·2∆T]

General issues in TSM
Time window:
- stretching a narrowband spectrogram
Malleability of different sounds:
- vowels stretch well; stops lose their nature
Not a well-formed problem?
- we want to alter time without frequency ... but time and frequency are not separate!
- a 'satisfying' result is a subjective judgment ⇒ the solution depends on auditory perception ...
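To ground the STFT definition above, a minimal Matlab sketch (the function name, Hann window, and frame handling are my choices, not a course-supplied routine):

    function X = stft(x, N, L)
    % Short-time Fourier transform: column m holds the DFT of frame m.
    % x: signal; N: window/DFT length; L: hop. X is N x M, complex.
    x = x(:);
    w = hanning(N);
    M = floor((length(x) - N) / L) + 1;
    X = zeros(N, M);
    for m = 0:M-1
        X(:, m+1) = fft(x(m*L + (1:N)) .* w);   % X[k, m]
    end

A gray-scale spectrogram is then, e.g., imagesc(20*log10(abs(X(1:N/2+1, :)) + 1e-10)); axis xy;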
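Building on that stft, a bare-bones phase-vocoder stretch in the spirit of the slides, with linearly interpolated magnitudes and accumulated per-bin phase increments; published implementations (e.g. Dolson 1986, Puckette 1995, below) refine this considerably:

    function y = pvoc_stretch(x, r, N, L)
    % Phase-vocoder TSM: output is roughly r times longer, same pitch.
    X = stft(x, N, L);                 % analysis (see stft above)
    M = size(X, 2);
    t = 1:1/r:M-1;                     % fractional analysis-frame positions
    k = (0:N-1)';
    omega = 2*pi*k*L/N;                % expected phase advance per hop, bin k
    ph = angle(X(:, 1));               % running synthesis phase
    Y = zeros(N, length(t));
    for i = 1:length(t)
        m  = floor(t(i));  fr = t(i) - m;
        mag = (1-fr)*abs(X(:, m)) + fr*abs(X(:, m+1));  % interpolated |X|
        dp = angle(X(:, m+1)) - angle(X(:, m)) - omega; % phase increment
        dp = dp - 2*pi*round(dp/(2*pi));                % wrap to [-pi, pi]
        Y(:, i) = mag .* exp(1j*ph);
        ph = ph + omega + dp;          % accumulate the true per-hop increment
    end
    w = hanning(N);                    % overlap-add resynthesis
    y = zeros((length(t)-1)*L + N, 1);
    for i = 1:length(t)
        y((i-1)*L + (1:N)) = y((i-1)*L + (1:N)) + real(ifft(Y(:, i))) .* w;
    end

For example, y = pvoc_stretch(x, 2, 512, 128); sound(y, fs); plays at the original pitch but roughly twice the duration.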
Summary
Information in sound:
- lots of it, at multiple levels of abstraction
Course overview:
- survey of audio processing topics
- practicals, readings, project
DSP review:
- digital signals, time domain
- Fourier domain, STFT
Timescale modification:
- properties of the speech signal
- time-domain methods
- the phase vocoder

References
J. L. Flanagan and R. M. Golden. Phase vocoder. Bell System Technical Journal, pages 1493-1509, 1966.
M. Dolson. The phase vocoder: a tutorial. Computer Music Journal, 10(4):14-27, 1986.
M. Puckette. Phase-locked vocoder. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 222-225, 1995.
A. T. Cemgil and S. J. Godsill. Probabilistic phase vocoder and its application to interpolation of missing values in audio signals. In Proc. 13th European Signal Processing Conference, Antalya, Turkey, 2005.