ELEN E4896 MUSIC SIGNAL PROCESSING
Lecture 9: Time & Pitch Scaling

1. Time Scale Modification (TSM)
2. Time-Domain Approaches
3. The Phase Vocoder
4. Sinusoidal Approach

Dan Ellis
Dept. Electrical Engineering, Columbia University
dpwe@ee.columbia.edu
http://www.ee.columbia.edu/~dpwe/e4896/
E4896 Music Signal Processing (Dan Ellis), 2013-03-25


1. Time Scale Modification (TSM)

• The basic problem in time scaling: modify a sound to make it "quicker/slower"?
    to examine "detail" in speech/performance
    to synchronize tracks
• Why not just change the sampling rate?
    "slowing the tape"
    e.g. for r times longer (slower), $x_s(t) = x(t/r)$

[Figure: waveform excerpt, panels "Original" and "2x slower"; x-axis: time/s]


Time & Pitch

• Changing speed alters time AND pitch
    how do we change just one?
• Preserve pitch, change time
    preserve pitch = keep local time structure
    but changing global time course?
• Preserve timing, change pitch
    analogous problem
    change time scale, then change sampling rate


2. Time-Domain TSM

• Pre-digital time/pitch scaling: revolving tape heads (Fairbanks et al., 1954)


Simple OLA TSM

• The digital equivalent of revolving tape heads

    $y^{m}[mL+n] = y^{m-1}[mL+n] + w[n] \cdot x\left[\lfloor m/r \rfloor L + n\right]$

    ($y^{m}$: output after adding frame $m$; $L$: hop between output frames; $w[n]$: window; $r$: stretch factor)

[Figure: input waveform with overlapping windows numbered 1 to 6, and the 2x slower output built from windows 1 1 2 2 3 3 4 4 5 5 6; x-axis: time / s]


SOLA / SOLAFS (Roucos & Wilgus 1985; Hejna & Musicus 1991)

• Simple OLA leads to random phase interactions during overlap
    so search for the optimal alignment $K_m$ via a small offset
    $K_m$ maximizes the alignment of frames 1 and 2

    $y^{m}[mL+n] = y^{m-1}[mL+n] + w[n] \cdot x\left[\lfloor m/r \rfloor L + n + K_m\right]$

    find the best $K_m$ with cross-correlation:

    $K_m = \arg\max_{0 \le K \le K_u} \frac{\sum_{n=0}^{N_{ov}} y^{m-1}[mL+n] \cdot x\left[\lfloor m/r \rfloor L + n + K\right]}{\sqrt{\sum_{n=0}^{N_{ov}} \left(y^{m-1}[mL+n]\right)^2 \sum_{n=0}^{N_{ov}} \left(x\left[\lfloor m/r \rfloor L + n + K\right]\right)^2}}$


The Importance of Time Window

• Preserved "local structure" = structure within each window
    shorter windows → less structure preserved

[Figure: waveform panels for window lengths $t_w$ = 25 ms and $t_w$ = 5 ms; x-axis: time / s]


Source-Filter TSM (Kawahara 1997)

• Decompose into excitation + filter before TSM
    filter (e.g. LPC) is easy to scale: change the frame rate
    excitation scaled e.g. by SOLAFS
    .. then reapplying the filter can reduce artifacts

[Figure: block diagram. Filter analysis yields filter coefficients, which are resampled in time; the residual goes through time scale modification; the resampled filter is then applied to the scaled residual]


3. The Phase Vocoder (Flanagan & Golden 1966; Dolson 1986)

• TSM based on the spectrogram
    just stretching it out in time gives "what we want"
    but: STFT magnitude isn't quite enough

[Figure: two spectrograms; axes: Frequency vs. Time]


Magnitude-only reconstruction (Griffin & Lim 1984)

• How to recover $\mathrm{STFT}^{-1}\{|Y(e^{j\omega},t)|\}$?
    need $\angle\{Y(e^{j\omega},t)\}$ ...
• Iterative phase estimation: iterate

    $y^{n}(t) = \mathrm{STFT}^{-1}\{|Y(e^{j\omega},t)| \cdot e^{j\phi^{n}(\omega,t)}\}$
    $\phi^{n+1}(\omega,t) = \angle\{\mathrm{STFT}\{y^{n}(t)\}\}$

    converges to a solution, but SLOWLY

[Figure: SNR of reconstruction (dB) vs. iteration number (0 to 8); curves: magnitude quantization only, starting from quantized phase, starting from random phase; labelled levels 7.6 dB, 4.7 dB, 3.4 dB]


Phase Correction

• Essential idea of the phase vocoder
    preserve the rate of phase change between slices

    $\dot{\phi}(\omega,t) = \frac{d}{dt}\angle\{Y(e^{j\omega},t)\}$ = instantaneous frequency

    (from differencing adjacent frames, or directly...)
    ... phase advance will stretch by the time scaling factor
• Makes sense for a single sinusoid...

    $\dot{\theta} = \Delta\theta / \Delta T$
    $\Delta\theta' = \dot{\theta} \cdot 2\Delta T$

[Figure: phase of a single sinusoid advancing by $\Delta\theta$ over $\Delta T$, and by $\Delta\theta'$ over $2\Delta T$; x-axis: time]


Phase Vocoder Results

• Reconstructed phase gives smooth evolution
    "metallic" extended noise...

[Figure: two spectrograms; axes: freq / kHz vs. time / s]


Time Window (again)

• STFT time-frequency tradeoff determines what "scaling" means
    is pitch structure within the DFT window?

[Figure: two spectrograms; axes: freq / kHz vs. time / s]


4. Sinusoidal TSM

• Phase Vocoder essentially treats each STFT bin as a sinusoid with magnitude $|Y(e^{j\omega},t)|$ and frequency $\dot{\phi}(\omega,t)$
    time scaling simply projects that sinusoid forward
• But: bins around a peak don't behave that way
    need phase alignment to achieve interference
• Model as sinusoids explicitly
    peak-picking, magnitude & frequency estimation
    separate noise signal?

[Figure: spectrogram; axes: Frequency vs. Time]


Summary

• Time / Pitch scale modification
    Intuitive but difficult to define
• Time domain methods
    Preserve local structure
• Spectral methods
    Elongate features in spectrogram


References

• M. Covell, M. Withgott, and M. Slaney, "Mach1: Nonuniform Time-Scale Modification of Speech," Proc. ICASSP, vol. 1, pp. 349-352, 1998.
• Mark Dolson, "The phase vocoder: A tutorial," Computer Music Journal, vol. 10, no. 4, pp. 14-27, 1986.
• G. Fairbanks, W. Everitt, and R. Jaeger, "Method for time or frequency compression-expansion of speech," Transactions of the IRE Professional Group on Audio, vol. 2, no. 1, pp. 7-12, Jan 1954.
• J. L. Flanagan and R. M. Golden, "Phase Vocoder," Bell System Technical Journal, pp. 1493-1509, November 1966.
• D. Griffin and J. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Tr. Acous., Speech, and Sig. Proc., vol. 32, pp. 236-242, 1984.
• Don Hejna and Bruce R. Musicus, "The SOLAFS Time-Scale Modification Algorithm," BBN Technical Report, July 1991.
• Hideki Kawahara, "Speech Representation and Transformation using Adaptive Interpolation of Weighted Spectrum: VOCODER Revisited," Proc. ICASSP, vol. 2, pp. 1303-1306, 1997.
• Jean Laroche, "Time and Pitch Scale Modification of Audio Signals," chapter 7 in Applications of Digital Signal Processing to Audio and Acoustics, ed. M. Kahrs & K. Brandenburg, Kluwer, 1998.
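
Sketch: simple OLA TSM in code

The overlap-add recursion on the "Simple OLA TSM" slide can be illustrated with a few lines of plain Python. This is only a minimal sketch under stated assumptions, not the lecture's own implementation: the names `hann` and `ola_tsm`, the periodic Hann window, the 50% overlap (hop L = N/2), the default N = 512, and reading the input position as floor(m/r)·L are all choices made here for illustration.

```python
import math

def hann(N):
    # Periodic Hann window (assumed here): copies shifted by N/2 sum to 1.
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]

def ola_tsm(x, r, N=512):
    """Simple OLA time scaling; output is roughly r times longer than x.

    x : list of samples, len(x) >= N
    r : stretch factor (r > 1 is slower), as in x_s(t) = x(t/r)
    N : window length in samples (hypothetical default)
    """
    L = N // 2                              # hop between output frames
    w = hann(N)
    M = int(r * (len(x) - N) / L) + 1       # number of output frames
    y = [0.0] * ((M - 1) * L + N)
    for m in range(M):
        # Input read position floor(m/r)*L, clamped to stay inside x.
        src = min(int(m / r) * L, len(x) - N)
        for n in range(N):
            # y^m[mL+n] = y^(m-1)[mL+n] + w[n] * x[src + n]
            y[m * L + n] += w[n] * x[src + n]
    return y
```

For example, `ola_tsm([1.0] * 4096, 2.0)` returns 7680 samples whose interior is flat, because the overlapped windows sum to one. With a real signal the overlapping segments are not phase-aligned, which is the artifact SOLA/SOLAFS addresses by searching for the offset $K_m$.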