EE E6820: Speech & Audio Processing & Recognition
Lecture 6: Nonspeech and Music

Dan Ellis <dpwe@ee.columbia.edu>
Michael Mandel <mim@ee.columbia.edu>
Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/~dpwe/e6820
February 26, 2009

Outline
1. Music & nonspeech
2. Environmental sounds
3. Music synthesis techniques
4. Sinewave synthesis

What is 'nonspeech'?
- according to research effort: a little music
- in the world: most everything
[Figure: everyday sounds arranged by information content (high: speech, music, animal sounds; low: wind & water, contact/collision, machines & engines) and origin (natural: wind & water; man-made: machines & engines)]
- what attributes describe them?

Sound attributes
- What do we notice about 'general' sound?
  - psychophysics: pitch, loudness, 'timbre'
  - bright/dull; sharp/soft; grating/soothing
  - sound is not 'abstract': the tendency is to describe it by source-events
- Ecological perspective:
  - what matters about sound is 'what happened' → our percepts express this more-or-less directly
- Attributes suggest model parameters

Motivations for modeling
- Describe/classify: cast sound into a model because we want to use the resulting parameters
- Store/transmit: the model implicitly exploits the limited structure of the signal
- Resynthesize/modify: the model separates out the interesting parameters
[Figure: a sound mapped into a model parameter space]

Analysis and synthesis
- Analysis is the converse of synthesis:
  sound → analysis → model/representation → synthesis → sound
- They can exist apart:
  - analysis for classification
  - synthesis of artificial sounds
- They are often used together:
  - encoding/decoding of compressed formats
  - resynthesis based on analyses
  - analysis-by-synthesis

Environmental Sounds
- Where sound comes from: mechanical interactions
  - contact / collisions
  - rubbing / scraping
  - ringing / vibrating
- Interest in environmental sounds:
  - they carry information about events around us, including indirect hints
  - we need to create them in virtual environments, including soundtracks
- Approaches to synthesis:
  - recording / sampling
  - synthesis algorithms

Collision sounds
- Factors influencing the sound:
  - colliding bodies: size, material, damping
  - local properties at the contact point (hardness)
  - energy of the collision
- Source-filter model (a toy sketch follows below):
  - 'source' = excitation of the collision event (energy, local properties at contact)
  - 'filter' = resonance and radiation of energy (body properties)
- Variety of strike/scraping sounds (from Gaver, 1993):
  - resonant frequencies ~ size/shape
  - damping ~ material
  - HF content in the excitation/strike ~ mallet, force
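To make the source-filter picture concrete, here is a minimal Python sketch, not Gaver's actual algorithm: the 'source' is a short noise burst and the 'filter' is a bank of exponentially damped resonant modes. The mode frequencies, decay rates, and the hardness-to-burst-length mapping are all illustrative assumptions.

```python
import numpy as np

def strike(sr=16000, dur=1.0, modes=((400.0, 3.0), (1070.0, 4.5), (1990.0, 6.0)),
           hardness=0.5, energy=1.0):
    """Toy collision sound via a source-filter model.

    modes    -- (resonant frequency Hz, decay rate 1/s) pairs: stand in for
                the body's size/shape (frequencies) and material (decays)
    hardness -- 0..1; a harder contact gives a shorter, brighter excitation
    energy   -- overall strike level
    """
    t = np.arange(int(sr * dur)) / sr
    # Source: a brief noise burst whose length shrinks with contact hardness
    burst_len = max(4, int(sr * 0.004 * (1.0 - hardness)) + 4)
    excitation = np.zeros_like(t)
    excitation[:burst_len] = np.random.randn(burst_len)
    # Filter: each body mode rings as an exponentially damped sinusoid
    out = np.zeros_like(t)
    for freq, decay in modes:
        mode = np.exp(-decay * t) * np.cos(2 * np.pi * freq * t)
        out += np.convolve(excitation, mode)[:len(t)]
    return energy * out / (np.max(np.abs(out)) + 1e-9)

tap = strike(hardness=0.9, modes=((520.0, 9.0), (1340.0, 14.0)))  # fast decay: wood-like
ring = strike(hardness=0.9, modes=((660.0, 0.8), (1580.0, 1.2)))  # slow decay: metal-like
```

Varying only the decay rates moves the result from a dull tap toward a metallic ring, matching the slide's 'damping ~ material' observation.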
Sound textures
- What do we hear in:
  - a city street?
  - a symphony orchestra?
- How do we distinguish:
  - waterfall
  - rainfall
  - applause
  - static
[Figure: spectrograms of applause ('Applause04') and rain ('Rain01'), 0-5 kHz over 4 s]
- Levels of ecological description...

Sound texture modeling (Athineos)
- Model the broad spectral structure with LPC
  - could just resynthesize with noise
- Model the fine temporal structure of the residual with linear prediction in the frequency domain:
  sound y[n] → TD-LP → whitened residual e[n] → DCT → E[k] → FD-LP → residual spectrum
  - TD-LP, y[n] = \sum_i a_i y[n-i]: per-frame spectral parameters
  - FD-LP, E[k] = \sum_i b_i E[k-i]: per-frame temporal envelope parameters
- FD-LP is the precise dual of LPC in frequency: its 'poles' model temporal events
[Figure: residual temporal envelopes fit with 40 poles per 256 ms frame]
- Allows modification / synthesis? (a sketch of the envelope fit follows below)
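As a rough illustration of the FD-LP step, here is a sketch under simple assumptions (it is not Athineos & Ellis's implementation): just as LPC on a waveform fits its spectral envelope, linear prediction across the DCT coefficients of a frame fits that frame's temporal envelope. The frame length and pole count follow the figure; the LPC helper is the standard autocorrelation method.

```python
import numpy as np
from scipy.fftpack import dct
from scipy.linalg import toeplitz
from scipy.signal import freqz

def lpc(x, order):
    """Linear prediction coefficients by the autocorrelation method."""
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    a = np.linalg.solve(toeplitz(r[:-1]), r[1:])
    return np.concatenate(([1.0], -a))            # A(z) = 1 - sum_i a_i z^-i

def fdlp_envelope(frame, poles=40):
    """Temporal envelope of a frame via linear prediction on its DCT.

    By time-frequency duality, the all-pole fit to the DCT coefficients,
    evaluated along its 'frequency' axis, traces the frame's temporal envelope.
    """
    coeffs = dct(frame, type=2, norm='ortho')     # frame -> frequency domain
    a = lpc(coeffs, poles)
    _, h = freqz([1.0], a, worN=len(frame))       # 'spectrum' of the fit = envelope
    return np.abs(h)

sr = 16000
n = int(0.256 * sr)                               # 256 ms frame, as in the figure
frame = np.random.randn(n) * np.hanning(n)        # toy residual signal
env = fdlp_envelope(frame, poles=40)              # one envelope value per sample
```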
Music synthesis techniques
- What is music? It could be anything → flexible synthesis is needed!
- Key elements of conventional music:
  - notes: events with a time, a pitch, and an accent level
  - melody, harmony, rhythm: patterns of repetition & variation
- Synthesis framework:
  - instruments: a common framework for many notes
  - score: a sequence of (time, pitch, level) note events
[Figure: opening of the 'Hallelujah' chorus from G.F. Handel's Messiah, SATB score (http://www.score-on-line.com)]

The nature of musical instrument notes
- Characterized by instrument (register), note, loudness/emphasis, articulation...
[Figure: spectrograms of piano, violin, clarinet, and trumpet notes, 0-4 kHz over 4 s]
- Distinguish how?

Development of music synthesis
- Goals of music synthesis:
  - generate realistic / pleasant new notes
  - control / explore timbre (quality)
- Earliest computer systems in the 1960s (voice synthesis, algorithmic)
- Pure synthesis approaches:
  - 1970s: analog synths
  - 1980s: FM (Stanford/Yamaha)
  - 1990s: physical modeling, hybrids
- Analysis-synthesis methods:
  - sampling / wavetables
  - sinusoid modeling
  - harmonics + noise (+ transients)
  - others?

Analog synthesis
- The minimum to make an 'interesting' sound
[Figure: block diagram: a trigger fires an envelope; pitch plus vibrato drives an oscillator, through a filter with time-varying cutoff frequency and a gain stage, to the sound output]
- Elements:
  - harmonics-rich oscillators
  - time-varying filters
  - time-varying envelopes
  - modulation: low-frequency + envelope-based
- Result: a time-varying spectrum with independently controlled pitch

FM synthesis
- Fast frequency modulation creates sidebands:
  cos(\omega_c t + \beta sin(\omega_m t)) = \sum_{n=-\infty}^{\infty} J_n(\beta) cos((\omega_c + n \omega_m) t)
  - a harmonic series if \omega_c = r \omega_m
- J_n(\beta) is a Bessel function, with J_n(\beta) ≈ 0 for \beta < n - 2
[Figure: Bessel functions J_0 through J_4 against modulation index \beta]
- → complex harmonic spectra by varying \beta
[Figure: spectrogram of an FM tone with \omega_c = 2000 Hz, \omega_m = 200 Hz as \beta increases]
- What use? (a minimal example follows below)
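The formula above translates directly into a few lines of Python. This is a minimal sketch with illustrative parameters (matching the figure's \omega_c = 2000 Hz, \omega_m = 200 Hz), not a recreation of any commercial FM patch: ramping the modulation index \beta spreads energy into higher sidebands, brightening the harmonic spectrum over the note.

```python
import numpy as np

def fm_note(fc=2000.0, fm=200.0, dur=1.0, sr=16000, beta_max=8.0):
    """Simple FM synthesis: cos(2*pi*fc*t + beta(t) * sin(2*pi*fm*t)).

    With fc an integer multiple of fm, the sidebands at fc + n*fm form a
    harmonic series; since J_n(beta) is negligible until beta exceeds about
    n - 2, ramping beta(t) progressively excites higher harmonics.
    """
    t = np.arange(int(sr * dur)) / sr
    beta = beta_max * t / dur                 # modulation index ramps 0 -> beta_max
    envelope = np.exp(-2.0 * t)               # simple decaying amplitude envelope
    return envelope * np.cos(2 * np.pi * fc * t + beta * np.sin(2 * np.pi * fm * t))

tone = fm_note()                              # harmonic series on 200 Hz, brightening
```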
Sampling synthesis
- Resynthesize from real notes → vary pitch, duration, level
- Pitch: stretch (resample) the waveform
[Figure: a 596 Hz note resampled to 894 Hz]
- Duration: loop a 'sustain' section
[Figure: a sustain segment looped to extend a note]
- Level: cross-fade different examples
[Figure: soft and loud recordings mixed as a function of key velocity]
- Need to 'line up' the source samples
- Good & bad?

Sinewave synthesis
- If patterns of harmonics are what matter, why not generate them all explicitly:
  s[n] = \sum_k A_k[n] cos(k \omega_0[n] n)
  - a particularly powerful model for pitched signals
- Analysis (as with speech):
  - find peaks in the STFT magnitude |S[\omega, n]| & track them,
  - or track the fundamental \omega_0 (harmonics / autocorrelation) & sample the STFT at k \omega_0
  → a set of A_k[n] that duplicates the tone
[Figure: spectrogram of a tone and the harmonic amplitude surface sampled from it]
- Synthesis via a bank of oscillators

Steps to sinewave modeling - 1
- The underlying STFT:
  X[k, n_0] = \sum_{n=0}^{N-1} x[n + n_0] w[n] exp(-j 2\pi k n / N)
  - what value for N (FFT length & window size)?
  - what value for H (hop size: n_0 = rH, r = 0, 1, 2, ...)?
- The STFT window length determines the frequency resolution:
  X_w(e^{j\omega}) = X(e^{j\omega}) * W(e^{j\omega})
- Choose N long enough to resolve harmonics → 2-3x the longest (lowest) fundamental period
  - e.g. 30-60 ms = 480-960 samples @ 16 kHz
  - choose H ≤ N/2
- N too long → lost time resolution
  - limits how fast sinusoid amplitudes can change

Steps to sinewave modeling - 2
- Choose candidate sinusoids at each time by picking peaks in each STFT frame
[Figure: spectrogram and a single frame's spectrum with its local peaks marked]
- Quadratic fit for each peak: fit a parabola y = a x (x - b) through the bins around the peak; its vertex at x = b/2 (height -a b^2 / 4) gives the interpolated frequency and magnitude
- Plus linear interpolation of the unwrapped phase

Steps to sinewave modeling - 3
- Which peaks to pick? We want 'true' sinusoids, not noise fluctuations
  - use a 'prominence' threshold above the smoothed spectrum
[Figure: a frame's spectrum with the smoothed-spectrum threshold overlaid]
- Sinusoids exhibit stability:
  - of amplitude in time
  - of phase derivative in time
  → compare with adjacent time frames to test?

Steps to sinewave modeling - 4
- 'Grow' tracks by appending newly-found peaks to existing tracks
[Figure: tracks extending frame-to-frame, with births, deaths, and ambiguous assignments of new peaks]
- Ambiguous assignments are possible
- An unclaimed new peak:
  - 'birth' of a new track
  - backtrack to find its earliest trace?
- No continuation peak for an existing track:
  - 'death' of the track
  - or: reduce the peak threshold, for hysteresis

Resynthesis of sinewave models
- After analysis, each track defines contours in frequency and amplitude, f_k[n] and A_k[n] (+ phase?)
- Use them to drive a bank of sinewave oscillators & sum the outputs: \sum_k A_k[n] cos(2\pi f_k[n] n)
[Figure: one track's A_k[n] and f_k[n] contours and the resulting oscillator output]
- 'Regularize' to exactly harmonic: f_k[n] = k f_0[n]
[Figure: raw harmonic tracks vs. their regularized, f_0-locked version]
- (a sketch of the oscillator bank follows below)
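A compact oscillator bank in Python, sketched under simple assumptions (frame-rate tracks linearly interpolated to the sample rate, and phase obtained by integrating the frequency contour rather than carried over from analysis):

```python
import numpy as np

def oscillator_bank(freqs, amps, hop, sr):
    """Resynthesize a sound from sinusoid tracks.

    freqs, amps -- (n_tracks, n_frames) arrays: f_k[n] in Hz and A_k[n]
    hop         -- analysis hop size in samples between frames
    sr          -- sample rate in Hz
    """
    n_tracks, n_frames = freqs.shape
    frame_times = np.arange(n_frames) * hop          # frame positions in samples
    t = np.arange((n_frames - 1) * hop)              # output sample indices
    out = np.zeros(len(t))
    for k in range(n_tracks):
        f = np.interp(t, frame_times, freqs[k])      # per-sample frequency contour
        a = np.interp(t, frame_times, amps[k])       # per-sample amplitude contour
        phase = 2 * np.pi * np.cumsum(f) / sr        # phase = integral of frequency
        out += a * np.cos(phase)
    return out

# Toy 'regularized' input: three exactly harmonic tracks, f_k[n] = k * f0[n]
sr, hop = 16000, 160
f0 = np.linspace(220.0, 230.0, 50)                   # slightly rising fundamental
freqs = np.stack([k * f0 for k in (1, 2, 3)])
amps = np.stack([np.full(50, a) for a in (1.0, 0.5, 0.25)])
y = oscillator_bank(freqs, amps, hop, sr)
```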
Modification in sinewave resynthesis
- Change duration by warping the timebase
  - may want to keep the onset unwarped
[Figure: harmonic tracks with the sustain stretched but the onset left intact]
- Change pitch by scaling frequencies
  - either stretching or resampling the spectral envelope
[Figure: a spectral envelope stretched vs. resampled under pitch scaling]
- Change timbre by interpolating parameters

Sinusoids + residual
- Only 'prominent peaks' became tracks
  - the remainder of the spectral energy was noisy?
  → model the residual energy with noise
- How to obtain the 'non-harmonic' spectrum?
  - zero out the spectrum near the extracted peaks?
  - or: resynthesize (exactly) & subtract the waveforms:
    e_s[n] = s[n] - \sum_k A_k[n] cos(2\pi n f_k[n])
    ...which must preserve phase!
[Figure: spectra of the original, the extracted sinusoids, the residual, and its LPC fit]
- Can model the residual signal with LPC
  → a flexible representation of the noisy residual

Sinusoids + noise + transients
- Sound represented as sinusoids plus filtered noise:
  s[n] = \sum_k A_k[n] cos(2\pi n f_k[n]) + h_n[n] * b[n]
- Parameters are A_k[n], f_k[n], h_n[n]
[Figure: spectrograms of the full sound, the sinusoids, and the residual e_s[n]]
- Separate out abrupt transients in the residual?
  e_s[n] = \sum_k t_k[n] + h_n[n] * b'[n]
  - more specific → more flexible

Summary
- 'Nonspeech audio'
  - i.e. sound in general
  - characteristics: ecological
- Music synthesis
  - control of pitch, duration, loudness, articulation
  - evolution of techniques
  - sinusoids + noise + transients
- Music analysis... and beyond?

References

W.W. Gaver. Synthesizing auditory icons. In Proc. Conference on Human Factors in Computing Systems (INTERCHI-93), pages 228-235. Addison-Wesley, 1993.

M. Athineos and D.P.W. Ellis. Autoregressive modeling of temporal envelopes. IEEE Transactions on Signal Processing, 55(11):5237-5245, 2007. URL http://www.ee.columbia.edu/~dpwe/pubs/AthinE07-fdlp.pdf.

X. Serra and J. Smith III. Spectral modeling synthesis: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition. Computer Music Journal, 14(4):12-24, 1990.

T.S. Verma and T.H.Y. Meng. An analysis/synthesis tool for transient signals that allows a flexible sines + transients + noise model for audio. In Proc. ICASSP, vol. VI, pages 3573-3576, Seattle, 1998.