
EE E6820: Speech & Audio Processing & Recognition
Lecture 6:
Nonspeech and Music
Dan Ellis <dpwe@ee.columbia.edu>
Michael Mandel <mim@ee.columbia.edu>
Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/~dpwe/e6820
February 26, 2009
1. Music & nonspeech
2. Environmental Sounds
3. Music Synthesis Techniques
4. Sinewave Synthesis
Outline
1. Music & nonspeech
2. Environmental Sounds
3. Music Synthesis Techniques
4. Sinewave Synthesis
Music & nonspeech
What is ‘nonspeech’?
- according to research effort: a little music
- in the world: most everything
[Figure: sound classes mapped by Information content (low–high) vs. Origin (natural–man-made): wind & water, contact/collision, animal sounds, machines & engines, music, speech]
attributes?
Sound attributes
Attributes suggest model parameters
What do we notice about ‘general’ sound?
- psychophysics: pitch, loudness, ‘timbre’
- bright/dull; sharp/soft; grating/soothing
- sound is not ‘abstract’: the tendency is to describe by source-events
Ecological perspective
- what matters about sound is ‘what happened’
  → our percepts express this more-or-less directly
Motivations for modeling
Describe/classify
- cast sound into a model because we want to use the resulting parameters
Store/transmit
- the model implicitly exploits the limited structure of the signal
Resynthesize/modify
- the model separates out the interesting parameters
[Figure: Sound mapped into a model parameter space]
Analysis and synthesis
Analysis is the converse of synthesis:
[Figure: Analysis maps Sound to a Model/representation; Synthesis maps it back to Sound]
Can exist apart:
- analysis for classification
- synthesis of artificial sounds
Often used together:
- encoding/decoding of compressed formats
- resynthesis based on analyses
- analysis-by-synthesis
Outline
1. Music & nonspeech
2. Environmental Sounds
3. Music Synthesis Techniques
4. Sinewave Synthesis
Environmental Sounds
Where sound comes from: mechanical interactions
- contact / collisions
- rubbing / scraping
- ringing / vibrating
Interest in environmental sounds
- carry information about events around us, including indirect hints
- need to create them in virtual environments, including soundtracks
Approaches to synthesis
- recording / sampling
- synthesis algorithms
Collision sounds
Factors influencing the sound:
- colliding bodies: size, material, damping
- local properties at contact point (hardness)
- energy of collision
Source-filter model (sketch below)
- ‘source’ = excitation of the collision event (energy, local properties at contact)
- ‘filter’ = resonance and radiation of energy (body properties)
Variety of strike/scraping sounds (from Gaver, 1993)
- resonant freqs ~ size/shape
- damping ~ material
- HF content in excitation/strike ~ mallet, force
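A minimal source-filter sketch of a collision sound (modal synthesis in the spirit of Gaver, 1993): a short noise burst (the ‘source’: strike energy and contact hardness) excites a bank of damped two-pole resonators (the ‘filter’: body resonances). The mode frequencies, decay times, and gains are illustrative assumptions, not values from the lecture.

```python
import numpy as np
from scipy.signal import lfilter

sr = 16000                                  # sample rate / Hz
exc = np.random.randn(int(0.005 * sr))      # 5 ms noise burst ~ strike energy
exc *= np.hanning(len(exc))                 # harder strike -> shorter, brighter burst
exc = np.pad(exc, (0, sr - len(exc)))       # 1 s of output

# (freq / Hz, decay time / s, gain): freqs ~ size/shape, decay ~ material
modes = [(440, 0.4, 1.0), (1170, 0.25, 0.5), (2210, 0.15, 0.25)]

y = np.zeros_like(exc)
for f, tau, g in modes:
    r = np.exp(-1.0 / (tau * sr))           # pole radius sets damping
    w = 2 * np.pi * f / sr                  # pole angle sets resonant freq
    y += g * lfilter([1.0], [1.0, -2 * r * np.cos(w), r * r], exc)
y /= np.max(np.abs(y))                      # normalize
```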
Sound textures
What do we hear in:
- a city street
- a symphony orchestra
How do we distinguish:
- waterfall
- rainfall
- applause
- static
[Figure: spectrograms of ‘Rain01’ and ‘Applause04’, 0–5000 Hz over 4 s]
Levels of ecological description...
Sound texture modeling
(Athineos)
Model broad spectral structure with LPC
- could just resynthesize with noise
Model fine temporal structure of the residual with linear prediction in the frequency domain
[Figure: Sound y[n] → TD-LP → whitened residual e[n] → DCT → E[k] → FD-LP; TD-LP fits y[n] = Σᵢ aᵢ y[n−i] (per-frame spectral parameters), FD-LP fits E[k] = Σᵢ bᵢ E[k−i] (per-frame temporal envelope parameters)]
- FD-LP is the precise dual of time-domain LPC: ‘poles’ model temporal events
[Figure: waveform and its temporal envelopes estimated with 40 poles over 256 ms frames]
Allows modification / synthesis?
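A rough sketch of the frequency-domain linear prediction (FDLP) idea behind this slide (Athineos & Ellis): take the DCT of a 256 ms frame, fit a 40-pole linear predictor to the DCT sequence, and read the all-pole ‘spectrum’ as an estimate of the frame’s temporal envelope. The autocorrelation-method LP fit via scipy.linalg.solve_toeplitz and the envelope evaluation are my assumptions about one workable implementation, not the authors’ code.

```python
import numpy as np
from scipy.fft import dct
from scipy.linalg import solve_toeplitz

def lpc(x, order):
    """LP coefficients by the autocorrelation (Yule-Walker) method."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    return solve_toeplitz((r[:-1], r[:-1]), r[1:])   # x[n] ~ sum_i a_i x[n-i]

def fdlp_envelope(frame, order=40):
    """All-pole estimate of the frame's temporal (Hilbert) envelope."""
    X = dct(frame, type=2, norm="ortho")             # 'frequency domain' sequence
    a = lpc(X, order)
    # all-pole response read out along the frame's time axis (approximate)
    A = np.fft.rfft(np.concatenate(([1.0], -a)), 2 * len(frame))
    return 1.0 / (np.abs(A[:len(frame)]) ** 2 + 1e-12)

sr = 16000
frame = np.random.randn(int(0.256 * sr))             # stand-in 256 ms frame
env = fdlp_envelope(frame, order=40)                 # 40 poles, as on the slide
```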
Outline
1. Music & nonspeech
2. Environmental Sounds
3. Music Synthesis Techniques
4. Sinewave Synthesis
Music synthesis techniques
[Figure: opening bars of the ‘Hallelujah’ chorus from Handel’s Messiah (score from http://www.score-on-line.com)]
What is music?
- could be anything → flexible synthesis needed!
Key elements of conventional music
- instruments → note-events (time, pitch, accent/level)
- melody, harmony, rhythm
- patterns of repetition & variation
Synthesis framework (toy version sketched below):
- instruments: common framework for many notes
- score: sequence of (time, pitch, level) note events
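A toy rendering of the framework above, to make the division of labor concrete: the ‘score’ is a list of (onset time, pitch, level) note events, and the ‘instrument’ is any function from (pitch, level, duration) to samples. The decaying sinusoid here is a placeholder instrument, not anything from the lecture.

```python
import numpy as np

sr = 16000

def instrument(f0, level, dur):
    """Placeholder instrument: exponentially decaying sinusoid."""
    t = np.arange(int(dur * sr)) / sr
    return level * np.exp(-3 * t) * np.sin(2 * np.pi * f0 * t)

# score: (onset time / s, pitch / Hz, level) note events
score = [(0.0, 440.0, 0.8), (0.5, 550.0, 0.6), (1.0, 660.0, 0.7)]

out = np.zeros(int(2.0 * sr))
for onset, f0, level in score:
    note = instrument(f0, level, 0.8)
    i = int(onset * sr)
    out[i:i + len(note)] += note               # mix note into the output
```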
The nature of musical instrument notes
Characterized by instrument (register), note, loudness/emphasis, articulation...
[Figure: spectrograms (0–4000 Hz over 4 s) of single notes on piano, violin, clarinet, and trumpet]
Distinguish how?
Development of music synthesis
Goals of music synthesis:
- generate realistic / pleasant new notes
- control / explore timbre (quality)
Earliest computer systems in the 1960s (voice synthesis, algorithmic)
Pure synthesis approaches:
- 1970s: analog synths
- 1980s: FM (Stanford/Yamaha)
- 1990s: physical modeling, hybrids
Analysis-synthesis methods:
- sampling / wavetables
- sinusoid modeling
- harmonics + noise (+ transients)
- others?
Analog synthesis
The minimum to make an ‘interesting’ sound
[Figure: signal flow: trigger → envelope; pitch + vibrato → oscillator → filter (cutoff freq) → gain → sound]
Elements:
- harmonics-rich oscillators
- time-varying filters
- time-varying envelope
- modulation: low frequency + envelope-based
Result:
- time-varying spectrum, independent pitch (sketch below)
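A minimal subtractive sketch of those four elements: a harmonics-rich sawtooth with slow vibrato, a time-varying lowpass whose cutoff decays, and an amplitude envelope. All frequencies, sweep rates, and decay constants are invented for illustration.

```python
import numpy as np
from scipy.signal import butter, lfilter

sr, dur = 16000, 1.0
t = np.arange(int(sr * dur)) / sr

f0 = 220.0 * (1 + 0.01 * np.sin(2 * np.pi * 5 * t))    # 5 Hz vibrato on pitch
phase = 2 * np.pi * np.cumsum(f0) / sr
saw = 2 * ((phase / (2 * np.pi)) % 1.0) - 1            # harmonics-rich oscillator

out = np.zeros_like(saw)
zi = np.zeros(2)                                       # carry filter state across blocks
block = 512
for i in range(0, len(saw), block):
    cutoff = 4000 * np.exp(-3 * i / sr) + 200          # decaying cutoff sweep / Hz
    b, a = butter(2, cutoff / (sr / 2))                # 2nd-order lowpass
    out[i:i + block], zi = lfilter(b, a, saw[i:i + block], zi=zi)

out *= np.exp(-2 * t)                                  # overall amplitude envelope
```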
FM synthesis
Fast frequency modulation → sidebands:
  cos(ω_c t + β sin(ω_m t)) = Σ_{n=−∞}^{∞} J_n(β) cos((ω_c + n·ω_m) t)
- a harmonic series if ω_c = r·ω_m
J_n(β) is a Bessel function: J_n(β) ≈ 0 for β < n − 2
[Figure: Bessel functions J_0 … J_4 versus modulation index β, 0 ≤ β ≤ 9]
→ Complex harmonic spectra by varying β
[Figure: spectrogram (0–4000 Hz over 0.8 s) of an FM tone, ω_c = 2000 Hz, ω_m = 200 Hz, as β varies]
what use?
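The FM equation rendered directly, with the figure’s carrier and modulator (f_c = 2000 Hz, f_m = 200 Hz) and a linear sweep of β from 0 to 8, so the harmonic spectrum thickens over time much as in the spectrogram.

```python
import numpy as np

sr, dur = 16000, 0.8
t = np.arange(int(sr * dur)) / sr
fc, fm = 2000.0, 200.0                       # carrier and modulator / Hz
beta = 8.0 * t / dur                         # modulation index sweep 0 -> 8
x = np.cos(2 * np.pi * fc * t + beta * np.sin(2 * np.pi * fm * t))
```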
Sampling synthesis
Resynthesis from real notes
→ vary pitch, duration, level
Pitch: stretch (resample) the waveform (sketch below)
[Figure: one period of a 596 Hz note, and the same waveform resampled to 894 Hz]
Duration: loop a ‘sustain’ section
[Figure: a 0.3 s note extended by looping a short sustain segment]
Level: cross-fade different examples
[Figure: soft and loud recordings of the same note, mixed according to velocity]
- need to ‘line up’ source samples
- good & bad?
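The pitch step as a one-liner around scipy.signal.resample: resample the note so its fundamental lands at the target frequency. Note that this FFT-based resampling scales duration and formants along with pitch, which is part of the ‘good & bad’ of sampling.

```python
import numpy as np
from scipy.signal import resample

def repitch(note, f_orig, f_target):
    """Resample `note` so that f_orig is transposed to f_target."""
    n_out = int(round(len(note) * f_orig / f_target))
    return resample(note, n_out)             # shorter signal -> higher pitch

sr = 16000
t = np.arange(sr) / sr
note = np.sin(2 * np.pi * 596 * t)           # stand-in for a sampled note
shifted = repitch(note, 596.0, 894.0)        # as in the figure: 596 -> 894 Hz
```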
Outline
1. Music & nonspeech
2. Environmental Sounds
3. Music Synthesis Techniques
4. Sinewave Synthesis
Sinewave synthesis
If patterns of harmonics are what matter, why not generate them all explicitly:
  s[n] = Σ_k A_k[n] cos(k · ω_0[n] · n)
- a particularly powerful model for pitched signals
Analysis (as with speech):
- find peaks in the STFT |S[ω, n]| & track them
- or track the fundamental ω_0 (harmonics / autocorrelation) & sample the STFT at k·ω_0
→ set of A_k[n] to duplicate the tone:
[Figure: spectrogram of a tone (0–8000 Hz) and the harmonic magnitudes A_k[n] sampled from it]
Synthesis via bank of oscillators
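A direct bank-of-oscillators realization of s[n] = Σ_k A_k[n] cos(k·ω_0[n]·n). Integrating the instantaneous fundamental into a running phase keeps each harmonic continuous as ω_0[n] moves; the fundamental contour and amplitude envelopes here are invented for illustration.

```python
import numpy as np

sr, dur, K = 16000, 0.5, 10
n = np.arange(int(sr * dur))
f0 = 440 + 20 * n / len(n)                   # slowly rising fundamental / Hz
phase = 2 * np.pi * np.cumsum(f0) / sr       # running fundamental phase

s = np.zeros(len(n))
for k in range(1, K + 1):
    A_k = np.exp(-k * n / len(n)) / k        # made-up per-harmonic envelope
    s += A_k * np.cos(k * phase)             # k-th harmonic oscillator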
Steps to sinewave modeling - 1
The underlying STFT:
  X[k, n_0] = Σ_{n=0}^{N−1} x[n + n_0] · w[n] · exp(−j 2πkn/N)
- what value for N (FFT length & window size)?
- what value for H (hop size: n_0 = r·H, r = 0, 1, 2, …)?
STFT window length determines frequency resolution:
  X_w(e^{jω}) = X(e^{jω}) ∗ W(e^{jω})
Choose N long enough to resolve harmonics
→ 2–3× the longest (lowest) fundamental period
- e.g. 30–60 ms = 480–960 samples @ 16 kHz
- choose H ≤ N/2
N too long → lost time resolution
- limits the rate at which sinusoid amplitudes can change
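The analysis STFT written out directly, with the slide’s rule-of-thumb values at 16 kHz (N = 768 samples ≈ 48 ms, H = N/2). A minimal sketch; a practical system would use a library’s framing helpers.

```python
import numpy as np

def stft(x, N=768, H=384):
    """Rows are frames X[k, n0], n0 = 0, H, 2H, ..."""
    w = np.hanning(N)
    frames = [x[n0:n0 + N] * w
              for n0 in range(0, len(x) - N + 1, H)]
    return np.array([np.fft.rfft(f) for f in frames])

sr = 16000
x = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)   # test tone
X = stft(x)                                        # |X| gives the spectrogram
```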
Steps to sinewave modeling - 2
Choose candidate sinusoids at each time by picking peaks in each STFT frame:
[Figure: spectrogram (0–8000 Hz) and one frame’s dB spectrum with candidate peaks marked]
Quadratic fit for peak (sketch below):
[Figure: parabola y = ax(x−b) fit through the peak bin and its neighbours; maximum at x = b/2, height −ab²/4; fits shown for level (dB) and phase near the peak]
+ linear interpolation of unwrapped phase
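One standard realization of the quadratic fit: put a parabola through the log-magnitudes of the peak bin and its two neighbours and read off the fractional-bin maximum and interpolated level. This matches the general technique named on the slide, though not necessarily its exact algebra.

```python
import numpy as np

def quad_peak(mag_db, k):
    """Refine peak at bin k; return (fractional bin, peak level / dB)."""
    a, b, c = mag_db[k - 1], mag_db[k], mag_db[k + 1]
    p = 0.5 * (a - c) / (a - 2 * b + c)      # offset in bins, |p| <= 0.5
    return k + p, b - 0.25 * (a - c) * p

sr, N = 16000, 1024
x = np.sin(2 * np.pi * 1003.7 * np.arange(N) / sr) * np.hanning(N)
S = 20 * np.log10(np.abs(np.fft.rfft(x)) + 1e-12)
k = int(np.argmax(S))
bin_f, level = quad_peak(S, k)
freq = bin_f * sr / N                        # ~1003.7 Hz, despite 15.6 Hz bins
```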
Steps to sinewave modeling - 3
Which peaks to pick?
Want ‘true’ sinusoids, not noise fluctuations
- ‘prominence’ threshold above a smoothed spectrum (sketch below)
[Figure: dB spectrum with smoothed version overlaid; only peaks standing well above it are kept]
Sinusoids exhibit stability...
- of amplitude in time
- of phase derivative in time
→ compare with adjacent time frames to test?
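A possible form of the prominence test: smooth the dB spectrum with a moving average and keep local maxima that stand some margin above it. The 10 dB margin and 32-bin smoothing width are assumptions, not values from the lecture.

```python
import numpy as np

def prominent_peaks(mag_db, margin_db=10.0, width=32):
    """Bins that are local maxima and sit margin_db above a smoothed spectrum."""
    smooth = np.convolve(mag_db, np.ones(width) / width, mode="same")
    peaks = []
    for k in range(1, len(mag_db) - 1):
        if (mag_db[k] > mag_db[k - 1] and mag_db[k] >= mag_db[k + 1]
                and mag_db[k] > smooth[k] + margin_db):
            peaks.append(k)
    return peaks
```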
Steps to sinewave modeling - 4
‘Grow’ tracks by appending newly-found peaks to existing tracks:
[Figure: frequency vs. time; existing tracks extend to matching new peaks, with ‘birth’ and ‘death’ events]
- ambiguous assignments possible (greedy sketch below)
Unclaimed new peak:
- ‘birth’ of a new track
- backtrack to find its earliest trace?
No continuation peak for an existing track:
- ‘death’ of the track
- or: reduce the peak threshold for hysteresis
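A greedy sketch of track growing: each live track claims the nearest unclaimed peak within a frequency tolerance; leftover peaks are births, unmatched tracks are deaths. Real matchers handle the ambiguous-assignment case more carefully (e.g. globally optimal matching); the 50 Hz tolerance is an assumption.

```python
def grow_tracks(tracks, new_peaks, tol_hz=50.0):
    """tracks: list of freq lists (one per track); new_peaks: freqs this frame."""
    unclaimed = list(new_peaks)
    survivors, dead = [], []
    for tr in tracks:
        cands = [f for f in unclaimed if abs(f - tr[-1]) < tol_hz]
        if cands:
            f = min(cands, key=lambda f: abs(f - tr[-1]))
            unclaimed.remove(f)
            survivors.append(tr + [f])         # track continues
        else:
            dead.append(tr)                    # 'death' of track
    births = [[f] for f in unclaimed]          # 'birth' of new tracks
    return survivors + births, dead
```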
Resynthesis of sinewave models
After analysis, each track defines contours in frequency and amplitude, f_k[n] and A_k[n] (+ phase?)
- use them to drive a bank of sinewave oscillators & sum up (sketch below)
[Figure: one track’s A_k[n] and f_k[n] contours and the resulting waveform A_k[n]·cos(2πf_k[n]·t)]
‘Regularize’ to exactly harmonic: f_k[n] = k · f_0[n]
[Figure: harmonic-track spectrogram (0–6000 Hz) and the fitted f_0[n] contour (550–700 Hz)]
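A sketch of resynthesis from frame-rate contours: interpolate each track’s f_k and A_k up to the sample rate, integrate frequency into a running phase, and sum the oscillators. Analysis phase is discarded here, which is one of the simplifications flagged by the ‘(+ phase?)’ above.

```python
import numpy as np

def synth_tracks(freqs, amps, H, sr):
    """freqs, amps: arrays (n_frames, n_tracks) of f_k[n], A_k[n] at hop H."""
    n_frames, K = freqs.shape
    t_frame = np.arange(n_frames) * H            # frame positions in samples
    t = np.arange((n_frames - 1) * H)            # output sample positions
    out = np.zeros(len(t))
    for k in range(K):
        f = np.interp(t, t_frame, freqs[:, k])   # per-sample frequency
        A = np.interp(t, t_frame, amps[:, k])    # per-sample amplitude
        out += A * np.cos(2 * np.pi * np.cumsum(f) / sr)
    return out

# e.g. two steady harmonic tracks over 20 frames at H = 256, sr = 16000:
# freqs = np.outer(np.ones(20), [440.0, 880.0])
# amps = np.full((20, 2), 0.5)
# y = synth_tracks(freqs, amps, 256, 16000)
```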
Modification in sinewave resynthesis
Change duration by warping the timebase (sketch below)
- may want to keep the onset unwarped
[Figure: spectrogram (0–5000 Hz) of a note with its sustain stretched in time]
Change pitch by scaling frequencies
- either stretching or resampling the spectral envelope
[Figure: spectral envelopes (0–4000 Hz), stretched vs. resampled to the new harmonic scale]
Change timbre by interpolating parameters
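One way to implement duration change with an unwarped onset: resample each frame-rate contour onto a stretched timebase while leaving the first few frames untouched, so attacks stay crisp. The 50 ms onset length is an assumption; the resynthesis step is as in the previous sketch.

```python
import numpy as np

def stretch_contour(c, rate, H, sr, onset_s=0.05):
    """Warp frame-rate contour c by `rate` (>1 = longer), keeping the onset."""
    n_on = int(onset_s * sr / H)                 # frames left unwarped
    head, tail = c[:n_on], c[n_on:]
    n_new = int(round(len(tail) * rate))
    t_old = np.linspace(0, 1, len(tail))
    t_new = np.linspace(0, 1, n_new)
    return np.concatenate([head, np.interp(t_new, t_old, tail)])
```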
Sinusoids + residual
Only ‘prominent peaks’ became tracks
- the remainder of the spectral energy was noisy?
→ model residual energy with noise
How to obtain the ‘non-harmonic’ spectrum?
- zero out the spectrum near extracted peaks?
- or: resynthesize (exactly) & subtract waveforms:
  e_s[n] = s[n] − Σ_k A_k[n] cos(2πn · f_k[n])
  .. must preserve phase!
[Figure: dB spectra of the original, the extracted sinusoids, the residual, and its LPC fit]
Can model the residual signal with LPC (sketch below)
→ flexible representation of the noisy residual
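A sketch of the subtract-and-model path: form the residual by subtracting a phase-matched sinusoidal resynthesis, then fit it with LPC so it can be regenerated as filtered white noise. The LP order and the stand-in signals are assumptions for illustration.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc(x, order=12):
    r = np.correlate(x, x, "full")[len(x) - 1:len(x) + order]
    return solve_toeplitz((r[:-1], r[:-1]), r[1:])

sr = 16000
n = np.arange(sr)
s = np.sin(2 * np.pi * 440 * n / sr) + 0.05 * np.random.randn(sr)
s_hat = np.sin(2 * np.pi * 440 * n / sr)     # stand-in phase-matched resynthesis
e = s - s_hat                                # residual: the noisy part

a = lpc(e)                                   # all-pole model of the residual
g = np.sqrt(np.mean(lfilter(np.r_[1, -a], [1], e) ** 2))   # noise gain
noise = g * np.random.randn(len(e))
e_hat = lfilter([1], np.r_[1, -a], noise)    # regenerated noisy residual
```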
Sinusoids + noise + transients
Sound represented as sinusoids and noise:
  s[n] = Σ_k A_k[n] cos(2πn · f_k[n]) + h_n[n] ∗ b[n]
Parameters are A_k[n], f_k[n], h_n[n]
[Figure: spectrograms (0–8000 Hz over 0.6 s) of the sinusoids and the residual e_s[n]]
Separate out abrupt transients in the residual?
  e_s[n] = Σ_k t_k[n] + h_n[n] ∗ b′[n]
- more specific → more flexible
Summary
‘Nonspeech audio’
- i.e. sound in general
- characteristics: ecological
Music synthesis
- control of pitch, duration, loudness, articulation
- evolution of techniques
- sinusoids + noise + transients
Music analysis... and beyond?
References
W. W. Gaver. Synthesizing auditory icons. In Proc. Conference on Human Factors in Computing Systems (INTERCHI ’93), pages 228–235. Addison-Wesley, 1993.
M. Athineos and D. P. W. Ellis. Autoregressive modeling of temporal envelopes. IEEE Trans. Signal Processing, 55(11):5237–5245, 2007. URL http://www.ee.columbia.edu/~dpwe/pubs/AthinE07-fdlp.pdf.
X. Serra and J. Smith III. Spectral Modeling Synthesis: A Sound Analysis/Synthesis
System Based on a Deterministic Plus Stochastic Decomposition. Computer Music
Journal, 14(4):12–24, 1990.
T. S. Verma and T. H. Y. Meng. An analysis/synthesis tool for transient signals that allows a flexible sines + transients + noise model for audio. In Proc. ICASSP, pages VI–3573–3576, Seattle, 1998.