Audio Features

CS498
Today’s lecture
• Audio Features
• How we hear sound
• How we represent sound
– In the context of this class
Why features?
• Features are a very important area
– Bad features make problems unsolvable
– Good features make problems trivial
• Learning how to pick features is the key
– So is understanding what they mean
A simple example
• Compare two numbers:
x,y = {3,3}
x,z = {3,100}
A simple example
• Compare two numbers:
|x − y| = 0
|x − z| = 97
– x,y similar but x,z not so much
• Best way to represent a number is itself!
Moving up a level
• Compare two vectors:
x vs y, and x vs z
[Plots of the three vectors x, y, and z over indices 0–6]
Moving up a level
• Compare two vectors:
∠(x, y) = 0.03 rad
∠(x, z) = 0.7 rad
‖x − y‖ = 0.16
‖x − z‖ = 1.07
– Simply generalizing numbers concept
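The angle and norm comparisons above are easy to compute directly; a minimal numpy sketch, with made-up vectors standing in for the slides' x, y, and z:

```python
import numpy as np

# Hypothetical vectors standing in for the slide's x, y, z
x = np.array([1.0, 0.2, 0.9, 0.1])
y = np.array([1.0, 0.25, 0.85, 0.1])  # close to x
z = np.array([0.1, 1.0, 0.2, 0.9])    # quite different

def angle(a, b):
    """Angle between two vectors, in radians."""
    c = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(c, -1.0, 1.0))

# A small angle and a small norm difference both say "similar"
print(angle(x, y), np.linalg.norm(x - y))
print(angle(x, z), np.linalg.norm(x - z))
```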
Moving up again
• Compare two longer vectors:
[Plots of two longer vectors over indices 0–100]
Look similar but are not!
• Oops! ∠(x, y) = 1.57 rad, ‖x − y‖ = 7.64
[The same two plots as above]
How about this?
• Are these two vectors the same?
[Plots of two waveforms over ~7×10⁴ samples]
– Not if you look at their norm or angle …
Data norms won’t get you far!
• You need to articulate what matters
– You need to know what matters
• Features are the means to do so
• Let’s examine what matters to our ears
– Our bodies sorta know best
Hearing
• Sounds and hearing
• Human hearing aspects
– Physiology and psychology
• Lessons learned
The hardware
(outer/middle ear)
• The pinna (auricle)
– Aids sound collection
– Does directional filtering
– Holds earrings, etc …
• The ear canal
– About 25mm × 7mm
– Amplifies sound at ~3kHz by ~10dB
– Helps clarify a lot of sounds!
• Ear drum
– End of the outer ear, start of the middle ear
– Transmits sound as a vibration to the middle ear
[Diagram: pinna, ear canal, ear drum; outer and middle ear]
More hardware
(inner ear)
• Ear drum (tympanum)
– Excites the ossicles (ear bones)
• Ossicles
– Malleus (hammer), incus (anvil), stapes (stirrup)
– Transfer vibrations from the ear drum to the oval window
– Amplify sound by ~14dB (peak at ~1kHz)
– Muscles connected to the ossicles control the acoustic reflex (damping in the presence of loud sounds)
• The oval window
– Transfers vibrations to the cochlea
• Eustachian tube
– Used for pressure equalization
[Diagram: ear drum, ossicles, oval window, cochlea, auditory nerve, Eustachian tube]
The cochlea
• The "A/D converter"
– Translates oval window vibrations to a neural signal
– Fluid filled, with the basilar membrane in the middle
– Each section of the basilar membrane resonates with a different sound frequency
– Vibrations of the basilar membrane move sections of hair cells, which send off neural signals to the brain
• The cochlea acts like the equalizer display in your stereo
– Frequency domain decomposition
• Neural signals from the hair cells go to the auditory nerve
[Microscope photograph of hair cells (yellow)]
Masking & Critical bands
• When two different sounds excite the same section of the basilar membrane, one is masked
• This is observed at the micro-level
– E.g. two tones at 150Hz and 170Hz: if one tone is loud enough the other will be inaudible
– A tone can also hide a noise band when loud enough
• There are 24 distinct bands throughout the cochlea
– a.k.a. critical bands
– Simultaneous excitation on a band by multiple sources results in a single source percept
• There is also some temporal masking
– Preceding sounds mask what's next
• This is a feature which is taken advantage of by a lot of audio compression
– Throws away stuff you won't hear due to masking
[Figure: masking for close frequency tones vs distant tones]
The neural pathways
• A series of neural stops
• Cochlear nuclei
– Prepping/distribution of neural data from the cochlea
• Superior Olivary Complex
– Coincidence detection across ear signals
– Localization functions
• Inferior Colliculus
– Last place where we have most original data
– Probably initiates first auditory images in the brain
• Medial Geniculate Body
– Relays various sound features (frequency, intensity, etc.) to the auditory cortex
• Auditory Cortex
– Reasoning, recognition, identification, etc.
– High-level processing
[Diagram of the pathway: ears → cochleas → cochlear nuclei → superior olivary complex → inferior colliculus → medial geniculate body → auditory cortex → stream of consciousness]
The limits of hearing
• Frequency
– 20Hz to 20kHz (upper limit decreases with age/trauma)
– Infrasound (< 20Hz) can be felt through skin, also as events
– Ultrasound (> 20kHz) can be "emotionally" perceived (discomfort, nausea, etc.)
• Loudness
– Low limit is 2×10⁻¹⁰ atm
– 0dB SPL to 130dB SPL (but also frequency dependent)
• A dynamic range of 3×10⁶ to 1!
– 130dB SPL is the threshold of pain
– 194dB SPL is the definition of a shock wave; sound stops!
[Figure: regions of audible sounds, speech, and music on an intensity (dB) vs frequency (Hz) plot, from inaudibility up to pain. Tones at various frequencies: how high can you hear?]
Perception of loudness
• Loudness is subjective
– Perceived loudness changes with frequency
– Perception of "twice as loud" is not really that!
– Ditto for equal loudness
• Fletcher-Munson curves
– Equal loudness perception curves through frequencies
• Just noticeable difference is about 1dB SPL
• 1kHz to 5kHz are the loudest heard frequencies
– What the ear canal and ossicles amplify!
• Low limit shifts up with age!
Perception of pitch
• Pitch is another subjective (and arbitrary) measure!
• Perception of pitch doubling doesn't imply doubling of Hz!
– The Mel scale is the perceptual pitch scale
– Twice as many Mels correspond to a perceived pitch doubling
• Musically useful range varies from 30Hz to 4kHz
• Just noticeable difference is about 0.5% of frequency
– Varies with training though!
"Pitch is that attribute of auditory sensation in terms of which sounds may be ordered from low to high"
– American National Standards Institute
Perception of timbre
• Timbre is what distinguishes sounds outside of loudness & pitch
– Another bogus ANSI description
• Timbre is dynamical and can have many facets, which can often include pitch and loudness variations
– E.g. musical instrument identification is guided largely by intensity fluctuations through time
• There is not a coherent body of literature examining human timbre perception
– But there is a huge bibliography on computational timbre perception!
[Figures: Gray's timbre space of musical instruments; examples of successive timbre changes with loudness and pitch held constant]
So how do we use all that?
• All these processes are meaningful
– They encapsulate statistics of sounds
– They suggest features to use
• To make machines that cater to our needs
– We need to learn from our perception
A lesson from the cochlea
• Sounds are not vectors
• Sounds are "frequency ensembles"
• That's the "perceptual feature" we care about
– But how do we get this?
[Figure: a time/frequency picture — like this!]
The “simplest” sound
• Sinusoids are special
– Simplest waveform
– An isolated frequency
• A sinusoid has three parameters
– Frequency, amplitude & phase
• s(t) = a·sin(2πf·t + φ)
• This simplicity makes sinusoids an excellent building block for most time series
[Figure: one period of a sinusoid. Making a square wave with sines]
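The "square wave with sines" idea can be sketched with the classic Fourier series of a square wave, summing scaled odd harmonics:

```python
import numpy as np

t = np.linspace(0, 1, 1000, endpoint=False)
square = np.sign(np.sin(2 * np.pi * t))  # target: one period of a square wave

# Classic Fourier series: square(t) ≈ (4/π) Σ sin(2π(2k+1)t) / (2k+1)
approx = np.zeros_like(t)
for k in range(50):                      # 50 odd harmonics
    n = 2 * k + 1
    approx += (4 / np.pi) * np.sin(2 * np.pi * n * t) / n

# The sum of sinusoids approaches the square wave (small mean squared error)
print(np.mean((approx - square) ** 2))
```

More harmonics make the fit better, though a little overshoot always remains near the jumps (the Gibbs phenomenon).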
Frequency domain representation
• Time series can be decomposed in terms of "sinusoid presence"
– See how many sinusoids you can add up to get a good approximation
– Informally called the spectrum
• No temporal information in this representation, only frequency information
– So a sine with a changing frequency is a smeared spike
• Not that great a representation for dynamically changing sounds
[Figure: three time series and their corresponding spectra]
Time/frequency representation
• Many names/varieties
– Spectrogram, sonogram, periodogram, …
• A time-ordered series of frequency compositions
– Can help show how things move in both time and frequency
• The most useful representation so far!
– Reveals information about the frequency content without sacrificing the time info
[Figure: time series and their time/frequency representations (time on the x-axis, frequency on the y-axis)]
A real example
• Time domain
– We can see the events
– We don't know what they sound like though!
• Frequency domain
– We can see a lot of bass and few middle freqs
– But where in time are they?
• Spectrogram
– We can "see" each individual sound
– And we know what it sounds like!
[Figure: the same recording shown as a waveform, a spectrum, and a spectrogram (frequency up to 8000 Hz over ~1.6 s)]
The Discrete Fourier Transform
• So how do we get from
time domain to frequency
domain?
– It is a matrix multiplication (a
rotation in fact)
• The Fourier matrix is square, orthogonal and has complex-valued elements:

F_{j,k} = (1/√N)·e^{i·jk·2π/N} = (1/√N)·(cos(jk·2π/N) + i·sin(jk·2π/N))

• Multiply a vectorized time series with the Fourier matrix and voilà!
[Figure: the Fourier matrix (real part)]
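The matrix view can be checked directly. A sketch assuming the unitary 1/√N scaling; note that `np.fft.fft` uses the opposite sign convention and no scaling, hence the conjugate and rescaling in the comparison:

```python
import numpy as np

N = 64
j, k = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
F = np.exp(1j * j * k * 2 * np.pi / N) / np.sqrt(N)  # F_{j,k} as above

# Square and orthogonal (unitary): F times its conjugate transpose is I
assert np.allclose(F @ F.conj().T, np.eye(N))

# Multiply a vectorized time series with the Fourier matrix and voila:
x = np.random.randn(N)
X = F @ x
# np.fft.fft uses e^{-i...} and no 1/sqrt(N), so compare via conjugate/scaling
assert np.allclose(np.sqrt(N) * X, np.conj(np.fft.fft(x)))
```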
How does the DFT work?
• Multiplying with the Fourier matrix
– We dot product each Fourier row vector with the input
– If two vectors point the same way their dot product is maximized
• Each Fourier row picks out a single sinusoid from the signal
– In fact a complex sinusoid
– Since all the Fourier sinusoids are orthogonal there is no overlap
• The resulting vector contains how much of each Fourier sinusoid the original vector had in it
The DFT in a little more detail
• The DFT features complex numbers
– Doesn't have to, but it is convenient for other things
• The DFT result for real signals is conjugate symmetric
– The middle value is the highest frequency (Nyquist)
– Working towards the edges we traverse all frequencies downwards
– The two sides are mutually conjugate complex numbers
• The interesting parts of the DFT are the magnitude and the phase
– Abs(F) = ‖F‖
– Arg(F) = ∠F
• To go back we apply the DFT again (with some scaling)
[Figures: real and imaginary parts of the DFT of a sine, and the corresponding magnitude and phase]
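The conjugate symmetry of real-signal DFTs, and the "apply the DFT again" trick, can both be verified in a few lines:

```python
import numpy as np

x = np.random.randn(256)       # a real signal
X = np.fft.fft(x)
N = len(x)

# Conjugate symmetry for real input: X[k] == conj(X[N-k])
assert np.allclose(X[1:], np.conj(X[1:][::-1]))
# The middle (Nyquist) value is real
assert abs(X[N // 2].imag) < 1e-8

# "Apply the DFT again (with some scaling)": fft(fft(x))/N is x time-reversed
assert np.allclose(np.fft.fft(X).real / N, np.roll(x[::-1], 1))
```

Applying the forward transform twice flips the signal in time; the true inverse conjugates as well, which is what `np.fft.ifft` does.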
Size of a DFT
• The bigger the DFT input, the more frequency resolution
– But the more data we need!
• Zero padding helps
– Stuff a lot of zeros at the end of the input to make up for having little data
– But we don't really infuse any more information; we just make prettier plots
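Zero padding is one keyword argument in numpy; the padded spectrum samples the same underlying spectrum more densely, with the peak at the same frequency:

```python
import numpy as np

x = np.sin(2 * np.pi * 5.3 * np.arange(64) / 64)   # only 64 samples of data

X_plain  = np.abs(np.fft.rfft(x))          # 33 frequency bins
X_padded = np.abs(np.fft.rfft(x, n=1024))  # zero-padded to 1024: 513 bins

# Denser frequency grid (prettier plots), but the peak sits at the same
# underlying frequency: no new information was added
print(np.argmax(X_plain) / 64, np.argmax(X_padded) / 1024)
```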
From the DFT to a spectrogram
• The spectrogram is a series of consecutive magnitude DFTs on a signal
– This series is taken off consecutive segments of the input
• It is best to taper the ends of the segments
– This reduces "fake" broadband noise estimates
• It is wise to make the segments overlap
– Due to windowing
• The parameters to use are
– The DFT size
– The overlap amount
– The windowing function
[Figure: input waveform → magnitude of the DFT of every segment (segments can overlap) → … → time series of magnitude spectra, which looks nicer as an image: the spectrogram]
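The whole recipe fits in a few lines; a bare-bones sketch (segment, taper, DFT, stack), with the frame size and hop as the parameters named above:

```python
import numpy as np

def spectrogram(x, n_fft=256, hop=128):
    """Magnitude DFTs of overlapping, tapered segments (a bare-bones STFT)."""
    w = np.hanning(n_fft)                          # the windowing function
    starts = range(0, len(x) - n_fft + 1, hop)     # hop = n_fft - overlap
    frames = np.array([x[s:s + n_fft] * w for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1)).T   # (frequency, time)

sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)        # one second of a 440 Hz tone
S = spectrogram(x)
print(S.shape)                         # (n_fft//2 + 1, number of segments)
```

For a pure tone, every column of `S` has its peak at the bin nearest 440 Hz.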
Why window?
• Discontinuities at ends cause noise
– Start and end point must taper to zero
• Windowing
– Eliminates the sharp edges that cause broadband noise
• Overlap
– Since we have windowed, we need to take overlapping segments to make up for the attenuated parts of the input
[Figures: a segment before and after windowing (nasty sharp edges vs tapered ends), and the resulting spectrograms — the unwindowed version shows non-existent broadband content]
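The broadband leakage from un-tapered edges is easy to demonstrate: take a sinusoid that does not complete a whole number of cycles, and compare the energy far from its true frequency with and without a window:

```python
import numpy as np

n = 1024
x = np.sin(2 * np.pi * 10.5 * np.arange(n) / n)   # 10.5 cycles: ends abruptly

plain    = np.abs(np.fft.rfft(x))                  # nasty sharp edges
windowed = np.abs(np.fft.rfft(x * np.hanning(n)))  # tapered to zero at the ends

# Energy far away from the true 10.5-bin peak: the unwindowed version leaks
# broadband "noise" across the whole spectrum, the windowed one barely does
print(plain[100:].sum(), windowed[100:].sum())
```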
Time/Frequency tradeoff
• Heisenberg's uncertainty principle
– We can't accurately know both the frequency and the time position of a wave
– Also in particle physics, with the speed and position of a particle
• Spectrogram problems
– Big DFTs sacrifice temporal resolution
– Small DFTs have lousy frequency resolution
• We can use a denser overlap to compensate
– Ok solution, not great
The Fast Fourier Transform (FFT)
• The Fourier matrix is special
– Many repeating values
– Unique repeating structure
• We can decompose a Fourier transform into two Fourier transforms of half the size
– Also includes some twiddling with the data
– Two smaller Fourier transforms are faster than one big one
– We keep decomposing until we have a very small DFT
• This results in a really fast algorithm that has driven communications forward!
– The constraint is that the transform size is best if a power of two, so that we can decompose it repeatedly
[Figures: the Fourier matrix for N = 32, and an example FFT for N = 8]
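The halving-and-twiddling decomposition can be sketched as a textbook recursive radix-2 FFT (a sketch for clarity, not an optimized implementation):

```python
import numpy as np

def fft_radix2(x):
    """Recursive radix-2 FFT: decompose into two half-size transforms.
    The input length must be a power of two."""
    N = len(x)
    if N == 1:
        return np.asarray(x, dtype=complex)
    even = fft_radix2(x[0::2])     # half-size DFT of even-indexed samples
    odd  = fft_radix2(x[1::2])     # half-size DFT of odd-indexed samples
    tw = np.exp(-2j * np.pi * np.arange(N // 2) / N)  # the "twiddling"
    return np.concatenate([even + tw * odd, even - tw * odd])

x = np.random.randn(32)
assert np.allclose(fft_radix2(x), np.fft.fft(x))
```

Each level does O(N) work and there are log₂N levels, hence the famous O(N log N) cost instead of the O(N²) matrix multiplication.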
Emulating the cochlea
• Using the time/frequency domain
• Take successive Fourier transforms
• Keep their magnitude
• Stack them in time
• Now you can visually compare sounds!
[Figure: a waveform and its magnitude spectrogram, frequency vs time]
Back to our example
[Figure: the two example waveforms]
Corresponding spectrograms
[Figure: the spectrograms of the two examples, frequency vs time]
A lesson from loudness perception
• We don’t perceive loudness linearly
• How much louder is the second “test”?
• The magnitude we plot should be
logarithmic, not linear
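Converting the magnitude to a log (dB) scale is a one-liner plus some guarding; a minimal sketch (the −80 dB floor and the peak-relative normalization are common choices, not mandated by the slides):

```python
import numpy as np

def to_db(mag, floor_db=-80.0):
    """Log magnitude in dB relative to the peak, clipped to avoid log(0)."""
    mag = np.maximum(np.asarray(mag, dtype=float), 1e-10)
    db = 20.0 * np.log10(mag / mag.max())
    return np.maximum(db, floor_db)

S = np.array([1.0, 0.5, 0.1, 0.001, 0.0])
print(to_db(S))   # peak at 0 dB; quiet bins spread out instead of squashed
```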
Log spectrograms
[Figure: log spectrograms of the two examples]
A lesson from pitch perception
• Frequencies are not “linear”
– Perceived scale is called mel
• Use that spacing instead
– i.e. warp the frequency axis
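One common formula for the mel warping (an assumption here; several variants of the mel scale exist) is m = 2595·log10(1 + f/700):

```python
import numpy as np

# One common mel-scale formula; it anchors 1000 Hz at roughly 1000 mels
def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

# Doubling the mels does not mean doubling the Hz:
print(mel_to_hz(1000.0))   # ≈ 1000 Hz
print(mel_to_hz(2000.0))   # ≈ 3428 Hz, not 2000 Hz
```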
"Mel spectra"
[Figure: log mel spectrograms of the two examples]
One more trick
• Mel cepstra
– Smooth the log mel spectra using one more
frequency transform (the DCT)
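The smoothing step can be sketched as a type-II DCT of one log mel frame, keeping only the low coefficients (the 20-band frame and the 13-coefficient cutoff are illustrative assumptions, though 13 is a common choice):

```python
import numpy as np

def dct2(x):
    """Type-II DCT: one more frequency transform, applied to the log mel spectrum."""
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.cos(np.pi * k * (2 * n + 1) / (2 * N)))
                     for k in range(N)])

# A made-up log mel spectrum frame (assumption: 20 mel bands)
rng = np.random.default_rng(0)
log_mel = np.log1p(np.abs(rng.standard_normal(20)))
cepstrum = dct2(log_mel)[:13]   # keeping low coefficients smooths the spectrum
print(cepstrum.shape)           # (13,)
```

Truncating the DCT keeps the slowly-varying shape of the log mel spectrum and discards fine detail, which is exactly the "smoothing".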
Mel cepstra
[Figure: mel cepstra of the two examples]
Adding some temporal info
• Deltas and delta-deltas
– In sounds order is important
– Using “delta features” we pay attention to change
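One simple delta scheme is a frame-to-frame difference, stacked on top of the original features (a sketch; practical systems often use a regression over several frames instead):

```python
import numpy as np

def deltas(feats):
    """First-order frame-to-frame differences (one simple delta scheme)."""
    padded = np.concatenate([feats[:, :1], feats], axis=1)  # repeat first frame
    return np.diff(padded, axis=1)                          # same shape as input

rng = np.random.default_rng(0)
mfcc = rng.standard_normal((13, 50))   # assumed (coefficients, frames) matrix
d  = deltas(mfcc)                      # how the cepstra change (velocity)
dd = deltas(d)                         # how the change changes (acceleration)
feats = np.vstack([mfcc, d, dd])       # stacked 39-dim feature per frame
print(feats.shape)                     # (39, 50)
```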
[Figure: mel cepstra (coefficient vs time) of the two examples]
What more is there?
• Tons!
– Spectral features
– Waveform features
– Higher level features
– Perceptual parameter features
– …
Sound recap
• Go to time/frequency domain
– We do so in the cochlea
• Frequencies are not linear
– We perceive them in another scale
• Amplitude is not linear either
– Use log scale instead
• Resulting features are used a lot
– Further minor tweaks exist (more later)
Next lecture
• Principal Component Analysis
• How to find features automatically
• How to “compress” data without info loss