Using Sound Source Models
to Organize Mixtures
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/

1. Mixtures and Models
2. Human Sound Organization
3. Machine Sound Organization
4. Ambient Sounds
Sound Source Models - Dan Ellis
2007-05-24 - 1 /30
The Problem of Mixtures
“Imagine two narrow channels dug up from the edge of a
lake, with handkerchiefs stretched across each one.
Looking only at the motion of the handkerchiefs, you are
to answer questions such as: How many boats are there
on the lake and where are they?” (after Bregman’90)
• Received waveform is a mixture
2 sensors, N sources - underconstrained
• Undoing mixtures: hearing’s primary goal?
.. by any means available
Sound Organization Scenarios
• Interactive voice systems
human-level understanding is expected
• Speech prostheses
crowds: #1 complaint of hearing aid users
• Archive analysis
identifying and isolating sound events
[Spectrogram: pa-2004-09-10-131540.wav (dpwe-2004-09-10-13:15:40), freq / kHz vs. time / sec, level / dB]
• Unmixing/remixing/enhancement...
How Can We Separate?
• By between-sensor differences (spatial cues)
‘steer a null’ onto a compact interfering source
the filtering/signal processing paradigm
• By finding a ‘separable representation’
spectral? sources are broadband but sparse
periodicity? maybe – for pitched sounds
something more signal-specific...
• By inference (based on knowledge/models)
acoustic sources are redundant
→ use part to guess the remainder
- limited possible solutions
Separation vs. Inference
• Ideal separation is rarely possible
i.e. no projection can completely remove overlaps
• Overlaps → Ambiguity
scene analysis = find “most reasonable” explanation
• Ambiguity can be expressed probabilistically
i.e. posteriors of sources {Si} given observations X:
P({Si} | X) ∝ P(X | {Si}) · P({Si})
(combination physics × source models)
• Better source models → better inference
.. learn from examples?
A Simple Example
• Source models are codebooks
from separate subspaces
A Slightly Less Simple Example
• Sources with Markov transitions
What is a Source Model?
• Source Model describes signal behavior
encapsulates constraints on form of signal
(any such constraint can be seen as a model...)
• A model has parameters
model + parameters
→ instance
[Source-filter diagram: excitation source g[n] → resonance filter H(e^jω)]
• What is not a source model?
detail not provided in instance
e.g. using phase from original mixture
constraints on interaction between sources
e.g. independence, clustering attributes
Outline
1. Mixtures and Models
2. Human Sound Organization
Auditory Scene Analysis
Using source characteristics
Illusions
3. Machine Sound Organization
4. Ambient Sounds
Auditory Scene Analysis Bregman’90
Darwin & Carlyon’95
• How do people analyze sound mixtures?
break mixture into small elements (in time-freq)
elements are grouped into sources using cues
sources have aggregate attributes
• Grouping rules (Darwin, Carlyon, ...):
cues: common onset/modulation, harmonicity, ...
[Diagram: Sound → Frequency analysis → {Onset map, Harmonicity map, Spatial map} → Elements → Grouping mechanism → Sources → Source properties]
• Also learned “schema” (for speech etc.)
(after Darwin 1996)
Perceiving Sources
• Harmonics distinct in the ear, but perceived as one source (“fused”):
[Figure: harmonics on a freq vs. time plot]
depends on common onset
depends on harmonics
• Experimental techniques
ask subjects “how many”
match attributes e.g. pitch, vowel identity
brain recordings (EEG “mismatch negativity”)
Auditory “Illusions”
• Examples: pulsation threshold, sinewave speech, phonemic restoration
[Spectrograms (frequency vs. time) illustrating each illusion]
• How do we explain illusions?
• Something is providing the missing (illusory) pieces ... source models
Human Speech Separation
Brungart et al.’02
• Task: Coordinate Response Measure
“Ready Baron go to green eight now”
256 variants, 16 speakers
correct = color and number for “Baron”
crm-11737+16515.wav
• Accuracy as a function of spatial separation:
[Plot: A, B same speaker; range effect]
Separation by Vocal Differences
!"#$%&$'()#*"&+$,-"#$'(!$./-#0 Brungart et al.’01
• CRM varying the level and voice character
1232'(4-**2&2#52.
(same spatial
location)
1232'(6-**2&2#52(5$#(72'8(-#(.82257(.20&20$,-"#
energetic
vs. informational masking
more than1-.,2#(*"&(,72(9%-2,2&(,$'/2&:
pitch .. source models
Outline
1. Mixtures and Models
2. Human Sound Organization
3. Machine Sound Organization
Computational Auditory Scene Analysis
Dictionary Source Models
4. Ambient Sounds
Source Model Issues
• Domain
parsimonious expression of constraints
simple combination physics in the chosen domain
• Tractability
size of search space
tricks to speed search/inference
• Acquisition
hand-designed vs. learned
static vs. short-term
• Factorization
independent aspects
hierarchy & specificity
Computational Auditory
Scene Analysis Brown & Cooke’94
Okuno et al.’99
Hu & Wang’04 ...
• Central idea:
Segment time-frequency into sources
based on perceptual grouping cues
[System diagram: input mixture → Front end (signal features / maps: onset, period, frq.mod) → Segment → Object formation (discrete objects) → Group (grouping rules) → Source groups]
... principal cue is harmonicity
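Since harmonicity is the principal cue, here is a minimal sketch of harmonicity-based grouping (bin spacing, pitch, and tolerance are all invented values, not from the systems cited above): given a pitch estimate, select the frequency bins lying near its harmonics as one source's mask.

```python
import numpy as np

def harmonic_mask(f0, freqs, tol=20.0):
    """Binary spectral mask: True for bins within `tol` Hz of any
    harmonic of fundamental f0."""
    harmonics = np.arange(1, int(freqs[-1] // f0) + 1) * f0
    dist = np.abs(freqs[:, None] - harmonics[None, :])
    return dist.min(axis=1) <= tol

freqs = np.linspace(0, 4000, 257)   # e.g. 257 STFT bin centres up to 4 kHz
mask = harmonic_mask(200.0, freqs)  # group bins near harmonics of 200 Hz
print(int(mask.sum()), "of", len(mask), "bins assigned")
```

Applying such a mask per frame (with a per-frame pitch track) gives the time-frequency segmentation used in CASA-style systems.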
CASA limitations
• Limitations of T-F masking
cannot undo overlaps – leaves gaps
• Typically driven by local features
limited model scope ➝ no inference or illusions
• Processing hand-defined, not learned
(figure from Hu & Wang ’04; huwang-v3n7.wav)
Can Models Do CASA?
• Source models can learn harmonicity, onset
... to subsume rules/representations of CASA
[Figure: VQ800 codebook, linear distortion measure; freq / mel band vs. codeword, level / dB]
can capture spatial info too [Pearlmutter & Zador’04]
• Can also capture sequential structure
e.g. consonants follow vowels
... like people do?
• But: need source-specific models
... for every possible source
use model adaptation? [Ozerov et al. 2005]
Separation or Description?
• Are isolated waveforms required?
clearly sufficient, but may not be necessary
not part of perceptual source separation!
• Integrate separation with application?
e.g. speech recognition
[Diagram comparing two routes:
 separation: mix → t-f masking + resynthesis (identify target energy) → ASR (speech models) → words
 vs. identification: mix → find best words model (speech models + source knowledge) → words]
output = abstract description of signal
Dictionary Models
Roweis ’01, ’03
Kristjansson ’04, ’06
• Given models for sources,
find “best” (most likely) states for spectra:
combination model: p(x | i1, i2) = N(x; c_i1 + c_i2, Σ)
inference of source state: {i1(t), i2(t)} = argmax_{i1,i2} p(x(t) | i1, i2)
can include sequential constraints...
different domains for combining c and defining Σ
• E.g. stationary noise:
[Spectrograms (freq / mel bin vs. time / s): original speech; in speech-shaped noise (mel magsnr = 2.41 dB); VQ inferred states (mel magsnr = 3.6 dB)]
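The per-frame inference above can be sketched with toy codebooks (contents invented for illustration). With Σ = I, the argmax of the Gaussian likelihood is just the codeword pair whose sum is closest to the observation in squared error:

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins = 8
codebook_a = rng.normal(size=(4, n_bins))   # source 1 spectral templates
codebook_b = rng.normal(size=(5, n_bins))   # source 2 spectral templates

def infer_states(x, ca, cb):
    """argmax over (i1, i2) of N(x; c_i1 + c_i2, I):
    equivalently the pair minimizing squared error to x."""
    best, best_err = None, np.inf
    for i, a in enumerate(ca):
        for j, b in enumerate(cb):
            err = np.sum((x - (a + b)) ** 2)
            if err < best_err:
                best, best_err = (i, j), err
    return best

# a mixture built from known codewords decodes back to their indices
mix = codebook_a[2] + codebook_b[1]
print(infer_states(mix, codebook_a, codebook_b))   # -> (2, 1)
```

Real systems combine codewords in log or magnitude domains rather than adding templates directly; the exhaustive pair search is what the "tractability" tricks on the earlier slide aim to avoid.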
Speech Recognition Models
• Cooke & Lee Speech Separation Challenge
short, grammatically-constrained utterances:
<command:4><color:4><preposition:4><letter:25><number:10><adverb:4>
e.g. "bin white by R 8 again"
task: report letter+number for “white”
• Decode with Factorial HMM
i.e. two state sequences, one model for each voice
exploit sequence constraints
exploit speaker differences
• IBM “superhuman” system
Kristjansson, Hershey et al. ’06
fewer errors than people for same speaker, level
exploits known speakers, limited grammar
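A hedged sketch of the factorial decode (toy dimensions and likelihoods, not the IBM system): treat each pair of per-voice states as one composite state and run ordinary Viterbi. The composite space has N1·N2 states, which is why tractability tricks matter.

```python
import numpy as np

def factorial_viterbi(loglik, logA1, logA2):
    """Jointly decode two Markov chains: loglik[t, i, j] is
    log p(x_t | s1=i, s2=j); logA1/logA2 are per-chain log
    transition matrices. Returns the MAP pair-state path."""
    T, N1, N2 = loglik.shape
    # independent chains -> composite transition over N1*N2 pair states
    logA = (logA1[:, None, :, None] + logA2[None, :, None, :]).reshape(N1 * N2, N1 * N2)
    ll = loglik.reshape(T, N1 * N2)
    delta = ll[0].copy()
    back = np.zeros((T, N1 * N2), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA       # scores[prev, cur]
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + ll[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    path.reverse()
    return [(p // N2, p % N2) for p in path]

# toy problem: uniform transitions, observations strongly favor one pair
loglik = np.full((3, 2, 2), -10.0)
loglik[0, 0, 1] = loglik[1, 1, 0] = loglik[2, 1, 1] = 0.0
logA = np.log(np.full((2, 2), 0.5))
print(factorial_viterbi(loglik, logA, logA))  # -> [(0, 1), (1, 0), (1, 1)]
```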
Speaker-Adapted (SA) Models
• Factorial HMM needs distinct speakers
use “eigenvoice” speaker space
iterate estimating voice / separating speech
performs midway between speaker-independent (SI) and speaker-dependent (SD)
[Spectrograms (Frequency (kHz) vs. Time (sec)): mixture t32_swil2a_m18_sbar9n; adaptation iterations 1, 3, 5; SD model separation. Accuracy plot comparing SI, SA, SD]
(Ron Weiss)
#*=,/97,
(Pitch) Factored Dictionaries
(Ghandi & Has-John. ’04; Radfar et al. ’06)
• Separate representations for “source” (pitch) and “filter”
N·M codewords from N+M entries
but: overgeneration...
• Faster search
direct extraction of pitches
immediate separation of (most of) spectra
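The N+M → N·M economy of a factored dictionary can be sketched directly (template contents are toy values): excitation and filter log-spectra add, so every (excitation, filter) pair is an effective codeword without being stored.

```python
import numpy as np

rng = np.random.default_rng(3)
n_bins = 16
excitations = rng.normal(size=(6, n_bins))   # N pitch templates
filters = rng.normal(size=(4, n_bins))       # M envelope templates

# log-spectra add: broadcasting yields all N*M combined codewords
codewords = excitations[:, None, :] + filters[None, :, :]
print(codewords.shape)                        # -> (6, 4, 16)
```

Only N+M = 10 templates are stored, yet 24 spectra are representable; the flip side is the overgeneration noted above, since many combinations never occur in real speech.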
Outline
1. Mixtures & Models
2. Human Sound Organization
3. Machine Sound Organization
4. Ambient Sounds
binaural separation
“personal audio” analysis
Binaural Localization by EM
Mike Mandel, NIPS’06
• 2 or 3 sources in reverberation
• Iteratively estimate ILD, IPD
initialize from PHAT ITD histogram
output is soft TF mask
!"#$%&%'()
3"4567%8"59:
;<=8
>(?8%(@A94B"CD
2
1
,
0
/
.
-
-
*+N
*+,
-
-+,
.
!"#$%&%'()
=EFGH;
*+,
-
-+,
.
=EF-GH;%I9@#7%!J
*+,
-
-+,
.
*+1
*+0
=EKGH;%IG>;%46LMJ
*+.
2
*
1
,
0
/
.
*+,
-
-+,
.
*+,
Sound Source Models - Dan Ellis
-
-+,
.
*+,
-
-+,
.
9@D#%&%A
2007-05-24 - 26/30
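The PHAT initialization mentioned above can be sketched as a generalized cross-correlation (synthetic signals and delay; not Mandel's full EM model): whiten the cross-spectrum so only phase remains, and the inverse FFT peaks at the interaural time delay.

```python
import numpy as np

def gcc_phat(left, right):
    """Estimate the delay (in samples) of `left` relative to `right`
    via the phase transform: normalize the cross-spectrum to unit
    magnitude, then the inverse FFT peaks at the time lag."""
    n = len(left) + len(right)
    cross = np.fft.rfft(left, n) * np.conj(np.fft.rfft(right, n))
    cross /= np.abs(cross) + 1e-12           # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n)
    lag = int(np.argmax(np.abs(cc)))
    return lag if lag < n // 2 else lag - n  # wrap negative lags

rng = np.random.default_rng(1)
sig = rng.normal(size=1024)
delayed = np.roll(sig, 5)                    # one "ear" lags by 5 samples
print(gcc_phat(delayed, sig))                # -> 5
```

A histogram of such per-frame delay estimates is what seeds the iterative ILD/IPD estimation.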
“Personal Audio” Archives
• Continuous recordings
with MP3 player
• Segment / cluster “episodes”
.. by statistics of ~10 s segments
.. for browsing interface
Personal Audio Speech Detection
Keansub Lee, Interspeech’06
• Pitch is last speech cue to disappear
noise-robust pitch tracker for voice detection
biggest problem was periodic noise (air conditioning)
[Figure: spectrogram “Personal Audio – Speech + Noise” (freq / kHz vs. time / sec, level / dB) and pitch track (lags) with speaker-active ground truth; ks-noisyspeech.wav]
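The core idea, that voicing shows up as a strong periodicity peak, can be sketched with a toy autocorrelation detector (lag range, threshold, and signals are invented, not the tracker from the paper):

```python
import numpy as np

def voiced(frame, sr=8000, fmin=80, fmax=300, thresh=0.4):
    """True if the normalized autocorrelation has a strong peak in a
    plausible pitch-lag range (sr/fmax .. sr/fmin samples)."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac /= ac[0] + 1e-12                      # normalize: ac[0] = energy
    lo, hi = sr // fmax, sr // fmin
    return ac[lo:hi].max() > thresh

sr = 8000
t = np.arange(2048) / sr
tone = np.sin(2 * np.pi * 150 * t)           # 150 Hz "voiced" frame
noise = np.random.default_rng(2).normal(size=2048)
print(voiced(tone), voiced(noise))           # -> True False
```

A real tracker must also reject the periodic machine noise mentioned above, e.g. by requiring the pitch to move over time.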
Repeating Events in Personal Audio
Jim Ogle, ICASSP’07
• “Unsupervised” feature to help browsing
• Full NxN search is very expensive
use Shazam fingerprint hashes to find repeats
[Figure: spectrogram of a phone ring with Shazam fingerprint landmarks (freq / kHz vs. time / s, level / dB)]
only works for exact repeats (alarms, jingles)
• O(N) scan for repeats
fixed-size hash table
multiple common hashes ➝ confident match
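The O(N) scan above can be sketched with a plain hash table (hash values and times here are invented stand-ins for real landmark fingerprints): each hash votes for the pair of times it occurs at, and multiple common hashes make a confident match.

```python
from collections import defaultdict

def find_repeats(hashes, min_common=3):
    """hashes: list of (hash_value, time). Returns {(t1, t2): count}
    for time pairs sharing at least `min_common` hashes."""
    table = defaultdict(list)
    for h, t in hashes:                      # single O(N) pass to bucket
        table[h].append(t)
    votes = defaultdict(int)
    for times in table.values():             # buckets are tiny in practice
        for i, t1 in enumerate(times):
            for t2 in times[i + 1:]:
                votes[(t1, t2)] += 1
    return {pair: n for pair, n in votes.items() if n >= min_common}

# a phone ring at t=10 repeating at t=90 shares four landmark hashes
hashes = [(1, 10), (2, 10), (3, 10), (4, 10),
          (1, 90), (2, 90), (3, 90), (4, 90),
          (5, 40)]                           # unrelated one-off event
print(find_repeats(hashes))                  # -> {(10, 90): 4}
```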
Summary & Conclusions
• Listeners do well separating sound mixtures
using signal cues (location, periodicity)
using source-property variations
• Machines do less well
difficult to apply enough constraints
need to exploit signal detail
• Models capture constraints
learn from the real world
adapt to sources
• Separation feasible only sometimes
describing source properties is easier