Sound Organization by Source Models in Humans and Machines

Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/

1. Mixtures and Models
2. Human Sound Organization
3. Machine Sound Organization
4. Research Questions
The Problem of Mixtures
“Imagine two narrow channels dug up from the edge of a
lake, with handkerchiefs stretched across each one.
Looking only at the motion of the handkerchiefs, you are
to answer questions such as: How many boats are there
on the lake and where are they?” (after Bregman’90)
• Received waveform is a mixture
2 sensors, N sources - underconstrained (spelled out in the note below)
• Undoing mixtures: hearing’s primary goal?
.. by any means available
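To spell out “underconstrained” (standard mixing notation, my addition, not from the slide): with mixing gains a_ji, each sensor j receives

x_j(t) = Σ_{i=1..N} a_ji · s_i(t),   j = 1, 2

Two equations per instant against N unknown sources: for N > 2 no linear unmixing can recover every s_i, so the extra constraints must come from the sources themselves.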
Sound Organization Scenarios
• Interactive voice systems
human-level understanding is expected
• Speech prostheses
crowds: #1 complaint of hearing aid users
• Archive analysis
identifying and isolating sound events
[Figure: spectrogram of a personal-audio recording, dpwe-2004-09-10-13:15:40 (freq / kHz vs. time / sec, level / dB); audio: pa-2004-09-10-131540.wav]
• Unmixing/remixing/enhancement...
How Can We Separate?
• By between-sensor differences (spatial cues)
‘steer a null’ onto a compact interfering source
the filtering/signal processing paradigm
• By finding a ‘separable representation’
spectral? sources are broadband but sparse
periodicity? maybe – for pitched sounds
something more signal-specific...
• By inference (based on knowledge/models)
acoustic sources are redundant
→ use part to guess the remainder
- limited possible solutions
Separation vs. Inference
• Ideal separation is rarely possible
i.e. no projection can completely remove overlaps
• Overlaps → Ambiguity
scene analysis = find “most reasonable” explanation
• Ambiguity can be expressed probabilistically
i.e. posteriors of sources {S_i} given observations X:
P({S_i} | X) ∝ P(X | {S_i}) · P({S_i})
(combination physics) (source models)
• Better source models → better inference
.. learn from examples?
A Simple Example
• Source models are codebooks
from separate subspaces
[Figure: codebook example - one codebook per source, each drawn from a separate subspace; a mixture frame is explained by the best-fitting pair of codewords]
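A minimal sketch of this codebook search (toy Python of mine, not the talk's implementation). With a uniform prior over codewords, the posterior of the previous slide reduces to a likelihood search over codeword pairs under p(x | i1, i2) = N(x; c_i1 + c_i2, σ²I):

```python
import numpy as np

def best_pair(x, C1, C2, sigma=1.0):
    """MAP codeword pair for one mixture frame.
    x: (D,) observed spectrum; C1: (N1, D), C2: (N2, D) codebooks."""
    combo = C1[:, None, :] + C2[None, :, :]            # all (N1, N2) codeword sums
    ll = -0.5 * np.sum((x - combo) ** 2, axis=-1) / sigma**2
    return np.unravel_index(np.argmax(ll), ll.shape)   # (i1, i2)

# Toy usage: recover which codewords built a noisy mixture
rng = np.random.default_rng(0)
C1, C2 = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
x = C1[2] + C2[1] + 0.05 * rng.normal(size=8)
print(best_pair(x, C1, C2))   # -> (2, 1) with high probability
```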
A Slightly Less Simple Example
• Sources with Markov transitions
[Figure: the same codebook picture with state-transition links between codewords; inference must now track joint state sequences over time]
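Under the same toy assumptions, Markov transitions turn the per-frame search into a joint Viterbi over the product state space; cost grows with the N1 × N2 joint states, which is exactly why later slides care about faster search. A sketch:

```python
import numpy as np

def best_pair_path(X, C1, C2, T1, T2, sigma=1.0):
    """Joint Viterbi over two Markov codebook sources (toy sketch).
    X: (T, D) mixture spectra; T1: (N1, N1), T2: (N2, N2) row-stochastic
    transition matrices over each source's codewords."""
    N1, N2 = len(C1), len(C2)
    combo = C1[:, None, :] + C2[None, :, :]                       # (N1, N2, D)
    ll = -0.5 * np.sum((X[:, None, None, :] - combo) ** 2, -1) / sigma**2
    # log-transition from joint state (i1, i2) to (j1, j2)
    logT = np.log(T1)[:, None, :, None] + np.log(T2)[None, :, None, :]
    score, back = ll[0].copy(), []
    for t in range(1, len(X)):
        cand = (score[:, :, None, None] + logT).reshape(N1 * N2, N1, N2)
        back.append(cand.argmax(axis=0))        # best predecessor per joint state
        score = cand.max(axis=0) + ll[t]
    path = [np.unravel_index(score.argmax(), score.shape)]
    for bp in reversed(back):                   # trace the best path backwards
        path.append(np.unravel_index(bp[path[-1]], (N1, N2)))
    return path[::-1]                           # [(i1, i2) for each frame]
```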

What is a Source Model?
• Source Model describes signal behavior
encapsulates constraints on form of signal
(any such constraint can be seen as a model...)
• A model has parameters
model + parameters → instance
[Figure: excitation source g[n] driving a resonance filter H(e^jω); a concrete sketch follows below]
• What is not a source model?
detail not provided in instance
e.g. using phase from original mixture
constraints on interaction between sources
e.g. independence, clustering attributes
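A concrete instance of “model + parameters → instance”, sketched in Python (all parameter values are illustrative, not from the talk): an impulse-train excitation g[n] passed through a single-resonance filter H(e^jω):

```python
import numpy as np
from scipy.signal import lfilter

fs, f0 = 8000, 100                  # sample rate and pitch, Hz (made-up parameters)
g = np.zeros(fs)                    # excitation source g[n]:
g[::fs // f0] = 1.0                 # ... an impulse train at the pitch period

r, fc = 0.97, 500.0                 # resonance filter H(e^jw): one pole pair at 500 Hz
a = [1.0, -2 * r * np.cos(2 * np.pi * fc / fs), r ** 2]
x = lfilter([1.0], a, g)            # the instance: a concrete voiced-like signal
```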
Outline
1. Mixtures and Models
2. Human Sound Organization
Auditory Scene Analysis
Using source characteristics
Illusions
3. Machine Sound Organization
4. Research Questions
Auditory Scene Analysis
Bregman ’90; Darwin & Carlyon ’95
• How do people analyze sound mixtures?
break mixture into small elements (in time-freq)
elements are grouped into sources using cues
sources have aggregate attributes
• Grouping rules (Darwin, Carlyon, ...):
cues: common onset/modulation, harmonicity, ...
[Figure, after Darwin ’96: Sound → Frequency analysis → Elements; Onset map, Harmonicity map, and Spatial map feed a Grouping mechanism → Sources with aggregate properties]
• Also learned “schema” (for speech etc.)
Perceiving Sources
• Harmonics distinct in ear, but perceived as one source (“fused”):
depends on common onset
depends on harmonics
[Figure: harmonic partials, freq vs. time]
• Experimental techniques
ask subjects “how many”
match attributes e.g. pitch, vowel identity
brain recordings (EEG “mismatch negativity”)
Auditory “Illusions”
• How do we explain illusions?
pulsation threshold
sinewave speech
phonemic restoration
[Figure: three spectrograms (frequency vs. time) illustrating each effect]
• Something is providing the missing (illusory) pieces ... source models
Human Speech Separation
• Task: Coordinate Response Measure (Brungart et al. ’02)
“Ready Baron go to green eight now”
256 variants, 16 speakers
correct = color and number for “Baron”
crm-11737+16515.wav
• Accuracy as a function of spatial separation:
[Figure: response accuracy vs. spatial configuration, A and B the same speaker; note the range effect]
Separation by Vocal Differences
!"#$%&$'()#*"&+$,-"#$'(!$./-#0 Brungart et al.’01
• CRM varying the level and voice character
1232'(4-**2&2#52.
(same spatial
location)
1232'(6-**2&2#52(5$#(72'8(-#(.82257(.20&20$,-"#
energetic
vs. informational masking
more than1-.,2#(*"&(,72(9%-2,2&(,$'/2&:
pitch .. source models
Outline
1. Mixtures and Models
2. Human Sound Organization
3. Machine Sound Organization
Computational Auditory Scene Analysis
Dictionary Source Models
4. Research Questions
Source Model Issues
• Domain
parsimonious expression of constraints
simple combination physics in the chosen domain
• Tractability
size of search space
tricks to speed search/inference
• Acquisition
hand-designed vs. learned
static vs. short-term
• Factorization
independent aspects
hierarchy & specificity
Computational Auditory Scene Analysis
Brown & Cooke ’94; Okuno et al. ’99; Hu & Wang ’04 ...
• Central idea:
Segment time-frequency into sources
based on perceptual grouping cues
[Figure: CASA pipeline - input mixture → front end (signal features / maps: onset, period, frequency modulation) → object formation (discrete objects) → grouping rules → source groups]
... principal cue is harmonicity
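As one crude, concrete version of the harmonicity cue (my own sketch, not any particular CASA system): mark spectrogram bins lying near harmonics of a per-frame pitch track, giving the kind of time-frequency map the grouping stage consumes:

```python
import numpy as np

def harmonic_mask(freqs, f0_track, tol=0.03):
    """freqs: (F,) bin center frequencies in Hz;
    f0_track: (T,) per-frame pitch estimates in Hz (0 = unvoiced).
    Returns a (T, F) boolean time-frequency mask."""
    mask = np.zeros((len(f0_track), len(freqs)), dtype=bool)
    for t, f0 in enumerate(f0_track):
        if f0 <= 0:
            continue                                    # unvoiced frame: no harmonics
        k = np.round(freqs / f0)                        # nearest harmonic number per bin
        deviation = np.abs(freqs - k * f0) / f0         # relative mistuning
        mask[t] = (k >= 1) & (deviation < tol)
    return mask
```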
CASA limitations
• Limitations of T-F masking
cannot undo overlaps – leaves gaps
[Figure: separation example from Hu & Wang ’04; audio: huwang-v3n7.wav]
• Driven by local features
limited model scope ➝ no inference or illusions
• Does not learn from data
Basic Dictionary Models
Roweis ’01, ’03; Kristjansson ’04, ’06
• Given models for sources,
find “best” (most likely) states for spectra:
combination model:  p(x | i1, i2) = N(x; c_i1 + c_i2, Σ)
inference of source state:  {i1(t), i2(t)} = argmax_{i1, i2} p(x(t) | i1, i2)
can include sequential constraints...
different domains for combining c and defining Σ (one option sketched below)
• E.g. stationary noise:
[Figure: mel spectrograms (freq / mel bin vs. time / s): original speech; speech in speech-shaped noise; VQ-inferred states]
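On the “different domains” point: a common choice in this literature (an assumption here, not a detail from the slide) is the log-max approximation, where log-spectra combine roughly by an elementwise max because the louder source dominates each bin:

```python
import numpy as np

def logmax_score(x_log, c1_log, c2_log, sigma=1.0):
    """Score one codeword pair in the log-spectral domain."""
    combo = np.maximum(c1_log, c2_log)     # approx. log of the mixed power spectrum
    return -0.5 * np.sum((x_log - combo) ** 2) / sigma**2
```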
Deeper Models: Iriquois
Kristjansson, Hershey et al. ’06
• Optimal inference on mixed spectra
speaker-specific models (e.g. 512 mix GMM)
Algonquin inference
[Embedded paper page (Kristjansson & Hershey et al., ASRU 2003): a high-frequency-resolution speech model in the log-spectrum domain exploits the harmonic structure of voiced speech, reconstructing the signal from estimated clean-signal posteriors plus the phases of the noisy signal; reported gains: 8.38 dB SNR for enhancement at 0 dB, and on Aurora 2 at 0 dB SNR a 43.75% relative word error rate reduction over the baseline (15.90% over the equivalent low-resolution algorithm)]
[Figure: noisy vector, clean vector, and clean estimate (energy / dB vs. frequency / Hz) with one-standard-deviation uncertainty shading; uncertainty is larger in the valleys between harmonic peaks, reflecting the lower SNR there]
• .. for Speech Separation Challenge (Cooke/Lee ’06)
exploit grammar constraints - higher-level dynamics
[Figure: word error rate (%) at SNRs from 6 dB to -9 dB, same-gender speakers: SDL recognizer with no dynamics, acoustic dynamics, and grammar dynamics, compared to human listeners]
#"2,4/*M*-"*#%-35*/"8&3,*9,%-8&,/*X
!"#$%&'()*+,)&,)%-'"(*.%/0/
M ! = argmax P " M X # = argmax P " X M # !
M
M
+-%(2%&2*34%//'!3%-'"(*35""/,/*6,-7,,(*
Faster1 Search:
Fragment
Decoder
#"2,4/*M*-"*#%-35*/"8&3,*9,%-8&,/*X
1 .':-8&,/;*"6/,&<,2*9,%-8&,/*Y=*/,)&,)
•
et al. ’05
P " MBarker
#
!
------------M
= argmax
P " M %44*&,4%-,2*6>*P(
X # = argmax P " X MX# |!Y,S
)*;
Match
‘uncorrupt’
P" X #
M
M
spectrum
to
ASR
1 .':-8&,/;*"6/,&<,2*9,%-8&,/*Y=*/,)&,)%-'"(*S=*Observation
models
using
Y(f )
%44*&,4%-,2*6>*P(
X | Y,S )*;
missing data
Observation
Source
Y(f )
recognition
X(f )
easy if you know the
segregation mask S
Source
Segregation S
X(f )
freq
M and segregation S
1 model
?"'(-*34%//'!3%-'"(*"9*#"2,4*%(2*/,)&
• Joint search for
Segregation S
freq
1 ?"'(-*34%//'!3%-'"(*"9*#"2,4*%(2*/,)&,)%-'"(;
to maximize:
P" X Y $ S #
P
"
X
Y
$
S
#
- dX !
S# YP#" X=M P
"-----------------------M # % P "-XdXM
#" S! -----------------------P " M $ S Y #P=" M
P "$M
#
!
!
P
Y
#
%
P" X #
P" X #
Isolated Source Model
Segregation Model
<$P(X)$#*$&*#521$=*#(0"#0$>
<$P(X)$#*$&*#521$=*#(0"#0$>
Sound
Organization by Models
- Dan Ellis
!"#$%&&'(
)*+#,-$.'/0+12($3$42"1#'#5
!"#$%&&'(
2006-12-09 - 21/29
67789::976$9$::$;$68
)*+#,-$.'/0+12($3$42"1#'#5
67789::976$
Source
X(f )
Segregation S
freq
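A sketch of the missing-data observation score behind “easy if you know the segregation mask S” (assuming, for illustration, a diagonal-Gaussian state model with bounded marginalization of the masked bins):

```python
import numpy as np
from scipy.stats import norm

def missing_data_loglik(y, mask, mu, var):
    """y: (F,) observed log-spectrum; mask: (F,) True where the target dominates;
    mu, var: (F,) mean/variance of one acoustic-model state."""
    sd = np.sqrt(var)
    # reliable bins: score the observation directly
    reliable = norm.logpdf(y[mask], mu[mask], sd[mask]).sum()
    # masked bins: the target is only known to lie below the mixture level,
    # so integrate the Gaussian up to that bound (counter-evidence)
    bounded = norm.logcdf(y[~mask], mu[~mask], sd[~mask]).sum()
    return reliable + bounded
```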
CASA in the Fragment Decoder
• Using CASA features
P(S | Y) links acoustic information to segregation:
is this segregation worth considering? how likely is it?
• CASA-style local features say which frequency bands belong together:
periodicity / harmonicity
onset / continuity
• CASA can help search
consider only segregations made from CASA chunks
a time-frequency region must be whole
• CASA can rate segregation
construct P(S | Y) to reward CASA qualities:
Frequency Proximity, Common Onset, Harmonicity (Pitch)
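A toy P(S | Y)-style scorer of my own devising, just to make “reward CASA qualities” concrete: it credits a candidate mask for agreeing with a harmonicity map and for concentrating its onset energy at a common time:

```python
import numpy as np

def segregation_score(mask, harm_mask, onset_strength):
    """mask, harm_mask: (T, F) boolean; onset_strength: (T, F) nonnegative."""
    # harmonicity: fraction of the mask that lies on harmonic bins
    harmonicity = (mask & harm_mask).sum() / max(mask.sum(), 1)
    # common onset: how concentrated the mask's onset energy is in one frame
    per_frame = (onset_strength * mask).sum(axis=1)
    common_onset = per_frame.max() / max(per_frame.sum(), 1e-9)
    return np.log(harmonicity + 1e-9) + np.log(common_onset + 1e-9)
```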
Factored Dictionaries
Gandhi & Hasegawa-Johnson ’04; Radfar et al. ’06
• Separate representations for “source” (pitch) and “filter”
N×M codewords from N+M entries
but: overgeneration...
• Faster search
direct extraction of pitches
immediate separation of
(most of) spectra
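The N×M economy in miniature (sizes and names here are mine, for illustration): store N excitation log-spectra and M envelope log-spectra; filtering multiplies spectra, so log-spectra simply add:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, D = 20, 30, 64
excitations = rng.normal(size=(N, D))   # stand-ins for harmonic combs per pitch
envelopes = rng.normal(size=(M, D))     # stand-ins for smooth formant envelopes

def codeword(i, j):
    # source-filter combination: multiplication of spectra = addition of log-spectra
    return excitations[i] + envelopes[j]

# 20 + 30 = 50 stored entries stand in for 20 * 30 = 600 combined codewords -
# at the cost of also generating combinations no real source would produce.
```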
Discriminant Models for Music
Poliner & Ellis ’06
• Transcribe piano recordings by classification
train SVM detectors for every piano note
88 separate detectors, independent smoothing
• Trained on player piano recordings
[Figure: per-note classifier outputs for Bach 847 on a Disklavier - freq / pitch (A1-A6) vs. time / sec, level / dB]
• Can resynthesize from transcript...
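A minimal sketch of the classification setup described above (my own toy, not the Poliner & Ellis code): one independent binary detector per note, trained on labeled spectral frames:

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_note_detectors(X, Y):
    """X: (n_frames, n_bins) spectral features; Y: (n_frames, 88) 0/1 note labels.
    Assumes each note both sounds and is absent somewhere in the training data."""
    return [LinearSVC(C=1.0).fit(X, Y[:, note]) for note in range(88)]

def transcribe(detectors, X):
    # (n_frames, 88) binary piano-roll; the real system also smooths each
    # note's output independently over time
    return np.stack([clf.predict(X) for clf in detectors], axis=1)
```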
Piano Transcription Results
• Significant improvement from classifier
Table 1: Frame-level transcription results.
Algorithm            Errs    False Pos   False Neg   d'
SVM                  43.3%   27.9%       15.4%       3.44
Klapuri & Ryynänen   66.6%   28.1%       38.5%       2.71
Marolt               84.6%   36.5%       48.1%       2.35
• Breakdown by frame type:
[Figure: classification error % (false positives and false negatives) vs. number of notes present]
• Overall accuracy is a frame-level version of the metric proposed by Dixon [Dixon, 2000]:
Acc = N / (FP + FN + N)
where N is the number of correctly transcribed frames, FP is the number of unvoiced frames transcribed as voiced, and FN is the number of voiced frames transcribed as unvoiced.
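The accuracy metric as a one-liner (the counts in the example call are made up):

```python
def frame_accuracy(n_correct, false_pos, false_neg):
    """Frame-level accuracy, Acc = N / (FP + FN + N), after Dixon (2000)."""
    return n_correct / (false_pos + false_neg + n_correct)

print(frame_accuracy(700, 150, 150))   # -> 0.7
```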
Outline
1. Mixtures & Models
2. Human Sound Organization
3. Machine Sound Organization
4. Research Questions
Task and Evaluation
Generic vs. Specific
Task & Evaluation
• How to measure separation performance?
depends what you are trying to do
• SNR?
energy (and distortions) are not created equal
different nonlinear components [Vincent et al. ’06]
rare for nonlinear processing
to improve intelligibility
listening tests expensive
• ASR performance?
Net
effect
Transmission
errors
• Human Intelligibility?
Increased
artefacts
Reduced
interference
optimum
Agressiveness
of processing
separate-then-recognize too simplistic;
ASR needs to accommodate separation
Sound Organization by Models - Dan Ellis
2006-12-09 - 27/29
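For reference, the single-number score the slide cautions about; a plain SNR treats all residual energy alike, whereas BSS_EVAL-style metrics [Vincent et al. ’06] split it into interference and artefact components:

```python
import numpy as np

def snr_db(reference, estimate):
    """SNR in dB between a reference source signal and its estimate."""
    noise = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))
```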
How Many Models?
• More specific models ➝ better separation
need individual dictionaries for “everything”??
• Model adaptation and hierarchy
speaker-adapted models: base + parameters
extrapolation beyond the domain of normal speakers (Smith, Patterson et al. ’05)
[Figure: vowel categories /a/ /e/ /i/ /o/ /u/ over mean pitch / Hz vs. VT length ratio, with the domain of normal speakers marked]
generic ➝ specific: pitched ➝ piano ➝ this piano
• Time scales of model acquisition
innate/evolutionary (hair-cell tuning)
developmental (mother-tongue phones)
dynamic - the “slung mugs” effect; Ozerov
Summary & Conclusions
• Listeners do well separating sound mixtures
using signal cues (location, periodicity)
using source-property variations
• Machines do less well
difficult to apply enough constraints
need to exploit signal detail
• Models capture constraints
learn from the real world
adapt to sources
• Separation feasible only sometimes
describing source properties is easier