Model-Based Scene Analysis

Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/

1. Separation and Inference
2. Model-based Separation
3. Speech Fragment Decoder
1. Separation and Inference
• Full separation requires a “separable dimension”
  e.g. spatial filtering
  but for a single channel, overlap is inevitable
• Signal knowledge provides extra constraints
  .. for inference of missing parts
• Separation vs. recognition
  separation is sufficient .. but too hard
  recognition is easier .. but too coarse
  in between: class plus parameters
  adequate for resynthesis?
Pattern Recognition Perspective

• Inferring the source signal set {s_i}
  from the mixture signal x:

    {ŝ_i} = argmax_{s_i} p(x | {s_i}) · ∏_i p(s_i | M_i)

  p(x | {s_i}) gives the physics of combination (sum)
  p(s_i | M_i) limits which s_i to consider
• How to acquire/evaluate p(s_i | M_i)?
  generalize observations of solo sources
• How to search {s_i}?
  full joint space? clever pruning tricks?
  (a brute-force sketch follows below)
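To make the search concrete, here is a minimal sketch under my own assumptions (the function name, Gaussian observation model, and codebook representation are illustrative, not from the slides): each model M_i is reduced to a codebook of candidate signals with per-codeword log-priors, and every joint combination is scored.

```python
# Minimal sketch (hypothetical names): brute-force MAP search over the
# joint space of per-source codebooks standing in for the models M_i.
import numpy as np
from itertools import product

def map_separate(x, codebooks, prior_logp, obs_var=1.0):
    """codebooks[i]: (N_i, D) candidate signals s_i for source i;
    prior_logp[i]: (N_i,) log p(s_i | M_i) for each candidate."""
    best, best_score = None, -np.inf
    for idx in product(*[range(len(cb)) for cb in codebooks]):
        s = [cb[i] for cb, i in zip(codebooks, idx)]
        # log p(x | {s_i}): the physics of combination is a simple sum,
        # modeled here as Gaussian noise around sum_i s_i
        ll = -0.5 * np.sum((x - sum(s)) ** 2) / obs_var
        # log prod_i p(s_i | M_i): per-source priors limit which s_i count
        lp = sum(pl[i] for pl, i in zip(prior_logp, idx))
        if ll + lp > best_score:
            best_score, best = ll + lp, s
    return best, best_score
```

The cost grows as the product of the codebook sizes, which is exactly the “full joint space” problem; pruning tricks (beam search, per-band approximations) attack that product.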
Factorial HMM - Toy Example
• Two sources with the same underlying model

  [Figure: source state means (feature dimension vs. state); the observed
  sequence is the sum of both sources (feature dimension vs. time);
  inferred state sequences for Source 1 and Source 2 (state vs. time)]

  sequence constraints can disambiguate
  identical emissions
  (a joint-Viterbi sketch follows below)
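A minimal sketch of the idea, assuming two sources share one Gaussian-emission HMM and the observation is the sum of their state means plus noise (the function name and toy observation model are mine): Viterbi decoding over joint state pairs (i, j) lets the transition structure resolve frames whose summed emissions are ambiguous.

```python
import numpy as np

def factorial_viterbi(obs, mu, logA, log_pi, obs_var=1.0):
    """Joint Viterbi over state pairs (i, j) for two sources sharing one
    model. obs: (T,) summed observations; mu: (S,) state emission means;
    logA: (S, S) log transition matrix; log_pi: (S,) log initial probs."""
    S = len(mu)
    # log-likelihood of each summed emission mu[i] + mu[j]
    ll = lambda y: -0.5 * (y - (mu[:, None] + mu[None, :])) ** 2 / obs_var
    delta = log_pi[:, None] + log_pi[None, :] + ll(obs[0])   # (S, S)
    back = []
    for y in obs[1:]:
        # trans[i, j, i2, j2] = delta[i, j] + logA[i, i2] + logA[j, j2]
        trans = (delta[:, :, None, None]
                 + logA[:, None, :, None] + logA[None, :, None, :])
        flat = trans.reshape(S * S, S, S)   # collapse previous pair (i, j)
        back.append(flat.argmax(axis=0))
        delta = flat.max(axis=0) + ll(y)
    # trace back the best joint state sequence
    i, j = np.unravel_index(delta.argmax(), (S, S))
    path = [(i, j)]
    for bp in reversed(back):
        i, j = np.unravel_index(bp[i, j], (S, S))
        path.append((i, j))
    return path[::-1]
```

With identical sources the per-frame likelihood is symmetric in (i, j); only the transition terms break the tie, which is the slide’s point about sequence constraints disambiguating identical emissions.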
Disambiguating with Knowledge (Roweis ’03)

• Use strength of match to models as a
  “reasonableness” measure for control
• e.g. MAXVQ:
  learn a dictionary of spectrogram slices
  find the ones that ‘fit’
  - or the max() of a combination...
  ...then filter out excess energy
  (a per-frame sketch follows below)

  [Figure: “MAXVQ Results: Denoising” - spectrograms (frequency vs. time)
  of noise-corrupted speech and the matching templates; from Sam Roweis’s
  Montreal 2003 presentation]

  Training: 300 sec of isolated speech (TIMIT) to fit 512 codewords, and
  100 sec of isolated noise (NOISEX) to fit 32 codewords; testing on new
  signals at 0 dB SNR
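A per-frame sketch under stated assumptions: log-magnitude spectra combine roughly as an elementwise max, the best (speech, noise) template pair is found by brute force, and “excess energy” is filtered by keeping only speech-dominated cells. The names and the hard binary mask are my illustration; Roweis’s actual formulation differs in detail.

```python
import numpy as np

def maxvq_mask(Y, speech_cb, noise_cb):
    """Y: (F, T) log-magnitude spectrogram; codebooks: (N, F) templates.
    Returns a binary T-F mask of speech-dominated cells."""
    F, T = Y.shape
    mask = np.zeros((F, T), dtype=bool)
    for t in range(T):
        y = Y[:, t]
        # error of every (speech, noise) pair under max-combination:
        # log|a + b| ~= max(log|a|, log|b|) for well-separated components
        err = ((y[None, None, :]
                - np.maximum(speech_cb[:, None, :], noise_cb[None, :, :]))
               ** 2).sum(axis=2)
        i, j = np.unravel_index(err.argmin(), err.shape)
        # keep cells the speech template explains at least as well
        mask[:, t] = speech_cb[i] >= noise_cb[j]
    return mask   # apply to the STFT to "filter out excess energy"
```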
Full Mixture Inference
(Kristjansson, Attias, Hershey ’04)

• Can model combination of magnitude
  spectra with a stochastic model
  phase cancellation as noise...
• Precise inference of components
  by iterative linearization
• Works well
  (for small domains?)
  (a one-band sketch of the linearization follows below)

  [Embedded: first page of the paper. The abstract reports a high
  frequency-resolution speech model in the log-spectrum domain that
  exploits the harmonic structure of voiced speech: an 8.38 dB SNR gain
  enhancing speech at 0 dB, and on the Aurora 2 data-set at 0 dB SNR a
  43.75% relative word error rate reduction over the baseline (15.90%
  over the equivalent low-resolution algorithm). Fig. 1 shows a noisy
  input vector, the corresponding clean vector, and the clean-speech
  estimate with one-standard-deviation uncertainty shading; the
  uncertainty is larger in the valleys between harmonic peaks,
  reflecting the lower SNR in those regions.]
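A one-band sketch of iterative linearization under my own simplifications (scalar Gaussians on speech and noise log-power; the paper’s actual model is high-resolution and treats phase explicitly): the nonlinear mixing y ≈ log(e^s + e^n) is relinearized around the current estimate, giving a closed-form Gauss-Newton update at each iteration.

```python
import numpy as np

def iterative_linearization(y, ms, vs, mn, vn, obs_var=0.01, iters=10):
    """MAP estimate of speech s and noise n log-power from their noisy
    log-sum y, with Gaussian priors N(ms, vs) and N(mn, vn)."""
    s, n = ms, mn                                  # start from prior means
    for _ in range(iters):
        a = np.exp(s) / (np.exp(s) + np.exp(n))    # d g/d s at (s, n)
        g0 = np.log(np.exp(s) + np.exp(n))
        r = y - g0          # residual of linearized model y ~ g0 + a*ds + (1-a)*dn
        # 2x2 normal equations for the (ds, dn) update, priors included
        H = np.array([[a * a / obs_var + 1 / vs,  a * (1 - a) / obs_var],
                      [a * (1 - a) / obs_var,  (1 - a) ** 2 / obs_var + 1 / vn]])
        b = np.array([a * r / obs_var + (ms - s) / vs,
                      (1 - a) * r / obs_var + (mn - n) / vn])
        ds, dn = np.linalg.solve(H, b)
        s, n = s + ds, n + dn
    return s, n   # MAP estimates of clean speech and noise log-power
```

Each pass refits the local linearization around the new estimate, which is why the method stays tractable only for modestly sized (“small domain”) models.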
2. Missing Data Recognition
• Speech models p(x|M) are multidimensional...
  i.e. means, variances for every frequency channel
  need values for all dimensions to get p(•)
• But: can evaluate over a subset
  of dimensions x_k:

    p(x_k | M) = ∫ p(x_k, x_u | M) dx_u

• Hence, missing data recognition:

    P(x | q) = P(x1 | q) · P(x2 | q) · ... · P(x6 | q)
    (the product runs over the present dimensions only)

  [Figure: time-frequency mask of present data; the full marginal p(x_k)
  vs. the bounded marginal p(x_k | x_u < y) for present and masked
  dimensions]

  hard part is finding the mask (segregation)
  (a sketch of the masked likelihood follows below)
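For a diagonal-Gaussian state the marginalization is trivial: integrating out the missing dimensions just drops their factors from the product. A minimal sketch (names are mine):

```python
import numpy as np

def missing_data_loglik(x, mask, means, vars_):
    """log p(x_k | q) for a diagonal-Gaussian state q, evaluated only
    over the 'present' dimensions selected by mask (the segregation).
    means, vars_: (D,) per-frequency-channel parameters of state q."""
    k = np.asarray(mask, dtype=bool)
    return -0.5 * np.sum(np.log(2 * np.pi * vars_[k])
                         + (x[k] - means[k]) ** 2 / vars_[k])
```

Note the function takes the mask as given; as the slide says, finding it is the hard part.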
The Speech Fragment Decoder (Barker, Cooke, Ellis ’04)

• Standard classification chooses between
  models M to match source features X:

    M* = argmax_M P(M | X) = argmax_M P(X | M) · P(M) / P(X)

• Match the ‘uncorrupt’ spectrum to ASR
  models using missing data
• Mixtures: observed features Y, segregation S,
  all related by P(X | Y, S)

  [Figure: Observation Y(f), Source X(f), Segregation S (vs. freq)]

• Joint search for model M and segregation S
  to maximize:

    P(M, S | Y) = P(M) · ∫ P(X | M) · P(X | Y, S) / P(X) dX · P(S | Y)
    (the integral term is the Isolated Source Model;
     the P(S | Y) term is the Segregation Model)

  ( P(X) no longer constant )
  (a toy fragment search follows below)
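A toy sketch of the joint search, with my simplifications flagged: time-frequency “fragments” are labeled source/background, each labeling defines a segregation S whose present cells are scored against a bank of diagonal-Gaussian spectral models (a stand-in for real ASR state sequences), and log P(S|Y) comes from a caller-supplied function (for instance one built from CASA cues, as the next slide suggests).

```python
import numpy as np
from itertools import product

def fragment_decode(Y, fragments, models, log_p_seg):
    """Y: (F, T) observed log-spectrogram.
    fragments: list of boolean (F, T) masks, one per fragment.
    models: list of (means, vars) arrays shaped like Y (toy stand-ins
    for ASR models M). log_p_seg(labels) -> log P(S | Y)."""
    best_score, best_mask = -np.inf, None
    for labels in product([0, 1], repeat=len(fragments)):
        # the segregation: union of fragments labeled as the source
        S = np.zeros(Y.shape, dtype=bool)
        for lab, frag in zip(labels, fragments):
            if lab:
                S |= frag
        if not S.any():
            continue
        # missing-data match of the 'uncorrupt' cells to the best model
        ll = max(
            -0.5 * np.sum(np.log(2 * np.pi * v[S])
                          + (Y[S] - m[S]) ** 2 / v[S])
            for m, v in models)
        score = ll + log_p_seg(labels)
        if score > best_score:
            best_score, best_mask = score, S
    return best_mask, best_score
```

This toy omits the P(X | Y, S)/P(X) correction that makes likelihoods comparable across segregations of different sizes, which is exactly why the slide notes that P(X) is no longer constant.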
Using CASA cues

• P(S|Y) links acoustic information to segregation
  is this segregation worth considering?
  how likely is it?
• CASA helps search
  consider only segregations made from CASA chunks
  (CASA-style local features: periodicity/harmonicity,
  onset/continuity; a time-frequency region must be whole)
• CASA rates segregation
  construct P(S|Y) to reward CASA qualities:
  Frequency Proximity, Common Onset, Harmonicity
  (a toy scorer follows below)
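A toy P(S|Y) of the kind the slide describes, hedged heavily: the feature names, weights, and log-score form are all my own illustration, assuming a CASA front end has already rated each fragment for qualities such as common onset and harmonicity.

```python
import numpy as np

def casa_log_p_seg(labels, fragment_feats, w_onset=1.0, w_harm=1.0):
    """labels[i] in {0, 1}: fragment i assigned to the target source.
    fragment_feats[i]: dict with 'onset_sync' and 'harmonicity' in (0, 1],
    assumed precomputed by a CASA front end (hypothetical features)."""
    score = 0.0
    for lab, f in zip(labels, fragment_feats):
        if lab:   # reward source fragments that look well-formed
            score += w_onset * np.log(f['onset_sync'])
            score += w_harm * np.log(f['harmonicity'])
    return score   # unnormalized log P(S | Y)
```

This plugs directly into the fragment_decode sketch above as its log_p_seg argument.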
Learning for Separation
• Control: learn what is “reasonable”
• Input: discriminant features
learned subspaces
• Engine: clustering parameters
• Output: restoration...
Can Machine Learning Subsume CASA?
• ASA grouping cues describe real sounds
.. “anecdotally”
• Machine Learning is another way to find
regularities in large datasets
can, e.g., Roweis templates subsume harmonicity,
onset, etc.?
... and handle schema at the same time?
“cut out the (grouping cue) middleman”
• Trick is how to represent/generalize
listeners can organize novel sounds
Conclusions
• Source separation needs constraints
e.g. prior knowledge of signal form
• Memorized signals (HMMs) can be powerful
but can get very large
• Speech recognition models can be co-opted
e.g. to identify plausible subsets of regions