Columbia: Recent + Future • - More information

advertisement
Columbia: Recent + Future
information
• -More
FDLP / PLP2 features
classifiers
• -Better
MI-based broad-class experts
variability
• -Reducing
Temporal variation
-
Formant “automatic gain control” (AGC)
model
• -Signal
“Deformable spectrograms”
EARS Retreat - Dan Ellis
2004-08-05
Broad-Class Experts
Patricia Scanlon
• MI-based feature masks make superior classspecific classifiers (vowels, stops...)
• smaller models: good for data-limited case
• Apply to ASR by ‘patching in’ probabilities via
separate broad-class center detector
EARS Retreat - Dan Ellis
2004-08-05
MI-Based Class Experts
• Idea: Different speech
sounds have different
information distribution
-
.. as identified by
MI to phone | class
for reducing model complexity
• -Good
benefits disappear given enough data
measuring joint MI
• -Not
quick hack: checkerboard
EARS Retreat - Dan Ellis
2004-08-05
Broad-Class Detector
gives Pr(phone | class, features)
• -Expert
still need Pr(class | features)
same approach
• -Repeat
separate detectors for each broad class
-
measure MI from TF cell to class
train MLP from those features
-
reasonable vowel recognition with
10% insertions (6.3% deletions) of centers
accept/false detect tradeoff
• -False
try to detect only center of phone
EARS Retreat - Dan Ellis
2004-08-05
Overall System
• ‘Patch in’ expert’s posteriors:
P(qi|X) =
-
! P(qi|class, X) · P(class|X)
class
‘non-expert’ MLP for when P(class|X) are small
TIMIT phone err rate
Baseline:
28.4%
Oracle P(VC): 26.9%
Real P(VC): 28.0%
Vowels+Frics: 27.6%
at:
• -Stillusinglooking
more experts
-
better P(class|X)
EARS Retreat - Dan Ellis
2004-08-05
Temporal Variation
• Idea:
Sambarta Bhattacharjee
Banky Omodunbi
Normalize phone durations by averages
-
.. to reveal per-speaker bias
.. and timing variation within phrases
on vowels
• -Focus
per-phone
deviations
are very noisy
• Use to vary sampling/modeling?
EARS Retreat - Dan Ellis
2004-08-05
Formant AGC
Eric Fuller
Sambarta Bhattacharjee
• Hypothesis:
Casual speech has ‘compressed’ formant motion
-
can we ‘enhance’ format motions
to make speech more canonical / read-like?
freq / kHz
Rhonda Speaking 5
4
3
2
1
freq / Bark
0
15
10
freq / Bark
5
15
10
5
0
0.5
EARS Retreat - Dan Ellis
1
1.5
2
2.5
3
3.5
4
4.5
time / s
2004-08-05
Read vs. Spontaneous
• Speaker-dependent means, vars of PLP pole
locations in read vs. spontaneous speech
-
variance of angle of pole 3 discriminates well for
red and green speakers - but opposite changes!
EARS Retreat - Dan Ellis
2004-08-05
Deformable Spectra
• Accurate spectral modeling in
Nebojsa Jojic (MSR)
Manuel Reyes
conventional HMMs requires 1000s of states
-
cumbersome, especially transition matrices
• Observation:
Speech spectra undergo minor deformations
-
suggests a different generative model:
9
X9t
Xt-1
8
Xt-1
X8t
Transformation
7
matrix T
Xt-1
X7t
6
Xt-1
Xt6
00100
5
= Xt5
0 0 0 1 0 • Xt-1
4
00001
Xt-1
Xt4
3
X3t
NP=5 Xt-1
2
Xt-1
X2t
1
Xt-1
X1t
EARS Retreat - Dan Ellis
NC=3
2004-08-05
States+Transformation Model
• Time-frequency state grid
→
• -State
explicit prototype
a)
-
•
or a transformation
on prior frame
b)
T51
T41
X51
X50
T!31
X41
T!22
T!21
X13
X30
X21
X20
frequency
X10
T12
time
Yellow/Orange:
Upward motion
(darker is steeper)
3
b) Transformation Map
"
"#$
T!11
Green:
Identity transform
2
EARS Retreat - Dan Ellis
T13
X11
1
a) Signal
%
T!23
X40
Infer underlying states
!
!
Blue:
Downward motion
(darker is steeper)
2004-08-05
Two-layer model
decomposition
• -Source-filter
pitch and formants have different dynamics
transformation models for both
• -Apply
log-spectra:
-
sum of excitation & filter
inference does separation
!#
!"
"
'
(
$
$%&
=
Signal
Selected Bin
EARS Retreat - Dan Ellis
+
Harmonics
Harmonic Tracking
$
$%&
Formants
Formant Tracking
2004-08-05
Download