1 Model the Whole Speech Signal
2 Handle Mixtures
3 Respect Diversity
4 Other Remarks
Dan Ellis <dpwe@ee.columbia.edu>
Laboratory for Recognition and Organization of Speech and Audio
(LabROSA)
Columbia University, New York http://labrosa.ee.columbia.edu/
Dan Ellis Ideas for Next-Generation ASR (1 of 21) 2003-10-07
Outline
1 Model the Whole Speech Signal
- Channel, accent, style
- Timing/rate variation
- Coarticulation
2 Handle Mixtures
3 Respect Diversity
4 Other Remarks
1 Model the Whole Speech Signal
• HMM is a relatively weak model for speech
[Figure: spectrogram-like output sampled from an HMM, 0-4 kHz over 0-5 s]
- generates something speechlike, but missing the detail of real speech (...)
- exponential segment durations
• Only meant for inference of p(X | M)
- to choose between a few Ms
• What would it take to model entire signal?
- capture perceptually sufficient information
- e.g. speech coding quality
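The "exponential segment durations" point can be seen directly: a single HMM state with self-loop probability a dwells for a geometrically distributed number of frames, with mean 1/(1-a) and a monotonically decaying distribution, unlike measured phone-duration histograms. A minimal sketch (the self-loop value 0.9 is illustrative):

```python
import random

def sample_duration(self_loop_p, rng):
    """Frames spent in one HMM state with self-loop probability self_loop_p."""
    d = 1
    while rng.random() < self_loop_p:
        d += 1
    return d

rng = random.Random(0)
durs = [sample_duration(0.9, rng) for _ in range(100000)]
mean_dur = sum(durs) / len(durs)
# Geometric duration model: mean = 1 / (1 - a) = 10 frames for a = 0.9,
# and P(d) = a**(d-1) * (1-a) decays monotonically from d = 1 -- no mode
# at a typical phone length, unlike real speech.
print(round(mean_dur, 1))
```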
• Factors affecting spectral distributions
- just absorbed into model variance?
- or: adapted generically e.g. MLLR
• Channel
- typically fixed per session
• Accent
- typically fixed per speaker
• Style
- particular to application?
• Modeling these factors explicitly
- improves generalization: reduces the variance of models, hence..
- allows better discrimination of voice from other random stuff
• Timing is weakly modeled with current HMMs
- duration models have little influence on WER
• Some baseline variation
- small improvements with rate-dependent models
• Greatest variation is within-phrase?
[Figure: spectrogram (0-4 kHz, 0-3 s) with time-aligned phone labels (s ow ay th aa dx ax b aw ... l p aa s b ax l) and word labels: SO, I, THOUGHT, THAT, ABOUT, AND, I, THINK, IT'S, STILL, POSSIBLE]
- models for within-phrase rate-contours
- use as constraints on recognition
(contrast likelihoods of alternate hypotheses e.g. in rescoring)
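A sketch of how a within-phrase rate constraint could be applied in rescoring. The penalty function and its weight here are hypothetical stand-ins for a learned rate-contour model; they only illustrate the idea of contrasting the timing plausibility of alternate hypotheses:

```python
import math

def rate_contour_score(durations, smooth=3):
    """Penalty for jagged within-phrase timing: compare each phone's
    log-duration to a moving average over its neighbours (a crude
    stand-in for a trained rate-contour model)."""
    logd = [math.log(d) for d in durations]
    penalty = 0.0
    for i, ld in enumerate(logd):
        lo, hi = max(0, i - smooth), min(len(logd), i + smooth + 1)
        local = sum(logd[lo:hi]) / (hi - lo)
        penalty += (ld - local) ** 2
    return -penalty  # higher (closer to zero) is better

def rescore(hypotheses):
    """hypotheses: list of (acoustic_loglik, [phone durations in frames])."""
    lam = 0.5  # weight on the timing term (would be tuned)
    return max(hypotheses,
               key=lambda h: h[0] + lam * rate_contour_score(h[1]))

# Two hypotheses with equal acoustic score: the one with smoother
# phone timing wins under the rate-contour constraint.
best = rescore([(-100.0, [3, 30, 2, 25, 3]),      # erratic timing
                (-100.0, [10, 12, 11, 13, 12])])  # smooth timing
print(best[1])
```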
• HMMs are piecewise-constant feature models
- ... at least with Viterbi decoding
- delta features can map trajectories to more constant values, but that's just a 'patch'
• More states, more context-dependence reduces mismatch
- but model is too general: adjacent states are in fact strongly related
- consequence: insatiable hunger for 1000s of hours of training data
• Generative models of coarticulation are not particularly hard
- e.g. HDM, SSM (Deng, Bridle, ...)
- inference is hard...
- ... but many new techniques from the Machine Learning community (MCMC, variational, ...)
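The spirit of such models can be sketched with a first-order target-directed dynamic, a toy stand-in for HDM/SSM (the targets, rate constant, and frame counts below are illustrative, not from any cited system):

```python
def hidden_dynamic_trajectory(targets, frames_per_phone=10, alpha=0.7):
    """First-order target-directed dynamics:
    x[t+1] = alpha * x[t] + (1 - alpha) * target.
    The hidden variable never quite reaches each target, so each phone's
    realization depends on its neighbours -- coarticulation by construction."""
    x, traj = targets[0], []
    for tgt in targets:
        for _ in range(frames_per_phone):
            x = alpha * x + (1 - alpha) * tgt
            traj.append(x)
    return traj

traj = hidden_dynamic_trajectory([1.0, 3.0, 1.0], frames_per_phone=5)
# The second phone's peak is pulled below its target 3.0 by the
# neighbouring phones' target 1.0 -- an undershoot an HMM cannot generate.
print(max(traj) < 3.0)
```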
Outline
1 Model the Whole Speech Signal
2 Handle Mixtures
- Frontier applications
- Auditory Scene Analysis
- Multisource models
3 Respect Diversity
4 Other Remarks
2 Handle Mixtures
• Historically, speech recognition was made tractable by limiting domain to pure speech
- limited problem still hard enough
- as a consequence: systems discard information that distinguishes speech/nonspeech
• Many (most?) "frontier applications" involve nonspeech and mixtures
- meeting recordings
- multimedia indexing
- “Lifelog” audio diary
• Separate signals , then recognize
- Computational Auditory Scene Analysis (CASA),
Independent Component Analysis
- nice, if you can make it work
• Recognize combined signal
- ‘multicondition training’
- combinatorics seem daunting
• Recognize with parallel models
- optimal inference from the full joint state-space
- or: skip obscured fragments, infer from higher-level context
- or do both: missing-data recognition
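The daunting combinatorics of joint inference are easy to quantify: decoding N sources in parallel exponentiates the state space. A sketch (the model size is illustrative, not from the talk):

```python
def joint_states(states_per_source, n_sources):
    """Size of the combined state space when n sources are decoded jointly:
    every combination of per-source states is a distinct joint state."""
    return states_per_source ** n_sources

# A modest 3000-state recognizer applied jointly to 1, 2, 3 sources:
for n in (1, 2, 3):
    print(n, joint_states(3000, n))
```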
• Can evaluate speech models p(x | M) over a subset of k dimensions x_k, marginalizing the unobserved dimensions x_u:
p(x_k | M) = ∫ p(x_k, x_u | M) dx_u
- if the mixture bounds the obscured values (x_u < y), use the bounded marginal p(x_k | x_u < y) rather than the plain marginal p(x_k)
• Hence, missing data recognition:
given a present-data mask over the time-frequency plane,
P(x | q) = P(x_1 | q) · P(x_2 | q) · ... · P(x_6 | q)
evaluated over only the present dimensions
.. but need the segregation mask
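A minimal sketch of the masked likelihood, assuming a diagonal-covariance Gaussian state model (so marginalizing a missing dimension just drops its factor from the product):

```python
import math

def missing_data_loglik(x, mask, means, variances):
    """log P(x | q) under a diagonal Gaussian, taken over only the
    dimensions marked present in the mask; missing dimensions are
    marginalized away, which for a diagonal model just drops them."""
    ll = 0.0
    for xi, present, m, v in zip(x, mask, means, variances):
        if present:
            ll += -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
    return ll

x         = [0.1, 9.9, 0.2]       # dimension 2 is dominated by an interferer
mask      = [True, False, True]   # present-data mask from segregation
means     = [0.0, 0.0, 0.0]
variances = [1.0, 1.0, 1.0]
clean  = missing_data_loglik(x, [True] * 3, means, variances)
masked = missing_data_loglik(x, mask, means, variances)
print(masked > clean)  # masking the corrupt dimension rescues the match
```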
• Fit model and segregation jointly given the observations:
P(M, S | Y) ∝ P(M) · ∫ [P(X | M) / P(X)] P(X | Y, S) dX · P(S | Y)
(integrating over the possible values X of the obscured source)
• Search for more than one source
[Figure: observed mixture Y(t) explained by two sources S_1(t), S_2(t) with model states q_1(t), q_2(t)]
• Mutually-dependent data masks
• Use e.g. CASA features to propose masks
- locally coherent regions
• Lots of issues in models, representations, matching, inference...
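One way to turn "locally coherent regions" into candidate masks is connected-component labeling of a binary time-frequency map. This is a crude stand-in for real CASA grouping cues (harmonicity, onset, continuity), but it shows the fragment-proposal step:

```python
def coherent_regions(mask):
    """Label 4-connected regions of a binary time-frequency mask --
    each region becomes one candidate fragment to assign to a source."""
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    n = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not labels[r][c]:
                n += 1
                stack = [(r, c)]          # iterative flood fill
                while stack:
                    i, j = stack.pop()
                    if 0 <= i < rows and 0 <= j < cols \
                            and mask[i][j] and not labels[i][j]:
                        labels[i][j] = n
                        stack += [(i + 1, j), (i - 1, j),
                                  (i, j + 1), (i, j - 1)]
    return n, labels

# Two disjoint energy blobs -> two candidate fragments.
tf = [[1, 1, 0, 0],
      [1, 0, 0, 1],
      [0, 0, 1, 1]]
n, labels = coherent_regions(tf)
print(n)
```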
Outline
1 Model the Whole Speech Signal
2 Handle Mixtures
3 Respect Diversity
- Class-specific classifiers
- Using different information differently
4 Other Remarks
3 Respect Diversity
• Speech signal is very diverse
- different kinds of phonemes (vowels, stops...)
- different kinds of information (lexical, affective...)
- different timescales (phones, words, phrases...)
• Information needs are diverse
- phoneme classification
- syllable detection
- phrase detection
• Technical approaches are diverse
- the more different they are, the bigger the gain from combination
- ‘Rover effect’
(Scanlon & Ellis, Eurospeech ’03)
• Mutual Information in time-frequency:
[Figure: MI (bits) between time-frequency cells and labels, mapped over ±200 ms of context: all phones, stops, vowels, speaker identity (vowels)]
• Use MI to select classifier input features:
[Figure: selected feature masks (IRREG, IRREG+CHKBD; 9-133 features) and phone accuracy (~45-70%) vs. number of features for IRREG+CHKBD, RECT+CHKBD, IRREG, RECT, and RECT with all frames]
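The MI criterion used for feature selection can be sketched directly: estimate I(X;Y) between a (quantized) time-frequency feature and the class label, and keep the highest-MI cells. A minimal discrete estimator (the toy data is illustrative):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X;Y) in bits for two discrete sequences, e.g. a quantized
    time-frequency cell value vs. a phone label."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Sanity check: a feature that copies the label carries maximal MI;
# an independent feature carries none.
phone_labels = [0, 1, 0, 1, 0, 1, 0, 1]
informative  = phone_labels
unrelated    = [0, 0, 1, 1, 0, 0, 1, 1]
print(round(mutual_information(informative, phone_labels), 2))  # 1.0 bit
print(round(mutual_information(unrelated, phone_labels), 2))    # 0.0 bits
```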
• Integrating paradigms have lots of power
- e.g. the HMM does it all: time warp ... LM
- ... but can we gain by breaking up the tasks?
- separate vowel-center detection & consonant "adornment" classification → "event-based" recognition
- separate specialized detectors
- acoustic-phonetic features .. or data-derived near-equivalents from e.g. Independent Component Analysis
[Figure: detector outputs for TIMIT utterance test/dr1/faks0/sa2, with vowel labels (ow ae iy eh r oy ay dh); ICA basis vectors plotted against frequency / Bark]
Outline
1 Model the Whole Speech Signal
2 Handle Mixtures
3 Respect Diversity
4 Other Remarks
- Combinations & infrastructure
- How much data?
- Blackboards
4 Other Remarks
• Combine after each stage of the recognizer:
[Figure: three architectures for combining two feature streams]
- feature combination: combine the feature streams before a single acoustic classifier → 10% RER improvement (2 streams)
- posterior combination: separate acoustic classifiers, combine phone probabilities before one HMM decoder → 20% RER improvement (2 streams)
- hypothesis combination: separate classifiers and decoders, combine word hypotheses → 25% RER improvement (5-stream ROVER)
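A minimal sketch of the middle option, frame-level posterior combination. The geometric-mean (log-domain averaging) rule shown here is one common choice; the cited systems may use a different combination rule:

```python
import math

def combine_posteriors(p1, p2):
    """Frame-level posterior combination: geometric mean of two streams'
    phone posteriors, renormalized (equivalent to averaging log posteriors)."""
    g = [math.sqrt(a * b) for a, b in zip(p1, p2)]
    z = sum(g)
    return [v / z for v in g]

# Stream 1 is confused between phones 0 and 1; stream 2 disfavours phone 1,
# so the combined posterior comes down on phone 0.
stream1 = [0.45, 0.45, 0.10]
stream2 = [0.60, 0.10, 0.30]
combined = combine_posteriors(stream1, stream2)
print(combined.index(max(combined)))  # phone 0 wins
```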
(Hermansky, Ellis, Sharma, ICASSP’00)
• To combine Neural Net models with HMMs:
[Figure: Tandem pipeline: input sound → feature calculation → speech features → neural net model → pre-nonlinearity outputs → PCA orthogonalization → orthogonal features → HTK GM model → subword likelihoods → HTK decoder → words]
• Result: better performance than either alone!
- Tandem alone: 20% RER improvement
- Posterior combination + Tandem: 50% RER improvement
• Excellent infrastructure for feature experiments
- nets are tolerant of feature eccentricities
- e.g. MSG features → HTK has double the WER of the Tandem version, MSG → net → HTK
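The PCA-orthogonalization stage of the pipeline can be sketched with NumPy. The function name and data below are hypothetical; the point is that decorrelating the net's pre-nonlinearity outputs makes them better suited to the diagonal-covariance Gaussian-mixture stage:

```python
import numpy as np

def tandem_features(pre_nonlin, n_keep):
    """Decorrelate a net's pre-nonlinearity outputs with PCA (the
    'PCA orthogonalization' stage feeding the HTK GM model)."""
    X = pre_nonlin - pre_nonlin.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)          # ascending eigenvalues
    order = np.argsort(vals)[::-1][:n_keep]   # keep the top components
    return X @ vecs[:, order]

rng = np.random.default_rng(0)
raw = rng.normal(size=(500, 8))
raw[:, 1] = raw[:, 0] + 0.01 * rng.normal(size=500)  # two correlated dims
feats = tandem_features(raw, n_keep=4)
# Off-diagonal correlation is (numerically) gone after orthogonalization.
print(np.abs(np.corrcoef(feats, rowvar=False) - np.eye(4)).max() < 1e-6)
```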
• Near-unanimous calls for more data
- sure-fire way to improve accuracy .. a little
- labeled data is expensive, hence limited
• How much data do we need?
- see examples of ‘all’ speech variants?
- infant example: 6 hr/day = 2000 hr/yr
= 10,000 hr by age 5 (Moore graph)
- brute-force recognition-by-matching: every possible syllable? word? phrase? × voices
- 100 voices × 2k syllables × 4/sec × 100 contexts = 1,600 hr (6k syl/hr)
(but: distribution of examples)
• What about generalization ???
- the goal should be abstraction of patterns from the example corpus
- i.e. marginals, not full volume of examples
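The infant-exposure estimate above checks out as an order-of-magnitude figure:

```python
# Back-of-envelope check of the exposure estimate:
# 6 hr/day of heard speech, every day.
hours_per_year = 6 * 365         # 2190, which the slide rounds to 2,000 hr/yr
by_age_five = 5 * hours_per_year # 10,950, i.e. order 10,000 hr by age 5
print(hours_per_year, by_age_five)
```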
[Figure: blackboard architecture: a scheduler choosing the next action from actions & plans, a problem-solving model, and hypotheses posted on a blackboard]
• Events, tiers, hypothesis generation & verification = the Hearsay blackboard (1973)
Blackboard
• What went wrong last time?
- bad knowledge, blame allocation
- inefficient decoding
- how to incorporate training?
• So, this time around?
- new mechanisms for blame?
- inefficiencies don't matter so much?
- induction of rules from data?
• Accept that sound is often/usually a mixture
- combine models and/or carve up features
• Use more detailed models of speech
- so we can still recognize after carving up
• Tandem models as enabling infrastructure
- able to glean value from wacky features
• Find novel approaches for recognition
- vowel nuclei + adornments?