Improved recognition by combining
different features and different systems
Dan Ellis
International Computer Science Institute
Berkeley CA
dpwe@icsi.berkeley.edu
Outline
1. The power of combination
2. Different ways to combine
3. Examples & results
4. Conclusions
AVIOS - Combinations - Dan Ellis
2000-05-24 - 1
1. The power of combination

• Combination is a general approach in statistics
  - several models → several estimates
  - if the 'correct' parts are more consistent than the 'wrong' parts,
    averaging reduces error
• Intuition: choose the 'right' model for each case
  - need to know when each model is 'right'
  - need models that are 'right' at different times
• Continuum of combination schemes
  - conservative, ignorant of the models
  - domain-specific, trained on the models
Relevant aspects of the speech signal

• Speech is highly redundant
  - intelligible despite large distortions
  - multiple cues for each phoneme
• Speech is very variable
  - redundancy leaves room for variability
  - speakers can use different subsets of cues
[Figure: spectrograms (f/Hz vs. time/s) of three speakers (FAG_54A.08, FEE_54A.08, FEI_54A.08) saying "five four", phone-labeled h# f ay v f ao r]
Combinations for speech recognition

• Speech recognition abounds with different models
  - different feature processing, statistical models, search techniques ...
• Redundancy in the signal
  → many different ways to estimate, making different kinds of errors
• General combination is easier than figuring out what is good in each estimator
  - training data & the training process are usually the limiting factor
Outline
1. The power of combination
2. Different ways to combine
   - Feature combination
   - Posterior combination
   - Hypothesis combination
   - System hybrids
3. Examples & results
4. Conclusions
2. Speech recognizer components

  sound → [Feature calculation] → feature vectors
        → [Acoustic classifier + network weights] → phone probabilities
        → [HMM decoder + word models + language model] → phone & word labeling
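The component chain above can be sketched as a function pipeline; the stage functions below are illustrative stand-ins, not a real recognizer:

```python
# Hypothetical skeleton of the recognizer pipeline: each stage is a
# function, so different feature calculations or classifiers can be
# swapped in without touching the rest of the chain.
def recognize(sound, calc_features, classify, decode):
    features = calc_features(sound)   # sound -> feature vectors
    phone_probs = classify(features)  # feature vectors -> phone probabilities
    return decode(phone_probs)        # probabilities + word/language models -> labeling

# toy stand-ins for the three stages
words = recognize(
    [0.1, -0.2, 0.3],
    calc_features=lambda s: [abs(v) for v in s],
    classify=lambda f: [v / sum(f) for v in f],
    decode=lambda p: "phone %d" % p.index(max(p)),
)
```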
Different ways to combine

• After each stage of the recognizer:

  Feature combination:
    input sound → Feature 1 / Feature 2 calculation → feature combination
    → speech features → acoustic classifier → phone probabilities
    → HMM decoder → word hypotheses

  Posterior combination:
    input sound → Feature 1 / Feature 2 calculation → acoustic classifiers
    → posterior combination → phone probabilities
    → HMM decoder → word hypotheses

  Hypothesis combination:
    input sound → Feature 1 / Feature 2 calculation → acoustic classifiers
    → HMM decoders → word hypotheses
    → hypothesis combination → final hypotheses

• Lots of other ways...
Feature combination (FC)

• Concatenate the different feature vectors
• Train a single acoustic model

[Figure: plp log-spect features and msg1a features (freq/Bark vs. time/s)]

• It helps:

  Features       Avg. WER
  plp            8.9%
  msg            9.5%
  FC plp + msg   8.1%
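FC amounts to per-frame concatenation; a minimal sketch (illustrative dimensions, assuming the two streams are frame-aligned):

```python
import numpy as np

def combine_features(plp, msg):
    """Feature combination (FC): concatenate per-frame feature vectors
    from two streams so a single acoustic model sees both."""
    assert plp.shape[0] == msg.shape[0], "streams must be frame-aligned"
    return np.hstack([plp, msg])

# e.g. 100 frames of 13-dim PLP and 28-dim MSG features (sizes are
# illustrative, not taken from the talk)
plp = np.random.randn(100, 13)
msg = np.random.randn(100, 28)
fc = combine_features(plp, msg)  # shape (100, 41)
```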
Posterior combination (PC)

[Figure: phone posteriorgrams for noisy digits utterance N3_SNR5/FIL_3O37A from plp12ddN cepstra, from msg3N features, and from their posterior combination]

• Sometimes better than FC:

  Features       Avg. WER
  plp            8.9%
  msg            9.5%
  PC plp + msg   7.1%
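One common way to realize PC is a log-domain (geometric) average of the per-frame phone posteriors from the two classifiers; this sketch uses equal weights, which are illustrative rather than from the talk:

```python
import numpy as np

def posterior_combination(p1, p2, weights=(0.5, 0.5)):
    """Posterior combination (PC): merge per-frame phone posteriors from
    two classifiers via a weighted log-domain average, then renormalize
    so each frame's posteriors sum to one."""
    logp = weights[0] * np.log(p1 + 1e-12) + weights[1] * np.log(p2 + 1e-12)
    p = np.exp(logp)
    return p / p.sum(axis=1, keepdims=True)  # renormalize over phones
```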
Hypothesis combination (HC)
(J. Fiscus, NIST)

• ROVER: Recognizer Output Voting Error Reduction
• Take the final outputs from several recognizers, each word tagged with a confidence
• Align them & vote for the resulting word sequence
• 25% relative improvement over the best single Broadcast News system (14.1% → 10.6%)
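A toy sketch of the voting step, assuming the word outputs have already been aligned into per-position slots (the real ROVER builds this alignment with dynamic programming; the count/confidence mixing weight here is illustrative):

```python
from collections import defaultdict

def rover_vote(aligned, conf_weight=0.5):
    """ROVER-style voting over pre-aligned word slots.

    `aligned` is a list of slots; each slot holds one (word, confidence)
    pair per recognizer. Each word's score mixes its vote frequency with
    its tagged confidences; the highest-scoring word wins the slot."""
    result = []
    for slot in aligned:
        scores = defaultdict(float)
        n = len(slot)
        for word, conf in slot:
            scores[word] += (1 - conf_weight) / n + conf_weight * conf
        result.append(max(scores, key=scores.get))
    return result
```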
Other combinations: 'Tandem' acoustic modeling
(with Hermansky et al., OGI)

• Can we combine with conventional models?

  input sound → feature calculation → speech features
  → neural net model → phone probabilities → posterior decoder → words
  ... or take the pre-nonlinearity outputs
  → PCA orthogonalization → orthogonal features
  → HTK GM model → subword likelihoods → HTK decoder → words

• Result: better performance than either alone!
  - neural net & Gaussian mixture models extract different information
    from the training data

  System-features    Avg. WER
  HTK-mfcc           13.7%
  Neural net-mfcc    9.3%
  Tandem-mfcc        7.4%
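The PCA orthogonalization step can be sketched in plain numpy (not HTK); this is a minimal illustration of decorrelating the net's pre-nonlinearity outputs so they better suit diagonal-covariance Gaussians:

```python
import numpy as np

def tandem_features(pre_softmax, n_components=None):
    """Tandem sketch: project the neural net's pre-nonlinearity outputs
    onto the eigenvectors of their covariance (PCA), yielding
    decorrelated features for a conventional GM/HMM system."""
    x = pre_softmax - pre_softmax.mean(axis=0)  # center each dimension
    cov = np.cov(x, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]           # largest variance first
    basis = eigvecs[:, order[:n_components]]
    return x @ basis                            # orthogonalized features
```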
How & what to combine
(with Jeff Bilmes, U. Washington)

• Combination is good, but...
  - which streams should we combine?
  - which combination methods should we use?
• The best streams to combine have complementary information
  - can be measured as conditional mutual information, I(X;Y|C)
  - low CMI suggests combination potential
• The choice of combination method depends on the streams:
  - FC for streams with complex interdependence
  - PC makes the best use of limited training data
  - HC allows different subword units
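For discrete (e.g. vector-quantized) streams X, Y and class labels C, a plug-in estimate of I(X;Y|C) is straightforward; this estimator is purely illustrative, not the one used in the talk:

```python
import numpy as np
from collections import Counter

def conditional_mutual_information(x, y, c):
    """Plug-in estimate of I(X;Y|C) in bits for discrete sequences:
    sum over (x,y,c) of p(x,y,c) * log2( p(c) p(x,y,c) / (p(x,c) p(y,c)) ),
    with all probabilities taken as empirical frequencies."""
    n = len(x)
    pxyc = Counter(zip(x, y, c))
    pxc = Counter(zip(x, c))
    pyc = Counter(zip(y, c))
    pc = Counter(c)
    cmi = 0.0
    for (xi, yi, ci), nxyc in pxyc.items():
        p_xyc = nxyc / n
        cmi += p_xyc * np.log2((pc[ci] / n) * p_xyc /
                               ((pxc[(xi, ci)] / n) * (pyc[(yi, ci)] / n)))
    return cmi
```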
Outline
1. The power of combination
2. Different ways to combine
3. Examples & results
   - The SPRACH system for Broadcast News
   - OGI-ICSI-Qualcomm Aurora recognizer
4. Conclusions
The SPRACH Broadcast News recognizer
(with Cambridge & Sheffield)

• Multiple feature streams
• MLP and RNN models, including PC
• ROVER for hypothesis combination

  speech → Perceptual Linear Prediction → CI RNN → Chronos decoder ┐
  speech → Modulation Spectrogram → CI MLP → Chronos decoder       ├→ ROVER → utterance hypothesis
  speech → Perceptual Linear Prediction → CD RNN → Chronos decoder ┘

• 20% WER overall (c/w ~27% per stream)
The OGI-ICSI-Qualcomm system for Aurora noisy digits

• PC of feature streams + tandem combination of neural nets and
  Gaussian mixture models:

  input sound → plp calculation → neural net model   ┐
  input sound → msg calculation → neural net model   ├→ × → phone probabilities
  input sound → traps calculation → neural net model ┘
  → HTK GM model → subword likelihoods → HTK decoder → word hypotheses

• 60% reduction in word errors compared to the MFCC-HTK baseline
Aurora evaluation results

[Figure: ETSI Aurora 1999 Evaluation: bar chart of average WER reduction vs. the baseline (-20–0) for the baseline, systems S1–S6, and the Tandem 1 and Tandem 2 submissions]
4. Conclusions

• Combination is a simple way to leverage multiple models
  - speech recognition has lots of models
  - redundancy in the speech signal allows each model to find
    different 'information'
• Lots of ways to combine
  - after feature extraction, classification, decoding
  - 'tandem' hybrids etc.
• Significant gains from simple approaches
  - e.g. 25% relative improvement from posterior averaging
• Better insight into the sources of these benefits will lead to
  even greater gains