ICSI /ThisL status report

advertisement
ICSI /ThisL status report
December 1997
Dan Ellis
International Computer Science Institute, Berkeley CA
<dpwe@icsi.berkeley.edu>
1
Software tools & packages
2
Speech/nonspeech separation
3
Speech in reverberation
ThisL; ICSI status - Dan Ellis
1997dec10 - 1
1
Software tools & packages:
ICSI Speech Recognition system
•
New components:
- feacalc: enhanced RASTA (I/O formats,
options)
- pfile_utils: comprehensive manipulations
(editing, stats, etc.)
•
Portable package:
- first test: bring up recognizer at IDIAP
•
Visualization
- [incr Tcl] classes for display
- recognizer visualization...
ThisL; ICSI status - Dan Ellis
1997dec10 - 2
Software tools & packages:
recogui visualization
- modular objects for reuse
- simple configuration files
- broad use within ICSI
ThisL; ICSI status - Dan Ellis
1997dec10 - 3
2
Speech vs. nonspeech:
Comp. Aud. Scene Analysis for ASR
frq/Hz
•
For handling sound mixtures, attempt to
estimate individual sound sources
- listeners do this transparently
•
Previous approach (Weintraub...)
- ‘enhance-then-recognize’:
extract by periodicity, resynthesize, recognize
•
But...
- problems with ‘holes’
- which cues to separate speech?
...doesn’t exploit knowledge of speech structure
brn1h.aif
frq/Hz
3000
3000
2000
1500
2000
1500
1000
1000
600
600
400
300
400
300
200
150
200
150
100
brn1h.fi.aif
100
0.2
0.4
ThisL; ICSI status - Dan Ellis
0.6
0.8
1.0
time/s
0.2
0.4
0.6
0.8
1997dec10 - 4
1.0
time/s
Prediction-driven CASA and ASR
•
Prediction-driven CASA:
- don’t derive, but construct an explanation
consistent with observations
hypotheses
Periodic
elements
prediction
errors
input
mixture
Front end
signal
features
Compare
& reconcile
Noise
elements
Predict
& combine
Speech
elements
dd-vs-pd-1997oct
Hypothesis
management
predicted
features
- need to express ‘predictions’ in signal domain
- iterate over each component
•
Components make projections
- into e.g. ‘the space of all speech sounds’
ThisL; ICSI status - Dan Ellis
1997dec10 - 5
A Speech Hypothesis module
HMM
decoder
phonetic label
probabilities
phonetic
label sequence
Ideal
reconstitution
Neural net
classifier
cepstral
feature vectors
cepstral
feature vectors
Inverse
cepstrum
PLP
cepstrum
normalized
spectral features
partial
residual
spectrum
RASTA
normalization
words
modif-ASR
dpwe 1997apr09
conventional
speech recognizer
spectral
feature vectors
slowlyvarying
components
Denormalize
estimated
spectral
envelope
•
Want to exploit constraints of decoder
•
Invert each stage of speech recognizer
- classifier by? trained estimator
- normalization by: recovering from input
ThisL; ICSI status - Dan Ellis
1997dec10 - 6
Preliminary results
f/Hz
223cl
4000
2000
1000
400
200
0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
0.4
0.6
0.8
1.0
1.2
1.4
Speech1
Click5
20
Clicks1−7
10
0
−10
dB
0.0
0.2
time/s
•
Prediction shortfall dominates result
- improve inverse classification
- more normalization
•
To complete iteration:
- need p(q|X,M)
•
Initial separation by f0?
ThisL; ICSI status - Dan Ellis
1997dec10 - 7
Speech-in-reverberation
3
•
Modest reverb has severe impact
(RT = 0.5s, D/R ≈ 0 dB)
•
Information/combination at various timescales
- modulation spectral features, syllable units
- combine results at utterance level (Nbests)
- combine results at syllable level
(HMM decomposition, [Dupont & Bourlard ’97]
WER%
Clean speech
Reverb (6 dB SNR)
Baseline (Rasta-PLP8)
6.8
27.8
ModSpec Syllable base
9.8
30.9
Utterance-level combin’n
5.5
19.6
Syllable-level combin’n
5.4
18.6
•
2 pass decoder to avoid state explosion
- lattice output for compatibility
ThisL; ICSI status - Dan Ellis
1997dec10 - 8
Download