Current work at ICSI

Dan Ellis
International Computer Science Institute, Berkeley CA
<dpwe@icsi.berkeley.edu>
Outline
1. Broadcast News MLP recognizer
2. Topic modeling
3. Acoustic segment classification
4. Thisl demonstrator front-end
Thisl ICSI Status - Dan Ellis
1999feb03 - 1
The modulation-filtered spectrogram
(Brian Kingsbury)
• Goal: invariance to variable acoustics
- filter out irrelevant modulations
- channel adaptation (on-line automatic gain control)
- multiple representations
[Figure: msg processing chain. speech → Bark-scale power-spectral filterbank → envelope filtering → automatic gain control (AGC) → features. Lowpass branch: 0-16 Hz filter, AGC stages with τ = 160 ms and τ = 320 ms, giving the lowpass features. Bandpass branch: 2-16 Hz filter, AGC stages with τ = 160 ms and τ = 640 ms, giving the bandpass features.]
• Results (small vocabulary):
Feature   Clean test WER   Reverb test WER
plp       5.9%             22.2%
msg       6.1%             13.8%
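The two WER columns can also be read as a relative error reduction (RER), the measure quoted later in these slides. A minimal sketch, assuming RER is defined as (baseline WER - new WER) / baseline WER; the function and variable names are illustrative only:

```python
def relative_error_reduction(baseline_wer, new_wer):
    """Fraction of the baseline word errors removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer

# WER figures (percent) copied from the small-vocabulary results table.
plp = {"clean": 5.9, "reverb": 22.2}
msg = {"clean": 6.1, "reverb": 13.8}

# RER of msg measured against plp as the baseline, per test condition.
rer = {cond: relative_error_reduction(plp[cond], msg[cond]) for cond in plp}

for cond, value in rer.items():
    print(f"{cond}: {100 * value:+.1f}% RER (msg vs. plp)")
```

On these figures msg removes a little over a third of the plp errors in reverberation, at the cost of a slightly higher WER on the clean test.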
Broadcast News recognizer
• 1998 evaluation - RNN + MLP
• 8000 HU (hidden unit) nets trained for MLP-only system:
combo WER%   RNN98   MSG-8kHz   PLP-16kHz
RNN98        27.2    24.9       24.5
MSG-8kHz             29.7       24.4
PLP-16kHz                       25.5
- RNN+MSG+PLP: 23.7%
- plp 8000HU forward-pass ~0.7x real time (spert)
• Gender-dependent versions:
net set      WERF%   WERM%   WER%
plp-GD       20.3    27.2    24.6
msg-GD
plp+msg-GD
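The slide reports WERs for combined systems (e.g. RNN+MSG+PLP: 23.7%) but does not say how the streams were merged. As an illustration only, one common way to combine per-frame phone posteriors from several nets is to average them in the log domain and renormalize. The sketch below assumes that scheme; the function name, the floor value, and the toy posteriors are all hypothetical:

```python
import math

def combine_posteriors(streams, floor=1e-8):
    """Average log posteriors across streams for one frame, then renormalize.

    streams: list of posterior vectors, one per net, over the same classes.
    floor:   lower bound applied before the log to avoid log(0) (assumed value).
    """
    n_classes = len(streams[0])
    log_avg = [
        sum(math.log(max(s[k], floor)) for s in streams) / len(streams)
        for k in range(n_classes)
    ]
    peak = max(log_avg)  # subtract the peak for numerical stability
    unnorm = [math.exp(v - peak) for v in log_avg]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical single-frame posteriors over three phone classes.
plp_post = [0.70, 0.20, 0.10]
msg_post = [0.50, 0.40, 0.10]
combined = combine_posteriors([plp_post, msg_post])
print(combined)
```

Log-domain averaging behaves like a geometric mean, so a class must be reasonably likely in every stream to score well; a linear average would be the obvious alternative.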
Broadcast News: ongoing
• Dynamic pronunciations (Eric Fosler)
- data-derived rules for context-dependent pronunciations: phones, syllables, words, rate ...
- rescored N-best output from 1st pass
- ~3% relative error reduction (RER)
• Multiband (Adam Janin / Nikki Mirghafori)
- 20% RER for small-vocabulary (Numbers)
- no significant improvement yet for BN
- features: MSG, cepstra, KLT, plp
- all possible combinations & weights
Multiband for Broadcast News
(Adam Janin / Nikki Mirghafori)
• Scheme that worked best for small vocab:
- 4-way frequency split
- plp cepstra+deltas within each band
- MLP classifier for each band + MLP combiner
[Figure: multiband block diagram. Speech Signal → Front end (Power in each band) → Prob. estimator per band → MLP Merger → Viterbi Decoder → Recognized Words]
• Weighted average of all possible combos
- p(q | a,b,c,d) = ∑_S p(q | S,a,b,c,d) · p(S), where S ranges over the 16 possible band combinations
- where does p(S) come from? A constant, or a local feature (entropy)
- oracle best p(S) → WER = 19% (25% RER)
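The weighted average above can be sketched as follows. This is a toy illustration under stated assumptions, not the system's implementation: the 16 combinations are taken to be all subsets of the four bands a, b, c, d (including the empty set), the per-combination posteriors are made-up numbers standing in for real estimator outputs, and the entropy-based p(S) is just one hypothetical way to turn a local entropy feature into weights.

```python
import itertools
import math

BANDS = ("a", "b", "c", "d")
# All subsets of the four bands: 2**4 = 16 combinations (assumed reading of "16").
COMBOS = [
    frozenset(c)
    for r in range(len(BANDS) + 1)
    for c in itertools.combinations(BANDS, r)
]

def entropy(p):
    """Shannon entropy (nats) of one posterior vector."""
    return -sum(x * math.log(x) for x in p if x > 0)

def constant_weights(combos):
    """p(S) as a constant: every combination equally weighted."""
    return {S: 1.0 / len(combos) for S in combos}

def entropy_weights(post_by_combo):
    """Hypothetical p(S): favor combinations with low-entropy posteriors."""
    scores = {S: math.exp(-entropy(p)) for S, p in post_by_combo.items()}
    total = sum(scores.values())
    return {S: v / total for S, v in scores.items()}

def merge(post_by_combo, weights):
    """p(q | a,b,c,d) = sum over S of p(q | S,a,b,c,d) * p(S)."""
    n_classes = len(next(iter(post_by_combo.values())))
    return [
        sum(weights[S] * post_by_combo[S][q] for S in post_by_combo)
        for q in range(n_classes)
    ]

# Toy per-combination posteriors over 3 classes for a single frame:
# combinations using more bands are made sharper (purely illustrative).
post_by_combo = {
    S: [(1 + len(S)) / (3 + len(S)), 1 / (3 + len(S)), 1 / (3 + len(S))]
    for S in COMBOS
}

merged_const = merge(post_by_combo, constant_weights(COMBOS))
merged_entropy = merge(post_by_combo, entropy_weights(post_by_combo))
print(merged_const, merged_entropy)
```

Because the weights sum to one and each per-combination posterior sums to one, the merged vector is itself a valid posterior for either choice of p(S).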
Topic modeling
(Dan Gildea & Thomas Hofmann)
• Bayesian model:
- p(word | doc) = ∑_t p(word | topic t) p(topic t | doc)
- EM modeling of p(word | topic) & p(topic | doc) over training set
- p(topic | doc) estimated from context in recognition
• Use to modify language model weights
- p(word) ∝ p_tri(word) · p_top(word) / p_uni(word)
- WSJ: trigram perplexity of 109 reduced 17%
- use for BN recognition?
• Use for topic segmentation?
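The two formulas on this slide can be illustrated with a small sketch: first the topic mixture p(word | doc), then the rescaling of a trigram probability by the ratio of topic to unigram probability, renormalized over the vocabulary. All probability tables below are invented toy values, the function names are hypothetical, and the EM training step is not shown.

```python
def topic_word_probs(p_word_given_topic, p_topic_given_doc):
    """p(word | doc) = sum over topics t of p(word | t) * p(t | doc)."""
    vocab = next(iter(p_word_given_topic.values())).keys()
    return {
        w: sum(p_word_given_topic[t][w] * p_topic_given_doc[t]
               for t in p_topic_given_doc)
        for w in vocab
    }

def rescale_language_model(p_tri, p_top, p_uni):
    """p(word) proportional to p_tri * p_top / p_uni, renormalized over the vocabulary."""
    unnorm = {w: p_tri[w] * p_top[w] / p_uni[w] for w in p_tri}
    total = sum(unnorm.values())
    return {w: v / total for w, v in unnorm.items()}

# Toy tables (illustrative values only).
p_word_given_topic = {
    "finance": {"stock": 0.5, "goal": 0.1, "the": 0.4},
    "sport":   {"stock": 0.1, "goal": 0.5, "the": 0.4},
}
p_topic_given_doc = {"finance": 0.9, "sport": 0.1}  # document is mostly finance
p_uni = {"stock": 0.3, "goal": 0.3, "the": 0.4}     # topic-independent unigram
p_tri = {"stock": 0.2, "goal": 0.2, "the": 0.6}     # trigram probs in some context

p_top = topic_word_probs(p_word_given_topic, p_topic_given_doc)
adapted = rescale_language_model(p_tri, p_top, p_uni)
print(p_top, adapted)
```

In this toy case the on-topic word "stock" gains probability relative to the plain trigram, the off-topic word "goal" loses probability, and the topic-neutral word "the" is left essentially unchanged.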
Acoustic Segment Classification
(Gethin Williams (SU) & Dan Ellis)
• Features from posteriors show utterance type:
- average per-frame entropy
- ‘dynamism’ - mean squared 1st-order difference
- average energy of ‘silence’ label
- covariance matrix distance to clean speech
[Figure: posterior displays for Speech, Speech+Music, and Music segments (phone index vs. 16 ms frames), and a segment feature scatter plot (entropy vs. dynamism)]
• 100% on Scheirer/Slaney speech-music test set
• Use for acoustic segmentation?
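The first two features in the list, average per-frame entropy and dynamism, can be sketched directly from a sequence of posterior vectors. This is a minimal sketch under assumptions: the squared difference is summed over phone classes before averaging over frames (the slide does not specify the normalization), and the two toy segments are invented to look speech-like (sharp, changing posteriors) and music-like (flat, static posteriors).

```python
import math

def mean_entropy(posteriors):
    """Average per-frame entropy (nats) of a sequence of posterior vectors."""
    per_frame = [
        -sum(p * math.log(p) for p in frame if p > 0) for frame in posteriors
    ]
    return sum(per_frame) / len(per_frame)

def dynamism(posteriors):
    """Mean squared first-order difference between consecutive frames.

    Squared differences are summed over classes, then averaged over
    frame pairs (an assumed normalization).
    """
    diffs = [
        sum((cur_p - prev_p) ** 2 for prev_p, cur_p in zip(prev, cur))
        for prev, cur in zip(posteriors, posteriors[1:])
    ]
    return sum(diffs) / len(diffs)

# Toy segments: 3 frames, 3 phone classes each (illustrative values only).
speech_like = [[0.90, 0.05, 0.05], [0.05, 0.90, 0.05], [0.90, 0.05, 0.05]]
music_like = [[0.40, 0.30, 0.30]] * 3

for name, seg in (("speech-like", speech_like), ("music-like", music_like)):
    print(name, round(mean_entropy(seg), 3), round(dynamism(seg), 3))
```

A segment classifier would then operate on the (entropy, dynamism) pair for each segment, along with the other two features listed on the slide.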
Thisl demo development
• Stand-alone Tcl/Tk implementation
- doesn’t require httpd
- speech-input ready