An overview of Speech Recognition research at ICSI

Dan Ellis
International Computer Science Institute, Berkeley CA
<dpwe@icsi.berkeley.edu>
Outline
1 About ICSI
2 Hybrid connectionist-HMM speech recognition
3 Overview of current projects
4 Some details: Combinations, News retrieval, SpeechCorder
5 Conclusions
ASR for TICSP - Dan Ellis - 1999sep02
1 About ICSI
http://www.icsi.berkeley.edu/
• Founded 1988 as a ‘portal’ between the U.S. and European academic systems
- attached to UC Berkeley (but independent)
- about 100 people at any time
• Infrastructure funding from European government/industry
- .. in return for hosting visitors
- Germany, Italy, Spain, Switzerland, Netherlands
• Research groups consist of staff, visitors and UC Berkeley graduate students
• Original vision: ‘massively parallel systems’
- has diversified since then
Groups at ICSI
• “Realization” (Real Speech Collective?)
- Nelson Morgan, Steve Greenberg, Dan Ellis
- Speech recognition for realistic conditions
- Systems support for ASR applications
• ACIRI (AT&T Center for Internet Research at ICSI)
- Network routing, traffic, security
• Theory
- mathematical complexity, coding theory
• Applications
- natural language understanding
- connectionist models of cognition
• Networks
- multimedia communications applications
ICSI Realization highlights
• Connectionist framework for ASR (Morgan & Bourlard)
• RASTA front-end processing (Hermansky & Morgan)
• The Ring Array Processor (RAP)
• SPERT/T0
- SBus boards with a custom 8-way vector microprocessor
Outline
1 About ICSI
2 Hybrid connectionist-HMM speech recognition
- visualizing speech recognition
- building a recognizer
- some issues in ASR
3 Overview of current projects
4 Some details
5 Conclusions
2 The Hybrid Connectionist-HMM system
• Conventional ASR: symbols S, observations X

  S* = argmax_S P(S | X)
     = argmax_S P(S, X) / P(X)
     = argmax_S ∏_i P(X_i | S_i) · P(S_i | S_i-1)

• P(X_i | S_i) is the acoustic likelihood model, e.g. a GMM
• The connectionist approach replaces it with a posterior, P(S_i | X_i), estimated by a neural network acoustic model
[Figure: hybrid system block diagram — input sound → feature calculation → neural network estimating posterior probabilities for context-independent phone classes (h#, pcl, bcl, tcl, dcl, ...) → HMM decoder, using word pronunciation models and a grammar, producing word sequence hypotheses]
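Before decoding, the network's posteriors are typically turned back into scaled likelihoods by dividing by the class priors, since P(q|x)/P(q) = P(x|q)/P(x). A minimal numpy sketch — the phone classes, counts and probability values here are made up for illustration:

```python
import numpy as np

def scaled_likelihoods(posteriors, priors):
    """Convert per-frame phone posteriors P(q|x) into scaled
    likelihoods P(x|q)/P(x) = P(q|x)/P(q) for the HMM decoder."""
    return posteriors / priors

# Illustrative values: 3 frames x 4 phone classes.
post = np.array([[0.7, 0.1, 0.1, 0.1],
                 [0.2, 0.5, 0.2, 0.1],
                 [0.1, 0.1, 0.2, 0.6]])
priors = np.array([0.4, 0.3, 0.2, 0.1])  # relative class frequencies
lik = scaled_likelihoods(post, priors)
```

The division by P(q) is what lets a discriminatively-trained posterior estimator plug into a likelihood-based HMM decoder.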
Visualizing speech recognition
• Speech as a sequence of discrete symbols qi
[Figure: recognition pipeline — sound → front end → feature vectors → acoustic models / phone classifier → label probabilities → word models + grammar → HMM decoder → phone & word labelling]
Building a recognizer
• Define pronunciation models
- application vocabulary
- standard dictionaries + phonetic rules?
• Build language model
- P(Wi | Wi-1, ...)
- count n-grams in example texts?
• Train acoustic model
- choose features to suit conditions
- train neural net on large labeled corpus
- relabel & retrain?
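The "count n-grams in example texts" step can be sketched as a maximum-likelihood bigram model. Toy sentences, no smoothing — a real system would smooth and back off:

```python
from collections import Counter

def bigram_lm(sentences):
    """Estimate P(w_i | w_{i-1}) by counting bigrams, with
    sentence-start/end markers added to each utterance."""
    uni, bi = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        uni.update(words[:-1])                    # history counts
        bi.update(zip(words[:-1], words[1:]))     # bigram counts
    return {pair: n / uni[pair[0]] for pair, n in bi.items()}

lm = bigram_lm(["the dow jones", "the dow fell"])
# e.g. P(dow | the) = 1.0, P(jones | dow) = 0.5
```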
How much training data?
• The bulk of recent improvements derives from larger training sets
[Figure: WER for PLP12N nets vs. net size & training data — WER falls from ~44% toward ~32% as the training set grows from 9.25 to 74 hours and the hidden layer from 500 to 4000 units]
• Largest model: 2.5M parameters, 32M patterns = 24 days to train on TetraSPERT, ~10^15 ops
Some issues in ASR
• ‘Spectrogram reading’ paradigm
- short-time features, concatenative models
• Goal: classifier accuracy / Word Error Rate (WER)
- objective measures, but quite opaque
- normalization vs. generalization
• HMM requires search over all word sequences
- dominates processing time in large-vocabulary systems
• Best solutions (e.g. features) depend on the task
- RASTA plus delta-features good for small vocabularies
- plain normalized PLP best for Broadcast News
- modulation spectrum features best in combination...
• Key challenge = robustness
- to task, acoustic conditions, speaking rate, style, accent ...
Outline
1 About ICSI
2 Hybrid connectionist-HMM speech recognition
3 Overview of current projects
- Front-end features
- Pronunciation and language modeling
- Modeling alternatives
- Other topics
4 Some details
5 Conclusions
The modulation-filtered spectrogram (Brian Kingsbury)
• Goal: invariance to variable acoustics
- filter out irrelevant modulations
- channel adaptation (on-line automatic gain control)
- multiple representations
• Bark-scale power-spectral filterbank, followed by envelope filtering:
- lowpass 0-16 Hz → AGC (τ = 160 ms, τ = 320 ms) → lowpass features
- bandpass 2-16 Hz → AGC (τ = 160 ms, τ = 640 ms) → bandpass features
[Figure: comparison of PLP log-spectral features vs. msg1a features (freq/Bark vs. time/s)]
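The lowpass-envelope + AGC path can be caricatured with one-pole filters. This is only a sketch of the idea, not Kingsbury's actual filter design — the time constants and test signal below are illustrative:

```python
import numpy as np

def one_pole_lowpass(x, alpha):
    """Simple one-pole smoother standing in for the modulation
    lowpass filter (larger alpha = slower, lower cutoff)."""
    y = np.empty_like(x, dtype=float)
    acc = 0.0
    for i, v in enumerate(x):
        acc = alpha * acc + (1.0 - alpha) * v
        y[i] = acc
    return y

def agc(env, alpha=0.99, eps=1e-6):
    """Online automatic gain control: divide the envelope by a
    slowly-tracked running average of its own level, so a constant
    channel gain cancels out."""
    gain = one_pole_lowpass(env, alpha)
    return env / (gain + eps)

# Illustrative subband envelope: a step change in channel gain.
env = np.concatenate([np.ones(200), 4.0 * np.ones(800)])
out = agc(one_pole_lowpass(env, 0.9))
# After the AGC adapts, the output level is (near) channel-independent.
```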
Data-driven feature design (Mike Shire)
• ‘Optimal’ features for different conditions
- subband envelope domain
- linear-discriminant analysis (LDA) for filter coeffs
- separation of labeled classes is optimized
• Modulation-frequency domain responses for clean, reverb, mixture:
[Figure: 1st and 2nd discriminant filter responses (0 to -30 dB vs. modulation frequency, ~1-10 Hz) for Clean, Light Reverb, Severe Reverb and Clean+Severe Reverb conditions]
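The LDA step can be sketched generically: find directions maximizing between-class scatter relative to within-class scatter. This is standard two-class LDA on synthetic data, not Shire's envelope data or code:

```python
import numpy as np

def lda_directions(X, y):
    """Return LDA discriminant directions: eigenvectors of
    inv(Sw) @ Sb, ordered by decreasing eigenvalue."""
    mean = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))   # within-class scatter
    Sb = np.zeros_like(Sw)                    # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        d = (mc - mean)[:, None]
        Sb += len(Xc) * (d @ d.T)
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-vals.real)
    return vecs.real[:, order]

# Toy stand-in for windows of subband envelopes, two classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 5)),
               rng.normal(2, 1, (100, 5))])
y = np.array([0] * 100 + [1] * 100)
W = lda_directions(X, y)      # columns play the role of the 'filters'
proj = X @ W[:, 0]            # projection onto the 1st discriminant
```

Applied to windows of envelope trajectories, the leading discriminant directions act as modulation-domain filters like those plotted above.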
Automatic pronunciation extraction (Eric Fosler)
• Canonical pronunciations are too limited
• Phonetic rules overproduce (→ homonyms)
• Filter candidates against acoustics:
[Figure: word strings (e.g. “The Dow Jones”) expanded via a dictionary (the=dh ax, dow=d aw, jones=jh...), phone trees or letter-to-phone trees → FSG building → finite-state grammar → alignment against the acoustics with an FSG decoder → acoustic confidences (e.g. dh=0.24, ax=0.4, dcl=0.81...)]
• Decision trees for canonical→real mapping
Topic modeling (Latent Semantic Analysis)
(Dan Gildea & Thomas Hofmann)
• Bayesian model:
- p(word | doc) = ∑_topic p(word | topic) · p(topic | doc)
- EM modeling of p(word | topic) & p(topic | doc) over the training set
- p(topic | doc) estimated from context during recognition
• Use to modify language model weights
- p(word) ∝ p_tri(word) · p_top(word) / p_uni(word)
- Trigram language model perplexity of 109 reduced by 17%
• Use for topic segmentation?
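The EM updates for this aspect model can be sketched in a few lines of numpy. The count matrix below is a toy example, not the actual corpus:

```python
import numpy as np

def plsa_em(counts, n_topics, iters=50, seed=0):
    """EM for p(word|topic) and p(topic|doc) in the aspect model:
    p(word|doc) = sum_t p(word|topic) p(topic|doc)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_w_t = rng.random((n_topics, n_words))
    p_w_t /= p_w_t.sum(axis=1, keepdims=True)
    p_t_d = rng.random((n_docs, n_topics))
    p_t_d /= p_t_d.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # E-step: responsibilities p(topic | doc, word)
        joint = p_t_d[:, :, None] * p_w_t[None, :, :]      # (d, t, w)
        resp = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate both distributions from expected counts
        exp_counts = counts[:, None, :] * resp             # (d, t, w)
        p_w_t = exp_counts.sum(axis=0)
        p_w_t /= p_w_t.sum(axis=1, keepdims=True) + 1e-12
        p_t_d = exp_counts.sum(axis=2)
        p_t_d /= p_t_d.sum(axis=1, keepdims=True) + 1e-12
    return p_w_t, p_t_d

# Toy doc-by-word counts with two obvious topics.
counts = np.array([[5, 4, 0, 0],
                   [4, 6, 0, 0],
                   [0, 0, 5, 5],
                   [0, 0, 6, 4]], dtype=float)
p_w_t, p_t_d = plsa_em(counts, n_topics=2)
```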
Multiband systems (Adam Janin / Nikki Mirghafori)
• Separate recognizers look at different bands
- Fletcher/Allen model of human speech recognition
- noise/corruption in one channel is limited
- how to combine results?
[Figure: speech signal → front end → per-band power → per-band probability estimators → MLP merger → Viterbi decoder → recognized words]
• Weighted average of all possible combos
- p(S | a,b,c,d) = ∑_B p(S | B, a,b,c,d) · p(B), where B ranges over the 16 possible band combinations
- p(B) from? constant, local feature (entropy)
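The full-combination rule can be sketched assuming each band subset B has its own (hypothetical) posterior estimator; with 4 bands there are 2^4 = 16 subsets. The subset posteriors below are invented for illustration:

```python
import numpy as np
from itertools import chain, combinations

def all_subsets(n_bands):
    """All 2^n combinations of band indices (including the empty set)."""
    bands = range(n_bands)
    return list(chain.from_iterable(
        combinations(bands, k) for k in range(n_bands + 1)))

def full_combination(subset_posteriors, subset_weights=None):
    """p(S | bands) = sum_B p(S | B, bands) p(B), summing over every
    band subset B. subset_posteriors maps each subset to the
    class-posterior vector of a (hypothetical) subset recognizer."""
    subsets = list(subset_posteriors)
    if subset_weights is None:                 # uniform p(B)
        subset_weights = {B: 1.0 / len(subsets) for B in subsets}
    combined = sum(subset_weights[B] * np.asarray(subset_posteriors[B])
                   for B in subsets)
    return combined / combined.sum()

# Toy example: 2 bands -> 4 subsets, 3 phone classes.
posts = {(): np.array([1/3, 1/3, 1/3]),        # no bands: uninformative
         (0,): np.array([0.6, 0.3, 0.1]),
         (1,): np.array([0.5, 0.4, 0.1]),
         (0, 1): np.array([0.8, 0.1, 0.1])}
p = full_combination(posts)
```

Replacing the uniform p(B) with weights driven by a local feature such as per-band entropy gives the data-dependent variant mentioned above.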
Buried Markov Models (Jeff Bilmes)
• Increasing the scope of the classifier input
• Add state-dependent sparse links:
[Figure: dynamic Bayesian network — hidden state chain Q(t-1) .. Q(t+2) over observation streams X and Y, with added sparse state-dependent links between observations]
• How to add links?
- maximum conditional mutual information
- greedy algorithm
• How to model?
- linear dependence as first attempt
Connectionist speaker recognition (Dominique Genoud)
• Use neural networks to model speakers rather than phones?
• Specialize a phone classifier for a particular speaker?
• Do both at once with a “twin-output MLP”:
[Figure: current feature vector plus context (t0-4 .. t0+4, 9 x 12 features = 108 inputs) into a 108:2000:107 MLP whose outputs are 53 ‘speaker’ phones, 53 ‘world’ phones, and non-speech]
Acoustic Segment Classification (Gethin Williams)
• Features from posteriors show utterance type:
- average per-frame entropy
- ‘dynamism’: mean squared 1st-order difference
- average energy of ‘silence’ label
- covariance matrix distance to clean speech
[Figure: dynamism vs. entropy for 2.5 second segments of speech/music — speech, music and speech+music segments form separable clusters; inset spectrogram and posteriors for a speech/music example]
• 1.3% error on 2.5 second speech-music testset
• Use for finding segment boundaries?
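The first two posterior-derived features are easy to sketch from a frame-by-class posterior matrix. The posteriors below are toy values chosen to show the contrast:

```python
import numpy as np

def per_frame_entropy(post):
    """Average per-frame entropy (bits) of posteriors (frames x classes).
    Low for confident, speech-like frames; high for music/noise."""
    p = np.clip(post, 1e-12, 1.0)
    return float(np.mean(-(p * np.log2(p)).sum(axis=1)))

def dynamism(post):
    """Mean squared first-order difference of the posteriors over time."""
    return float(np.mean(np.diff(post, axis=0) ** 2))

# 'Speechy' posteriors: confident and fast-changing.
speechy = np.array([[0.9, 0.05, 0.05],
                    [0.05, 0.9, 0.05],
                    [0.05, 0.05, 0.9]])
# Uninformative, static posteriors (music/noise-like).
flat = np.full((3, 3), 1/3)
```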
Perceptual experiments (Steven Greenberg et al.)
• Use spectrally-sparse speech:
[Figure: spectrogram and waveform of original speech (0-4 kHz) vs. four narrow spectral ‘slits’; band edges include 298, 375, 750, 945, 1890, 2381, 4762 and 6000 Hz]
• Effect of temporal alignment between bands:
[Figure: intelligibility (WER%) for synchronized slits (88.3%) vs. de-synchronized slits — intelligibility degrades steadily as inter-band delays grow through 25, 50 and 75 ms]
Outline
1 About ICSI
2 Hybrid connectionist-HMM speech recognition
3 Overview of current projects
4 Some details:
- Information stream combination
- Broadcast News spoken-document retrieval
- SpeechCorder PDA
5 Conclusions
4.1 Information stream combinations
• Task: AURORA noisy digits
- continuous digits
- test: 4 noise types x 7 SNR levels
- train: mixed clean/noisy data
• Feature design evaluation
- intermediate representation for mobile phones
- evaluation specifies GMM-HMM (HTK) system
• Baseline results:

  Feature   System   WER ratio
  mfcc      HTK      100.0%
  plp       Hybrid    89.6%
  msg       Hybrid    87.1%
  msg       HTK      205.0%
  msg KG    HTK      184.5%

• Can we combine features advantageously?
Combination schemes
[Figure: three architectures — (1) feature combination: feature 1 and feature 2 calculations merged into one speech feature stream → acoustic classifier → HMM decoder; (2) posterior combination: a separate acoustic classifier per feature stream, phone probabilities combined before a single HMM decoder; (3) hypothesis combination: two complete recognizers, word hypotheses combined into final hypotheses]
• Simple probability combination works best:

  P(qi | X1, X2) = P(qi | X1) · P(qi | X2) / P(qi)   ... if X1 ⊥ X2 | q
Posterior multiplication

  Features    System          WER ratio
  plp + msg   Feature combo   74.1%
  plp + msg   Prob. combo     63.0%
  plp + msg   HTK on probs.   51.6%
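The winning frame-level combination follows directly from the formula P(q|X1,X2) ∝ P(q|X1)·P(q|X2)/P(q) under the conditional-independence assumption. A sketch with invented stream posteriors and priors:

```python
import numpy as np

def combine_posteriors(post1, post2, priors):
    """P(q|X1,X2) ∝ P(q|X1) P(q|X2) / P(q), assuming the two feature
    streams are conditionally independent given the phone class,
    then renormalized to sum to one."""
    combo = post1 * post2 / priors
    return combo / combo.sum(axis=-1, keepdims=True)

plp_post = np.array([0.6, 0.3, 0.1])   # e.g. PLP-stream posteriors
msg_post = np.array([0.5, 0.2, 0.3])   # e.g. MSG-stream posteriors
priors = np.array([0.5, 0.3, 0.2])     # class priors
p = combine_posteriors(plp_post, msg_post, priors)
```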
4.2 Spoken document retrieval
• Based on DARPA/NIST Broadcast News
• Training material recorded off-air
- ABC, CNN, CSPAN, NPR
- 200 hour training set (TREC: 550 hour archive)
- training: word transcriptions + speaker time boundaries
• Best WER results:
- 1996: HTK: 27%
- 1997: HTK: 16% (but easier; 22% on 1996 eval)
- 1998: LIMSI: 14% (SPRACH: 21%)
• Some clear conclusions
- one classifier for all conditions (or male/female)
- feature adaptation (VTLN, MLLR, SAT)
- importance of segmentation
- more data is useful
Applications for BN systems
• Live transcription
- subtitles
- transcripts
- but: more than words?
• Video editing
- precision word-time alignments
- commercial systems by IBM, Virage, etc.
• Information Retrieval (IR)
- TREC/MUC ‘spoken documents’
- tolerant of word error rate, e.g.:

  THE VERY EARLY RETURNS OF THE NICARAGUAN PRESIDENTIAL ELECTION
  SEEMED TO FADE BEFORE THE LOCAL MAYOR ON A LOT OF LAW
  AT THIS STAGE OF THE ACCOUNTING FOR SEVENTY SCOTCH ONE LEADER
  DANIEL ORTEGA IS IN SECOND PLACE THERE WERE TWENTY THREE
  PRESIDENTIAL CANDIDATES OF THE ELECTION

  THE LABOR MIGHT DO WELL TO REMEMBER THE LOST A MAJOR EPISODE OF
  TRANSATLANTIC CONNECT TO A CORPORATION IN BOTH CONSERVATIVE PARTY
  OFFICIALS FROM BRITAIN GOING TO WASHINGTON THEY WENT TO WOOD BUYS
  GEORGE BUSH ON HOW TO WIN A SECOND TO NONE IN LONDON THIS IS
  STEPHEN BEARD FOR MARKETPLACE
Thematic Indexing of Spoken Language (Thisl)
• EC collaboration, BBC providing data
• 1000+ hr archive data
• IR is key factor
- stop lists
- weighting schemes
- query expansion
[Figure: system diagram — a query is matched by the IR engine against segmented ASR transcripts of the archive database (audio/video/text, served over http), with NLP and receiver components controlling retrieval]
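The IR components listed above — a stop list plus term weighting — can be sketched as tf-idf scoring. This is a generic sketch, not the Thisl engine's actual weighting scheme; the stop list and documents are illustrative:

```python
import math
from collections import Counter

STOP = {"the", "a", "of", "in"}   # illustrative stop list

def tfidf_scores(query, docs):
    """Score each document against the query with tf-idf weighting,
    dropping stop words from both."""
    n = len(docs)
    tokenized = [[w for w in d.lower().split() if w not in STOP]
                 for d in docs]
    # Document frequency of each term.
    df = Counter(w for toks in tokenized for w in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        s = sum(tf[w] * math.log(n / df[w])
                for w in query.lower().split()
                if w not in STOP and w in df)
        scores.append(s)
    return scores

docs = ["the dow jones index fell", "weather in london", "dow jones rally"]
scores = tfidf_scores("dow jones", docs)
```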
ThislGui
• Tcl/Tk front-end to the Thisl IR engine
• Spoken query input: SPRACHdemo/AbbotDemo
• NLP integration: Prolog lattice parser
4.3 SpeechCorder
• Convergence of interesting problems:
- ubiquitous PDAs
- multimedia processing
- very fast, low-power CPU design
- resource-bound speech recognition
• ICSI / UC Berkeley / MIT collaboration
- ICSI: speech & audio processing
- UCB: user interface design
- MIT: new low-power CPUs
• Current proposal
- PDA
- meeting / memo recorder
- .. also for sociological study? (new Human Centered Computing consortium)
The SpeechCorder GUI
• Live annotation of recognized speech
• Application issues:
- correcting errorful transcriptions
- finding places in the recording
- annotations (speaker, notes, emphasis)
SpeechCorder: Audio
• Speech recognition
- not close-mic’d
- speaker ID
- low-power / low-memory / vectorized
- non-local processing for ‘best’ transcript?
• Other audio issues
- identifying speech versus nonspeech
- finding / indexing nonspeech events...
Outline
1 About ICSI
2 Hybrid connectionist-HMM speech recognition
3 Overview of current projects
4 Some details
5 Conclusions
- more criticisms of ASR
- future work
5 Conclusions
• The downside of objective evaluation
- research priority has been the pragmatic goal: reduce WER
- human speech recognition uses many constraints
- grammatical/semantic constraints implicit in word-sequence statistics (grammar)
- automatic analysis of large corpora is possible & helpful
• The problems with a grammar
- unexpected (unseen) phrases are discounted
- highly brittle alternatives
- masks underlying performance
• A more scientific approach
- first work on the underlying phoneme classifier
- follow nonsense syllable performance (Fletcher)
The signal model in speech recognition
• Systems & approach have been optimized for the speech-alone situation
- minimize classifier parameters, maximize use of ‘feature space’
- e.g. cepstra [example]
• Possibly non-lexical data thrown away
- pitch
- timing/rhythm
- speaker identification
• Dire consequences
- .. dealing with nonspeech sounds
- .. distinguishing success & failure
• Popular focus of research
- e.g. segmental models, pitch features
- fail to obtain robust improvements
Future work
• Continue improving robustness
- better features
- better pronunciations
- better modeling
• Still looking for a good architecture
- multiband
- multistream
- more adaptation
- more contextual dependence
• Speech recognition: useful for applications
- archive indexing, summarizing
- personal devices, new interfaces
- tie-in to general audio analysis...