Speech interfaces: A survey and some current projects

advertisement
Speech interfaces:
A survey and some current projects
Dan Ellis & Nelson Morgan
International Computer Science Institute
Berkeley CA
{dpwe,morgan}@icsi.berkeley.edu
Outline
1
Speech recognition: the state of the art
2
Current projects at ICSI
3
Conclusions
HCC - Speech Interfaces - Dan Ellis
2000-02-24 - 1
Speech recognition
1
sound
Feature
calculation
feature vectors
Acoustic model
parameters
Acoustic
classifier
Word models
s
ah
t
Language model
p("sat"|"the","cat")
p("saw"|"the","cat")
•
phone probabilities
HMM
decoder
phone / word
sequence
Understanding/
application...
Elements of a recognizer:
- feature design
- acoustic modeling
- pronunciation/language modeling
HCC - Speech Interfaces - Dan Ellis
}
data!
2000-02-24 - 2
How good is speech recognition?
•
F0:
F4:
Standard measure is word error rate (WER):
- dictation (close-mic): 2-5%
- broadcast news: ~15%
- telephone conversations: ~30%
THE VERY EARLY RETURNS OF THE NICARAGUAN PRESIDENTIAL ELECTION
SEEMED TO FADE BEFORE THE LOCAL MAYOR ON A LOT OF LAW
AT THIS STAGE OF THE ACCOUNTING FOR SEVENTY SCOTCH ONE LEADER
DANIEL ORTEGA IS IN SECOND PLACE THERE WERE TWENTY THREE
PRESIDENTIAL CANDIDATES OF THE ELECTION
•
What are the problems?
- acoustic variability (noise, channel)
- speech variability (accent, manner)
- exploiting linguistic constraints
- speech understanding...
HCC - Speech Interfaces - Dan Ellis
2000-02-24 - 3
Frontiers of speech recognition
•
Acoustic modeling
- beyond head-mounted mics
- background noise (mobile phones)
- speech in mixtures (broadcast)
→ robust feature design, better statistical models
•
Speaking styles
- coarticulation
- pronunciation variability
- speaking styles
→ lump into acoustic model, more training data,
better pron. models, context-dep. models
•
Linguistic constraints
- ‘inferred’ words
- ambiguity
→ higher-order n-grams (more training data),
tree grammars
HCC - Speech Interfaces - Dan Ellis
2000-02-24 - 4
Applications of speech recognition
•
Command & control
- more or less constrained
•
Dictation
- large vocabulary
- known, co-operative user
•
Voice response systems
- dialog & speech understanding
- robustness!
- human factors (timing, barge-in etc.)
•
Information extraction & retrieval
- multimedia archive retrieval
- live ‘listener’
HCC - Speech Interfaces - Dan Ellis
2000-02-24 - 5
Outline
1
Speech recognition: the state of the art
2
Current projects at ICSI
- Recognizer confidence measures
- Combining information sources
- The meeting recorder
- Audio content-based retrieval
3
Conclusions
HCC - Speech Interfaces - Dan Ellis
2000-02-24 - 6
Recognizer confidence measures
(Warner Warren, Andy Hatch, Eric Fosler + SRI)
•
Knowing which words are wrong can help
- hard to tell because recognition only just works
•
Average per-phone entropy + re-estimation:
DET plot for word-level confidence estimation (AURORA)
40
Miss probability (in %)
20
10
5
2
1
0.5
raw posteriors
fwd-bwd posteriors
0.2
0.1
0.1 0.2
•
0.5
1
2
5
10
False Alarm probability (in %)
20
40
Use for combining recognizer outputs
HCC - Speech Interfaces - Dan Ellis
2000-02-24 - 7
Combination schemes
(Mike Shire, Barry Chen + Michael Jordan)
Feature 1
calculation
Feature
combination
Feature 2
calculation
HMM
decoder
Acoustic
classifier
Speech
features
Word
hypotheses
Phone
probabilities
“^”
Input
sound
Feature 1
calculation
Acoustic
classifier
HMM
decoder
Posterior
combination
Feature 2
calculation
Input
sound
Acoustic
classifier
Word
hypotheses
Phone
probabilities
“•”
Speech
features
•
How best to combine different feature streams?
Features
Avg. WER
plp
8.9%
msg
9.5%
plp ^ msg
8.1%
plp • msg
7.1%
HCC - Speech Interfaces - Dan Ellis
2000-02-24 - 8
Tandem acoustic modeling
(with Hermansky et al., OGI)
•
ICSI pioneered ‘hybrid-connectionist’ ASR;
Can it be combined with conventional models?
Feature
calculation
Input
sound
Neural
net model
(Posterior
decoder)
(Phone
probabilities)
Speech
features
Pre-nonlinearity
outputs
•
PCA
orthogn'n
(Hybrid system
output)
Subword
likelihoods
Othogonal
features
HTK
GM model
Tandem system
output
HTK
decoder
Result: better performance than either alone!
- neural net & Gaussian mixture models
extract different information from training data
System-features
Avg. WER
HTK-mfcc
13.7%
Hybrid-mfcc
9.3%
Tandem-mfcc
7.4%
Tandem-plp+msg
6.4%
HCC - Speech Interfaces - Dan Ellis
2000-02-24 - 9
Aurora “Distributed SR” evaluation
•
Organized by ETSI
(European Telecoms. Standards Institute)
Avg WER -20-0
Baseline reduction
ETSI Aurora 1999 Evaluation
70%
60%
50%
40%
30%
20%
10%
2
Ta
nd
em
S6
1
Ta
nd
em
S5
S4
S3
S2
S1
Ba
se
lin
e
0%
- Tandem systems from OGI-ICSI-Qualcomm
HCC - Speech Interfaces - Dan Ellis
2000-02-24 - 10
The meeting recorder project
(Adam Janin, Eric Fosler + UW, SRI, UPM, James Landay)
•
Idea: PDA records meetings
to replace / enhance note-taking
•
First task: Collect a training corpus
•
Related to DARPA Communicator, SmartKom
HCC - Speech Interfaces - Dan Ellis
2000-02-24 - 11
Meeting recorder: Research areas
•
Audio recognition
- recognition from noisy microphones
- speaker identification & tracking
- nonspeech events
•
Indexing application
- understanding the structure of meetings
- information retrieval
- user interface
HCC - Speech Interfaces - Dan Ellis
2000-02-24 - 12
Audio content-based retrieval
(with Sheffield, Cambridge, BBC, Avideh Zakhor)
•
Idea: speech recognition output
as indexes for broadcast news
- useful even with 15-30% WER
HCC - Speech Interfaces - Dan Ellis
2000-02-24 - 13
Audio-video organization & retrieval
•
Proposed project:
Boundaries &
structuring
Speech
processing
audio
Nonspeech
analysis
Audio-video
data
IR
engine
Indexing
terms
symbolic
analogic
A-V
match
video
Query
Video
features
Summarization
results
•
Synergy between audio & video features
•
Query by terms or by examples
•
Recovering temporal structuret
HCC - Speech Interfaces - Dan Ellis
2000-02-24 - 14
Conclusions
3
•
Speech recognition is now practical
- .. but still plenty of problems
•
Ongoing research in speech recognition
- recognition in demanding conditions
- understanding / discourse a big issue
•
Multimodal information retrieval
- forgiving & fertile research area
HCC - Speech Interfaces - Dan Ellis
2000-02-24 - 15
Download