Pushing the Envelope

“Pushing the Envelope”
A six-month report
By the Novel Approaches team,
With site leaders:
Nelson Morgan, ICSI
Hynek Hermansky, OGI
Dan Ellis, Columbia
Kemal Sönmez, SRI
Mari Ostendorf, UW
Hervé Bourlard, IDIAP/EPFL
George Doddington, NA-sayer
Overview
Nelson Morgan, ICSI
The Current Cast of Characters
• ICSI: Morgan, Q. Zhu, B. Chen, G. Doddington
• UW: M. Ostendorf, Ö. Çetin
• OGI: H. Hermansky, S. Sivadas, P. Jain
• Columbia: D. Ellis, M. Athineos
• SRI: K. Sönmez
• IDIAP: H. Bourlard, J. Ajmera, V. Tyagi
Rethinking Acoustic Processing for ASR
• Escape dependence on spectral envelope
• Use multiple front-ends across time/freq
• Modify statistical models to accommodate
new front-ends
• Design optimal combination schemes for
multiple models
Task 1: Pushing the Envelope (aside)

• Problem: Spectral envelope is a fragile information carrier

[Diagram. OLD: a single 10 ms estimate of sound identity. PROPOSED: many estimates (ith, kth, …, nth) drawn from spans of up to 1 s, combined by information fusion into one estimate of sound identity.]

• Solution: Probabilities from multiple time-frequency patches
Task 2: Beyond Frames…

• Problem: Features & models interact; new features may require different models

[Diagram. OLD: short-term features → conventional HMM. PROPOSED: advanced features → multi-rate, dynamic-scale classifier.]

• Solution: Advanced features require advanced models, free of the fixed-frame-rate paradigm
Today’s presentation
• Infrastructure: training, testing, software
• Initial Experiments: pilot studies
• Directions: where we’re headed
Infrastructure
Kemal Sönmez, SRI
(SRI/UW/ICSI effort)
Initial Experimental Paradigm
• Focus on a small task to facilitate exploratory
work (later move to CTS)
• Choose a task where LM is fixed & plays a
minor role (to focus on acoustics)
• Use mismatched train/test data:
 To avoid tuning to the task
 To facilitate later move to CTS
• Task: OGI Numbers / Training data: Switchboard + Macrophone
Hub5 “Short” Training Set

• Composition (total ~60 hours):

  Corpus         Male   Female   (hours)
  callhome        2.8    12.4
  switchboard*   13.8     4.3
  credit-card     5.9     7.1
  macrophone      6.7     5.8

  * subset of SWB-1 hand-checked at SRI for accuracy of transcriptions and segmentations

• WER 2-4% higher vs. full 250+ hour training
Reduced UW Training Set

• A reduced training set to shorten experiment turn-around time
• Choose training utterances with per-frame likelihood scores close to the training set average
• 1/4th of the original training set
• Statistics (gender, data set constituencies) are similar to those of the full training set:

                 macrophone   callhome   creditcard   other switchboard   male/female
  “short”           32%          32%        12%             24%              45/55%
  Reduced (UW)      38%          28%        12%             22%              48/52%

• For OGI Numbers, no significant WER sacrifice in the baseline HMM system (worse for Hub5).
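One way to realize the selection rule above (utterances whose per-frame likelihood lies close to the training-set average) is a simple rank-by-distance sketch; the scoring inputs here are hypothetical stand-ins, not the actual UW selection code:

```python
import numpy as np

def reduced_subset(utt_scores, fraction=0.25):
    """Keep the fraction of utterances whose (per-frame average)
    likelihood score lies closest to the training-set mean."""
    s = np.asarray(utt_scores)
    dist = np.abs(s - s.mean())
    n_keep = int(len(s) * fraction)
    return np.argsort(dist)[:n_keep]      # indices of kept utterances

np.random.seed(0)
scores = np.random.randn(80000)   # hypothetical per-utterance scores
keep = reduced_subset(scores)
print(len(keep))  # 20000 of 80000, i.e. 1/4th of the training set
```

Because the kept utterances are "typical" under the current models, class and gender statistics tend to track the full set, which matches the constituency table above.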
Development Test Sets

• A “Core-Subset” of OGI’s Numbers 95 corpora – telephone speech of people reciting addresses, telephone numbers, zip codes, or other miscellaneous items
• The “Core-Subset” or “CS” consists of utterances that were phonetically hand-transcribed, intelligible, and contained only numbers
• Vocabulary size: 32 words (digits + eleven, twelve… twenty… hundred… thousand, etc.)

  Data Set Name                    Total Utterances   Total Words   Duration (hours)
  Numbers95-CS Cross Validation         357               1353           ~0.2
  Numbers95-CS Development             1206               4673           ~0.6
  Numbers95-CS Test                    1227               4757           ~0.6
Statistical Modeling Tools
• HTK (Hidden Markov Toolkit) for establishing an HMM
baseline, debugging
• GMTK (Graphical Models Toolkit) for implementing
advanced models with multiple feature/state streams
 Allows direct dependencies across streams
 Not limited by single-rate, single-stream paradigm
 Rapid model specification/training/testing
• SRI Decipher system for providing lattices to rescore
(later in CTS expts)
• Neural network tools from ICSI for posterior probability
estimation, other statistical software from IDIAP
Baseline SRI Recognizer
for the numbers task
• Bottom-up state-clustered Gaussian mixture
HMMs for acoustic modeling
• Acoustic adaptation to speakers using affine mean and variance transforms [Not used for numbers]
• Vocal-tract length normalization using maximum likelihood estimation [Not helpful for numbers]
• Progressive search with lattice recognition and N-best rescoring [To be used in later work]
• Bigram LM
Initial Experiments
Barry Chen, ICSI
Hynek Hermansky, OHSU (OGI)
Özgür Çetin, UW
Goals of Initial Experiments
• Establish performance baselines
 HMM + standard features (MFCC, PLP)
 HMM + current best from ICSI/OGI
• Develop infrastructure for new models
 GMTK for multi-stream & multi-rate features
 Novel features based on large timespans
 Novel features based on temporal fine structure
• Provide fodder for future error analysis
ICSI Baseline experiments
• PLP based - SRI system
• “Tandem” PLP-based ANN + SRI system
• Initial combination approach
Development Baseline: Gender-Independent PLP System

  Training Set                                     Word, Sentence Error Rate on Numbers95-CS Test Set
  Full “Short” Hub5 (85k utterances, ~64.9 hrs)        3.4%, 10.2%
  UW Reduced Hub5 (20k utterances, ~18.8 hrs)          3.8%, 11.4%
Phonetically Trained Neural Net
• Multi-layer perceptron (input, hidden, and output layers)
• Trained using the error-backpropagation technique – outputs interpreted as posterior probabilities of target classes
• Training targets: 47 mono-phone targets from forced alignment using the SRI Eval 2002 system
• Training utterances: UW Reduced Hub5 set
• Training features: PLP12+e+d+dd, mean & variance normalized on a per-conversation-side basis
• MLP Topology:
 9 Frame Context Window (4 frames in past + current frame + 4
frames in future)
 351 Input Units, 1500 Hidden Units, and 47 Output Units
 Total Number of Parameters: ~600k
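The ~600k figure follows directly from the topology above; a quick check counting weights and biases:

```python
# parameter count for the 351-1500-47 MLP (weights + biases)
n_in, n_hid, n_out = 351, 1500, 47   # 9 frames x 39 features = 351 inputs

params = (n_in * n_hid + n_hid       # input->hidden weights and biases
          + n_hid * n_out + n_out)   # hidden->output weights and biases
print(params)  # 598547, i.e. the ~600k quoted on the slide
```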
Baseline ICSI Tandem
• Outputs of the neural net before the final softmax non-linearity are used as inputs to PCA
• PCA without dimensionality reduction
• 4.1% Word and 11.7% Sentence Error Rate on Numbers95-CS
test set
Baseline ICSI Tandem+PLP
• PLP Stream concatenated with neural net posteriors stream
• PCA reduces dimensionality of posteriors stream to 16 (keeping
95% of overall variance)
• 3.3% Word and 9.5% Sentence Error Rate on Numbers95-CS
test set
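As a rough illustration of the tandem pipeline above — pre-softmax MLP outputs reduced by PCA, then concatenated with the PLP stream — here is a minimal numpy sketch; the array shapes and SVD-based PCA are illustrative stand-ins, not the actual ICSI code:

```python
import numpy as np

def pca_project(X, n_components):
    """PCA via SVD of the mean-centered data; keeps the top components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

np.random.seed(0)
T = 200                                # illustrative number of frames
plp = np.random.randn(T, 39)           # PLP12 + e + d + dd
pre_softmax = np.random.randn(T, 47)   # MLP outputs BEFORE the softmax

# reduce the posterior stream to 16 dims, then concatenate with PLP
tandem = np.hstack([plp, pca_project(pre_softmax, 16)])
print(tandem.shape)  # (200, 55)
```

The combined 55-dimensional stream is what the Gaussian-mixture HMM back-end would then model.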
Word and String Error Rates on
Numbers95-CS Test Set
OGI Experiments:
New Features in EARS
• Develop on home-grown ASR system
(phoneme-based HTK)
• Pass the most promising to ICSI for
running in SRI LVCSR system
• So far
 new features match the performance of the
baseline PLP features but do not exceed it
 advantage seen in combination with the baseline
Looking to the human auditory system for design inspiration

• Psychophysics
 Components within a certain frequency range (several critical bands) interact [e.g., frequency masking]
 Components within a certain time span (a few hundreds of ms) interact [e.g., temporal masking]
• Physiology
 2-D (time-frequency) matched filters for activity in auditory cortex [cortical receptive fields]
TRAP-based HMM-NN hybrid ASR

[Diagram: mean- & variance-normalized, Hamming-windowed critical-band trajectories (101-point inputs) feed per-band multilayer perceptrons (MLPs); a merging MLP outputs posterior probabilities of phonemes, used in the search for the best match.]
Feature estimation from linearly transformed temporal patterns

[Diagram: critical-band temporal patterns pass through a linear transform and an MLP (the TANDEM path) before the HMM ASR stage; which transforms to use remains an open question (?).]
Preliminary TANDEM/TRAP results (OGI-HTK)

WER% on OGI Numbers, training on UW reduced training set, monophone models:

  BASELINE           4.5
  TANDEM             4.1
  TANDEM with TRAP   3.9
Features from more than one critical-band temporal trajectory

Studying KLT-derived basis functions, we observe:

[Diagram: the bases resemble a cosine transform along time, combined with an average and a derivative across frequency.]
UW Baseline Experiments

• Constructed an HTK-based HMM system that is competitive with the SRI system
• Replicated the HMM system in GMTK
• Move on to models which integrate information from multiple sources in a principled manner:
 Multiple feature streams (multi-stream models)
 Different time scales (multi-rate models)
• Focus on statistical models, not on feature extraction
HTK HMM Baseline
• An HTK-based standard HMM system:
• 3 state triphones with decision-tree clustering,
• Mixture of diagonal Gaussians as state output dists.,
• No adaptation, fixed LM.
• Dimensions explored:
• Front-end: PLP vs. MFCC, VTLN
• Gender dependent vs. independent modeling
• Conclusions:
• No significant performance differences
• Decided on PLPs, no VTLN, gender-independent models for
simplicity
HMM Baselines (cont.)

• Replicated HTK baseline with equivalent results in GMTK:

  tool    dev   test   (WER %)
  HTK     3.7   3.2
  GMTK    3.7   3.0

• To reduce experiment turn-around time, wanted to reduce the training set
• For HMMs and Numbers95, 3/4th of the training data can be safely ignored:

  Training set        dev   test   (WER %)
  Full “short”        3.7   3.2
  1/4th (“reduced”)   3.4   3.4
Multi-stream Models

• Information fusion from multiple streams of features
• Partially asynchronous state sequences

[Diagram: state topology and corresponding graphical model, with separate state sequences for feature streams X and Y and dependencies coupling the states across streams.]

  model                      WER %
  HMM (PLP)                   3.9
  multi-stream (PLP+MFCC)     4.2
Temporal envelope features (Columbia)

• Temporal fine structure is lost (deliberately) in STFT features:

[Figure: waveform and 10 ms-window spectrogram of utterance mpgr1-sx419, time 0.65-0.9 s, frequency 0-8000 Hz, level 0 to -60 dB.]

• Need a compact, parametric description...
Frequency-Domain Linear Prediction (FDLP)

• Extend LPC with an LP model of the spectrum:

  TD-LP:  y[n] = Σ_i a_i y[n−i]
  FD-LP:  Y[k] = Σ_i b_i Y[k−i],  where Y = DFT of y

• ‘Poles’ represent temporal peaks:

[Figure: mpgr1-sx419 waveform with FD-LPC envelope (60 poles / 300 ms), time 0.65-0.9 s.]

• Features ~ pole bandwidth, ‘frequency’
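A minimal numpy/scipy sketch of the FD-LP idea: fit the all-pole model to a real spectrum of the frame, so the poles track peaks of the temporal envelope. The transform choice (DCT), model order, and toy signal are illustrative assumptions, not the Columbia implementation:

```python
import numpy as np
from scipy.fft import dct
from scipy.linalg import solve_toeplitz

def fdlp_poles(x, order=20):
    """FD-LP: fit Y[k] ~ sum_i b_i Y[k-i] to the (real) spectrum of x;
    the resulting all-pole model describes the temporal envelope."""
    Y = dct(x, type=2, norm='ortho')                  # real spectrum
    r = np.correlate(Y, Y, mode='full')[len(Y) - 1:]  # autocorr of Y
    b = solve_toeplitz(r[:order], r[1:order + 1])     # Yule-Walker fit
    return np.roots(np.concatenate(([1.0], -b)))      # poles of 1/B(z)

# toy frame: two temporal bursts within ~300 ms at 8 kHz (illustrative)
np.random.seed(0)
fs = 8000
t = np.arange(int(0.3 * fs)) / fs
env = np.exp(-((t - 0.08) / 0.01) ** 2) + np.exp(-((t - 0.20) / 0.01) ** 2)
x = env * np.random.randn(len(t))

poles = fdlp_poles(x)
# feature candidates: |pole| ~ sharpness (cf. "bandwidth"),
# angle(pole) ~ temporal position within the frame (cf. "frequency")
print(len(poles))  # 20
```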
Preliminary FDLP Results

• Distribution of pole magnitudes for different phone classes (in 4 bands):

[Figure: histograms of −log(1−|pole|) for /ah/ vs. /p/ in the 0-500 Hz, 500-1000 Hz, 1-2 kHz, and 2-4 kHz bands.]

• NN classifier frame accuracies:

  plp12N          57.0%
  plp12N+FDLP4    58.2%
Directions
Dan Ellis, Columbia
(SRI/UW/Columbia work)
Nelson Morgan, ICSI
(OGI/IDIAP/ICSI work + summary)
Multi-rate Models (UW)

• Integrate acoustic information from different time scales
• Account for dependencies across scales
• Better robustness against time- and/or frequency-localized interferences
• Reduced redundancy gives better confidence estimates

[Diagram: cross-scale dependencies (example) – long-term features tied to a coarse state chain, short-term features tied to a fine state chain.]
SRI Directions

• Task 1: Signal-adaptive weighting of time-frequency patches
 Basis-entropy based representation
 Matching pursuit search for optimal weighting of patches
 Optimality based on minimum entropy criterion
• Task 2: Graphical models of patch combinations
 Tiling-driven dependency modeling
 GM combines across patch selections
 Optimality based on information in representation
Data-derived phonetic features (Columbia)

• Find a set of independent attributes to account for phonetic (lexical) distinctions
 phones replaced by feature streams
• Will require new pronunciation models
 asynchronous feature transitions (no phones)
 mapping from phonetics (for unseen words)

Joint work with Eric Fosler-Lussier
ICA for feature bases

• PCA finds decorrelated bases; ICA finds independent bases

[Figure: spectrogram of utterance test/dr1/faks0/sa2 (frequency in Bark) with phone labels (d ow n ae s m iy t ix k eh r iy ix n oy l iy r ae g l ay k dh ae tcl) and the learned basis vectors.]

• Lexically-sufficient ICA basis set?
OGI Directions: Targets in sub-bands

• Initially context-independent and band-specific phonemes
• Gradually shifted to band-specific 6 broad phonetic classes (stops, fricatives, nasals, vowels, silence, flaps)
• Moving towards band-independent speech classes (vocalic-like, fricative-like, plosive-like, ???)
More than one temporal pattern?

[Diagram: mean- & variance-normalized, Hamming-windowed critical-band trajectories (101-dim) pass through several KLTs (KLT1 … KLTn), each feeding an MLP.]
Pre-processing by 2-D operators with subsequent TRAP-TANDEM

• Convolution (*) of the time-frequency plane with 3×3 masks:

  differentiate f, average t:   [  1  2  1 ;  0  0  0 ; -1 -2 -1 ]
  differentiate t, average f:   [ -1  0  1 ; -2  0  2 ; -1  0  1 ]
  diff upwards, av downwards:   [  0  1  2 ; -1  0  1 ; -2 -1  0 ]
  diff downwards, av upwards:   [ -2 -1  0 ; -1  0  1 ;  0  1  2 ]
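The masks can be applied to a critical-band spectrogram by 2-D convolution before the TRAP-TANDEM stage; a minimal sketch, where the spectrogram sizes are illustrative:

```python
import numpy as np
from scipy.signal import convolve2d

# the four 3x3 masks from the slide (derivative along one axis,
# averaging along the other, plus the two diagonal variants)
masks = {
    "diff f / avg t":     np.array([[ 1,  2,  1], [ 0, 0, 0], [-1, -2, -1]]),
    "diff t / avg f":     np.array([[-1,  0,  1], [-2, 0, 2], [-1,  0,  1]]),
    "diff up / avg down": np.array([[ 0,  1,  2], [-1, 0, 1], [-2, -1,  0]]),
    "diff down / avg up": np.array([[-2, -1,  0], [-1, 0, 1], [ 0,  1,  2]]),
}

# toy critical-band spectrogram: 15 bands x 101 frames (illustrative)
np.random.seed(0)
spec = np.random.randn(15, 101)

# filtered planes would then feed the band-wise TRAP-TANDEM MLPs
filtered = {name: convolve2d(spec, m, mode="same") for name, m in masks.items()}
print({name: f.shape for name, f in filtered.items()})
```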
IDIAP Directions: Phase AutoCorrelation Features

• Traditional features are autocorrelation based – very sensitive to additive noise and other variations.
• Phase AutoCorrelation (PAC): if R_k, k = 0, 1, …, N−1, are the autocorrelation coefficients derived from a frame of length N, the PACs are

  P_k = cos⁻¹( R_k / R_0 ),  where R_0 = frame energy.
Entropy-Based Multi-Stream Combination

• Combination of evidence from more than one expert to improve performance
• Entropy as a measure of confidence
• Experts having low entropy are more reliable than experts having high entropy
• Inverse entropy weighting criterion
• Relationship between entropy of the resulting (recombined) classifier and recognition rate
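A minimal sketch of the inverse-entropy weighting criterion for two posterior streams; the smoothing constant and the toy posteriors are illustrative, not the IDIAP implementation:

```python
import numpy as np

def inverse_entropy_combine(streams, eps=1e-10):
    """Weight each expert's per-frame posteriors by 1/entropy:
    low-entropy (confident) experts dominate the combination."""
    streams = [np.asarray(p) for p in streams]
    w = np.stack([1.0 / (-np.sum(p * np.log(p + eps), axis=-1) + eps)
                  for p in streams])            # (n_streams, T)
    w /= w.sum(axis=0, keepdims=True)           # normalize per frame
    out = sum(wi[..., None] * p for wi, p in zip(w, streams))
    return out / out.sum(axis=-1, keepdims=True)

# two illustrative experts over 47 classes: one sharp, one nearly flat
sharp = np.full((1, 47), 0.1 / 46); sharp[0, 3] = 0.9
flat = np.full((1, 47), 1.0 / 47)
post = inverse_entropy_combine([sharp, flat])
print(post[0, 3])  # > 0.5: the confident expert gets most of the weight
```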
ICSI Directions:
Posterior Combination Framework
• Combination of Several Discriminative Probability Streams
Improvement of the Combo
Infrastructure
• Improve basic features:
 Add prosodic features: voicing level, energy continuity
 Improve PLP by further removing pitch differences among speakers
• Tandem
 Different targets, different training features, e.g. word boundary
• Improve TRAP (OGI)
• Combination
 Entropy based, accuracy based stream weighting or stream
selection.
New types of tandem features: Possible word/syllable boundary

[Diagram: input features → NN → target posteriors.]

Input feature:
• Traditional or improved PLP
• Spectral continuity
• Voicing, voicing continuity
• Formant continuity feature
• …more

Target posterior:
• Phonemes
• Word/syllable boundary
• Broad phoneme classes
• Manner / place / articulation… etc.
Data-Driven Subword Unit Generation (IDIAP/ICSI)

• Motivation:
 Phoneme-based units may not be optimal for ASR.
• Approach (based on speaker segmentation method):
 1. Initial segmentation: large number of clusters
 2. Is the thresholdless BIC-like merging criterion met?
   Yes → merge, re-segment, and re-estimate; return to 2
   No → stop
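A toy sketch of such a merging loop, using full-covariance Gaussians and a standard BIC penalty as the "thresholdless BIC-like" criterion; the exact criterion and models in the IDIAP/ICSI work may differ:

```python
import numpy as np

def log_det_cov(X):
    # log-determinant of the (regularized) sample covariance
    _, ld = np.linalg.slogdet(np.cov(X, rowvar=False)
                              + 1e-6 * np.eye(X.shape[1]))
    return ld

def delta_bic(X, Y, lam=1.0):
    """Negative value => merging X and Y is favored (one Gaussian models
    the union well enough to pay off the complexity penalty)."""
    Z = np.vstack([X, Y])
    n, d = Z.shape
    gain = 0.5 * (n * log_det_cov(Z)
                  - len(X) * log_det_cov(X) - len(Y) * log_det_cov(Y))
    penalty = lam * 0.5 * (d + d * (d + 1) / 2) * np.log(n)
    return gain - penalty

def agglomerate(clusters):
    # merge the best pair while some Delta-BIC is negative, else stop
    clusters = list(clusters)
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        scores = [delta_bic(clusters[i], clusters[j]) for i, j in pairs]
        k = int(np.argmin(scores))
        if scores[k] >= 0:           # criterion not met -> stop
            break
        i, j = pairs[k]
        merged = np.vstack([clusters[i], clusters[j]])
        clusters = [c for m, c in enumerate(clusters) if m not in (i, j)]
        clusters.append(merged)      # (re-segmentation omitted here)
    return clusters

np.random.seed(0)
a = np.random.randn(100, 2)          # one "subword class" (toy data)
b = np.random.randn(100, 2) + 8.0    # a well-separated second class
out = agglomerate([a[:50], a[50:], b[:50], b[50:]])
print(len(out))
```

The full approach would also re-segment and re-estimate after each merge, which this sketch omits.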
Summary
• Staff and tools in place to proceed with
core experiments
• Pilot experiments provided coherent
substrate for cooperation between 6 sites
• Future directions for individual sites are
all over the map, which is what we want
• Possible exploration of collaborations
w/MS in this meeting