Articulatory Feature-Based Speech Recognition
[Title-slide figure: dynamic Bayesian network over the variables word, ind1, U1, sync1,2, S1, ind2, U2, sync2,3, S2, ind3, U3, S3]
JHU WS06 Final team presentation
August 17, 2006
Project Participants
Team members:
Karen Livescu (MIT)
Özgür Çetin (ICSI)
Mark Hasegawa-Johnson (UIUC)
Simon King (Edinburgh)
Nash Borges (DoD, JHU)
Chris Bartels (UW)
Arthur Kantor (UIUC)
Partha Lal (Edinburgh)
Lisa Yung (JHU)
Ari Bezman (Dartmouth)
Stephen Dawson-Haggerty (Harvard)
Bronwyn Woods (Swarthmore)
Advisors/satellite members:
Jeff Bilmes (UW), Nancy Chen (MIT), Xuemin Chi (MIT), Trevor Darrell (MIT),
Edward Flemming (MIT), Eric Fosler-Lussier (OSU), Joe Frankel
(Edinburgh/ICSI), Jim Glass (MIT), Katrin Kirchhoff (UW), Lisa Lavoie
(Elizacorp, Emerson), Mathew Magimai (ICSI), Daryush Mehta (MIT), Kate
Saenko (MIT), Janet Slifka (MIT), Stefanie Shattuck-Hufnagel (MIT), Amar
Subramanya (UW)
Why are we here?
 Why articulatory feature-based ASR?
 Improved modeling of co-articulation
 Potential savings in training data
 Compatibility with more recent theories of phonology (autosegmental
phonology, articulatory phonology)
 Application to audio-visual and multilingual ASR
 Improved ASR performance with feature-based observation models in
some conditions [e.g. Kirchhoff ‘02, Soltau et al. ‘02]
 Improved lexical access in experiments with oracle feature
transcriptions [Livescu & Glass ’04, Livescu ‘05]
 Why now?
 A number of sites working on complementary aspects of this idea:
U. Edinburgh (King et al.), UIUC (Hasegawa-Johnson et al.), MIT (Livescu et al.)
 Recently developed tools (e.g. GMTK) for systematic exploration of the
model space
Definitions: Pronunciation and observation modeling
language model: P(w)
  w = "makes sense..."
pronunciation model: P(q|w)
  q = [ m m m ey1 ey1 ey2 k1 k1 k1 k2 k2 s ... ]
observation model: P(o|q)
  o = [figure: acoustic observations (spectrogram)]
Feature set for observation modeling
Feature   Values
pl 1      LAB, LAB-DEN, DEN, ALV, POST-ALV, VEL, GLO, RHO, LAT, NONE, SIL
dg 1      VOW, APP, FLAP, FRIC, CLO, SIL
glo       +, ST, IRR, VOI, VL, ASP, A+VO
vow       +, aa, ae, ah, ao, aw1, aw2, ax, axr, ay1, ay2, eh, el, em, en, er, ey1, ey2, ih, ix, iy, ow1, ow2, oy1, oy2, uh, uw, ux, N/A
nas       …
rd        …
SVitchboard
Data: SVitchboard - Small Vocabulary Switchboard
 SVitchboard [King, Bartels & Bilmes, 2005] is a collection of
small-vocabulary tasks extracted from Switchboard 1
 Closed vocabulary: no OOV issues
 Various tasks of increasing vocabulary size: 10, …, 500 words
 Pre-defined train/validation/test sets, plus a 5-fold cross-validation scheme
 Utterance fragments extracted from SWB 1
   always surrounded by silence
 Word alignments available (MSState)
 Whole word HMM baselines already built
SVitchboard = SVB
SVitchboard: amount of data
Vocabulary size   Utterances   Word tokens   Duration (total, hours)   Duration (speech, hours)
10                6775         7792          3.2                       0.9
25                9778         13324         4.7                       1.4
50                12442        20914         6.2                       1.9
100               14602        28611         7.5                       2.5
250               18933        51950         10.5                      4.0
500               23670        89420         14.6                      6.4
SVitchboard: example utterances
 10 word task
   oh
   right
   oh really
   so
   well the
 500 word task
   oh how funny
   oh no
   i feel like they need a big home a nice place where someone can have the time to play with them and things but i can't give them up
   oh
   oh i know it's like the end of the world
   i know i love mine too
SVitchboard: isn’t it too easy (or too hard)?
 No (no).
 Results on the 500 word task test set using a recent SRI
system:
First pass: 42.4% WER
After adaptation: 26.8% WER
 SVitchboard data included in the training set for this system
 SRI system has 50k vocab
 System not tuned to SVB in any way
SVitchboard: what is the point of a 10 word task?
 Originally designed for debugging purposes
 However, results on the 10 and 500 word tasks obtained in this
workshop show good correlation between WERs on the two
tasks:
[Figure: scatter plot of WER (%) on the 500 word task (y-axis, 50-85) against WER (%) on the 10 word task (x-axis, 15-29), showing a clear positive correlation]
SVitchboard: pre-existing baseline word error rates
 Whole word HMMs trained on SVitchboard
 these results are from [King, Bartels & Bilmes, 2005]
 Built with HTK
 Use MFCC observations
Vocabulary   Full validation set   Test set
10 word      20.2                  20.8
500 word     69.8                  70.8
SVitchboard: experimental technique
 We only performed task 1 of SVitchboard (the first of the 5 cross-validation folds)
 Training set is known as “ABC”
 Validation set is known as “D”
 Test set is known as “E”
 SVitchboard defines cross-validation sets
 But these were too big for the very large number of
experiments we ran
 We mainly used a fixed 500 utterance randomly-chosen
subset of “D” which we call the small validation set
 All validation set results reported today are on this set,
unless stated otherwise
SVitchboard: experimental technique
 SVitchboard includes word alignments.
 We found that using these made training significantly
faster, and gave improved results in most cases
 Word alignments are only ever used during training
Word alignments?   Validation set   Test set
without            65.1             67.7
with               62.1             65.0
 The results above are for a monophone HMM with PLP observations
SVitchboard: workshop baseline word error rates
 Monophone HMMs trained on SVitchboard
 PLP observations
Vocabulary   Small validation set   Full validation set   Test set
10 word      16.7                   18.7                  19.6
500 word     62.1                   -                     65.0
SVitchboard: workshop baseline word error rates
 Triphone HMMs trained on SVitchboard
 PLP observations
 500 word task only
System         Small validation set   Validation set   Test set
HTK            -                      56.4             61.2
GMTK/gmtkTie   56.1                   -                59.2
 (GMTK system was trained without word alignments)
SVitchboard: baseline word error rates summary
 Test set word error rates
Model           10 word   500 word
Whole word      20.8      70.8
Monophone       19.6      65.0
HTK triphone    -         61.2
GMTK triphone   -         59.2
gmtkTie
 General parameter clustering and tying tool for GMTK
 Written for this workshop
 Currently most developed parts:
 Decision-tree clustering of Gaussians, using the same technique as HTK
 Bottom-up agglomerative clustering
 Decision-tree tying was tested in this workshop on various
observation models using Gaussians
 Conventional triphone models
 Tandem models, including with factored observation
streams
 Feature based models
 Can tie based on values of any variables in the graph,
not just the phone state (e.g. feature values)
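As an aside, a minimal sketch (in Python; not gmtkTie's actual code) of the bottom-up agglomerative flavour of clustering mentioned above, assuming an HTK-style pooled log-likelihood merge cost over diagonal-covariance Gaussians; all names are illustrative.

import numpy as np

def merge_cost(n1, m1, v1, n2, m2, v2):
    """Log-likelihood loss when two diagonal Gaussians (occupancy
    count, mean vector, variance vector) are replaced by their
    pooled Gaussian."""
    n = n1 + n2
    m = (n1 * m1 + n2 * m2) / n
    v = (n1 * (v1 + m1**2) + n2 * (v2 + m2**2)) / n - m**2
    return 0.5 * (n * np.log(v).sum()
                  - n1 * np.log(v1).sum() - n2 * np.log(v2).sum())

def agglomerate(clusters, n_target):
    """clusters: list of (count, mean, var) tuples; greedily merge
    the cheapest pair until n_target clusters remain (O(k^3), fine
    for a sketch)."""
    clusters = list(clusters)
    while len(clusters) > n_target:
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: merge_cost(*clusters[ab[0]], *clusters[ab[1]]))
        n1, m1, v1 = clusters[i]
        n2, m2, v2 = clusters.pop(j)
        n = n1 + n2
        m = (n1 * m1 + n2 * m2) / n
        clusters[i] = (n, m, (n1 * (v1 + m1**2) + n2 * (v2 + m2**2)) / n - m**2)
    return clusters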
gmtkTie
 gmtkTie is more general than HTK's HHEd
 HTK asks questions about previous/next phone identity
 HTK clusters states only within the same phone
 gmtkTie can ask user-supplied questions about user-supplied features: no assumptions about states, triphones, or anything else
 gmtkTie clusters user-defined groups of parameters, not
just states
 gmtkTie can compute cluster sizes and centroids in lots
of different ways
 GMTK/gmtkTie triphone system built in this workshop is at
least as good as HTK system
gmtkTie: conclusions
 It works!
 Triphone performance at least as good as HTK
 Can cluster arbitrary groups of parameters, asking questions
about any feature the user can supply
 Later in this presentation, we will see an example of
separately clustering the Gaussians for two observation
streams
 Opens up new possibilities for clustering
 Much to explore:
 Building different decision trees for various factorings of the
acoustic observation vector
 Asking questions about other contextual factors
Hybrid models
Hybrid models: introduction
 Motivation
 Want to use feature-based representation
 In previous work, we have successfully recovered feature values
from continuous speech using neural networks (MLPs)
 MLPs alone are just frame-by-frame classifiers
 Need some “back end” model to decode their output into words
 Ways to use such classifiers
 Hybrid models
 Tandem observations
Hybrid models: introduction
 Conventional HMMs model observations with a likelihood p(O|state) or p(O|class), usually a mixture of Gaussians
 Hybrid models use another classifier (typically an MLP) to
obtain the posterior P(class|O)
 Dividing by the prior gives the likelihood, which can be used
directly in the HMM: no Gaussians required
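To make the trick concrete, here is a minimal sketch (our illustration in Python, not the workshop's code) of converting frame-wise MLP posteriors into scaled likelihoods:

import numpy as np

def scaled_log_likelihoods(posteriors, priors, floor=1e-8):
    """posteriors: (T, K) MLP outputs P(class|o_t); priors: (K,)
    class relative frequencies from the training data. Dividing by
    the prior leaves only a per-frame constant p(o_t), which cancels
    in Viterbi decoding, so these scores can stand in for Gaussian
    log-likelihoods in the HMM."""
    return np.log(np.maximum(posteriors, floor)) - np.log(priors)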
Hybrid models: introduction
 Advantages of hybrid models include:
 Can easily train the classifier discriminatively
 Once trained, MLPs will compute P(class|O) relatively fast
 MLPs can use a long window of acoustic input frames
 MLPs don’t require input feature distribution to have
diagonal covariance (e.g. can use filterbank outputs from
computational auditory scene analysis front-ends)
Hybrid models: standard method
 Standard phone-based hybrid
 Train an MLP to classify phonemes, frame by frame
 Decode the MLP output using simple HMMs for smoothing
(transition probabilities easily derived from phone duration
statistics – don’t even need to train them)
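For example, a minimal sketch of deriving those transition probabilities from duration statistics, assuming one state per phone with a geometric duration model and a 100 Hz frame rate (our assumptions, not stated on the slide):

def self_loop_prob(mean_duration_sec, frame_rate_hz=100):
    """A geometric duration model with self-loop probability a has
    mean duration 1/(1-a) frames, so a = 1 - 1/mean_frames."""
    mean_frames = mean_duration_sec * frame_rate_hz
    return 1.0 - 1.0 / mean_frames

# e.g. an average phone length of 80 ms (8 frames) gives a = 0.875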
Hybrid models: our method
 Feature-based hybrid
 Use ANNs to classify articulatory features instead of phones
 8 MLPs, classifying pl1, dg1, etc., frame by frame
One of the motivations for using features is that
it should be easier to build a multi-lingual /
cross-language system this way
Hybrid models: using feature-classifying MLPs
p(dg1 | phoneState) = Non-deterministic CPT (learned)
[Figure: DBN in which phoneState is the parent of the feature variables dg1, pl1, ..., rd; each feature variable is attached to a dummy variable, and the MLPs provide "virtual evidence" there]
Hybrid models: training the MLPs
 We use MLPs to classify speech into AFs, frame-by-frame
 Must obtain targets for training
 These are derived from phone labels
 obtained by forced alignment using the SRI recogniser
 this is less than ideal, but embedded training might
help (results later)
 MLPs were trained by Joe Frankel (Edinburgh/ICSI) & Mathew
Magimai (ICSI)
 Standard feedforward MLPs
 Trained using Quicknet
 Input to nets is a 9-frame window of PLPs (with VTLN
and per-speaker mean and variance normalisation)
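A minimal sketch of assembling that 9-frame input window; the 39-coefficient per-frame PLP dimension is our assumption (39 x 9 = 351, matching the nets' input layer on the next slide):

import numpy as np

def context_windows(plp, context=4):
    """plp: (T, D) array of per-frame PLPs; returns (T, D*(2*context+1))
    by stacking each frame with its `context` left and right neighbours,
    repeating the first/last frame at the utterance edges."""
    T, _ = plp.shape
    padded = np.vstack([np.repeat(plp[:1], context, axis=0),
                        plp,
                        np.repeat(plp[-1:], context, axis=0)])
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])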
Hybrid models: training the MLPs
 Two versions of MLPs were initially trained
 Fisher
 Trained on all of Fisher but not on any data from
Switchboard 1
 SVitchboard
 Trained only on the training set of SVB
 The Fisher nets performed better, so were used in all hybrid
experiments
Hybrid models: MLP details
MLP architecture is:
input units x hidden units x output units
Feature   MLP architecture
glo       351 x 1400 x 4
dg1       351 x 1600 x 6
nas       351 x 1200 x 3
pl1       351 x 1900 x 10
rou       351 x 1200 x 3
vow       351 x 2400 x 23
fro       351 x 1700 x 7
ht        351 x 1800 x 8
Hybrid models: MLP overall accuracies
 Frame-level accuracies
 MLPs trained on Fisher
 Accuracy computed with
respect to SVB test set
 Silence frames
excluded from this
calculation
 More detailed analysis
coming up later…
Feature   Accuracy (%)
glo       85.3
dg1       73.6
nas       92.7
pl1       72.6
rou       84.7
vow       65.6
fro       69.2
ht        68.0
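A minimal sketch of the accuracy computation described above (frame-level accuracy with silence frames excluded); the label encoding is illustrative:

import numpy as np

def frame_accuracy_excl_sil(ref, hyp, sil_label):
    """ref, hyp: per-frame integer feature labels; frames whose
    reference label is silence are dropped before scoring."""
    keep = ref != sil_label
    return 100.0 * np.mean(hyp[keep] == ref[keep])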
Hybrid models: experiments
 Using MLPs trained on Fisher using original phone-derived
targets
vs.
 Using MLPs retrained on SVB data, which has been aligned
using one of our models
 Hybrid model
vs
 Hybrid model plus PLP observation
Hybrid models: experiments – basic model
 Basic model is trained on activations from the original (Fisher-trained) MLPs
 The only parameters in this DBN are the conditional probability tables (CPTs) describing how each feature depends on the phone state
 Embedded training:
   Use the model to realign the SVB data (500 word task)
   Starting from the Fisher-trained nets, retrain on these new targets
   Retrain the DBN on the new net activations
[Figure: DBN with phoneState as parent of the feature variables dg1, ..., pl1, rd]
Model                       Small validation set   Test set
Hybrid                      26.0                   30.1
Hybrid, embedded training   23.1                   24.3
Hybrid models: 500 word results
Model                       Small validation set
hybrid                      66.6
hybrid, embedded training   62.6
Hybrid models: adding in PLPs
 To improve accuracy, we combined the “pure” hybrid model
with a standard monophone model
 The relative contributions of the virtual evidence and the PLPs can (and must) be weighted (see the sketch after this slide)
 We used a single global weight on each of the 8 virtual evidences, and a fixed weight of 1.0 on the PLPs
 Weight tuning worked best if done both during training
and decoding
 Computationally expensive: must train and cross-validate
many different systems
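A minimal sketch of the weighted combination (in the log domain, where an exponent becomes a multiplier): one shared weight on all 8 virtual-evidence streams, PLPs fixed at 1.0, as on the slide. Names are illustrative.

def combined_log_score(log_gauss_plp, ve_log_scores, ve_weight):
    """log_gauss_plp: Gaussian-mixture log-likelihood of the PLPs;
    ve_log_scores: the 8 per-feature virtual-evidence log scores for
    this frame; the single global weight scales all of them."""
    return log_gauss_plp + ve_weight * sum(ve_log_scores)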
Hybrid models: adding PLPs
p(dg1 | phoneState) = Non-deterministic CPT (learned)
[Figure: the hybrid DBN as before (phoneState parent of dg1, ..., pl1, rd, each with a dummy variable receiving the MLP likelihoods, implemented via virtual evidence in GMTK), now with a PLP observation also attached to phoneState]
Hybrid models: weighting virtual evidence vs PLP
[Figure: word error rate (%) (16.5-19) as a function of the weight on the virtual evidence (0-1.6)]
Hybrid models: experiments – basic model + PLP
 Basic model is augmented with PLP observations
   Generated from mixtures of Gaussians, initialised from a conventional monophone model
[Figure: DBN with phoneState as parent of the feature variables dg1, ..., pl1, rd and of the PLP observation]
 A big improvement over
hybrid-only model
 A small improvement over
the PLP-only monophone
model
Model          Small validation set   Test set
Hybrid         26.0                   30.1
PLP only       16.9                   20.0
Hybrid + PLP   16.2                   19.6
Hybrid experiments: conclusions
 Hybrid models perform reasonably well, but not yet as well
as conventional models
 But they have fewer parameters to be trained
 So may be a viable approach for small databases:
 Train MLPs on large database (e.g. Fisher)
 Train hybrid model on small database
 Cross-language??
 Embedded training gives good improvements for the “pure”
hybrid model
 Hybrid models augmented with PLPs perform better than
baseline PLP-only models
 But improvement is only small
 The best way to use the MLPs trained on Fisher might be to
construct tandem observation vectors…
Using MLPs to transfer knowledge from larger
databases
 Scenario
 we need to build a system for a domain/accent/language
for which we have only a small amount of data
 We have lots of data from other
domains/accents/languages
 Method
 Train MLP on large database
 Use it in either a hybrid or a tandem system in target
domain
Using MLPs to transfer knowledge from larger
databases
 Articulatory features
 It is plausible that MLPs trained as AF classifiers could be more accent/language-independent than phone classifiers
 Tandem results coming up shortly will show that, across
very similar domains (Fisher & SVB), AF nets perform as
well or better than phone nets
Hybrid models vs Tandem observations
 Standard hybrid
 Train an MLP to classify phonemes, frame by frame
 Decode the MLP output using simple HMMs (transition
probabilities easily derived from phone duration statistics –
don’t even need to train them)
 Standard tandem
 Instead of using MLP output to directly obtain the likelihood,
just use it as a feature vector, after some transformations
(e.g. taking logs) and dimensionality reduction
 Append the resulting features to standard features, e.g. PLPs
or MFCCs
 Use this vector as the observation for a standard HMM with a
mixture-of-Gaussians observation model
 Currently used in state-of-the-art systems such as SRI's
But first, a look at structural modifications . . .
Adding dependencies
 Hold the set of random variables constant
 Take an existing model and augment it with edges to
improve performance on a particular task
 Choose edges greedily, or using a discriminative metric like the Explaining Away Residue [EAR, Bilmes 1998]
 One goal is to compare these approaches on our models
EAR(X,Y) = I(X;Y|Q) − I(X;Y)
For instance, let
X = word random variable
Y = degree 1 random variable
Q = {“right”, “another word”, “silence”}
 Provides an indication of how valuable it is to model X and Y
jointly
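A minimal sketch of estimating the EAR measure from a co-occurrence count table counts[x, y, q]; the plain maximum-likelihood estimate of the two mutual informations is our simplification:

import numpy as np

def mutual_info(pxy):
    """I(X;Y) from a joint distribution table pxy."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return (pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum()

def ear(counts):
    """EAR(X,Y) = I(X;Y|Q) - I(X;Y), from counts[x, y, q]."""
    p = counts / counts.sum()
    i_cond = 0.0
    for q in range(p.shape[2]):          # sum_q P(q) * I(X;Y | Q=q)
        pq = p[:, :, q].sum()
        if pq > 0:
            i_cond += pq * mutual_info(p[:, :, q] / pq)
    return i_cond - mutual_info(p.sum(axis=2))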
Models
Monophone hybrid: [Figure: DBN with word above phoneState, which is the parent of the feature variables dg1, pl1, rou, ...; MLP likelihoods enter as virtual evidence in GMTK]
Monophone hybrid + PLP: [Figure: the same DBN with a PLP observation also attached to phoneState]
Which edges?
 Learn connections from the classifier outputs to the word variable
 Intuition: the word can use the classifier outputs to correct mistakes being made elsewhere in the model
[Figure: DBN with edges added from the feature variables pl1, rou, dg1, ... up to word, in addition to the usual word - phoneState - features structure]
Results (10 word monophone hybrid)
 Baseline monophone hybrid: 26.0% WER on CV, 30.0% Test
 Choose edge with best CV score: ROU (25.1%)
 Test result: 29.7% WER
 Choose two single best edges on CV: VOW + ROU
 Test result: 29.9%
 Choose edge with highest EAR: GLO
 Test result: 30.1%
 Choose highest EAR between MLP features: DG1 ↦ PL1
 Test result: 31.6% (CV: 26.0%)
 In the monophone + PLP model, the best result is obtained with the original model.
In Conclusion
 The EAR measure would not have chosen the best possible
edges
 These models may already be optimized
 Once PLPs are added to the model, changing the structure
has little effect
Tandem observation
models
Introduction
 Tandem is a method for using the predictions of an MLP as observation vectors in generative models, e.g. HMMs
 Extensively used in the ICSI/SRI systems: 10-20% improvements for English, Arabic, and Mandarin
 Most previous work used phone MLPs for deriving tandem
(e.g., Hermansky et al. ’00, and Morgan et al. ‘05 )
 We explore tandem based on articulatory MLPs
 Similar to the approach in Kirchhoff ’99
 Questions
 Are articulatory tandems better than the phonetic ones?
 Are factored observation models for tandem and
acoustic (e.g. PLP) observations better than the
observation concatenation approaches?
Tandem Processing Steps
MLP OUTPUTS → LOGARITHM → PRINCIPAL COMPONENT ANALYSIS → SPEAKER MEAN/VAR NORMALIZATION → TANDEM FEATURE
 MLP posteriors are processed to make them more Gaussian-like
 There are 8 articulatory MLPs;
their outputs are joined
together at the input (64 dims)
 PCA reduces dimensionality to
26 (95% of the total variance)
 Use this 26-dimensional vector
as acoustic observations in an
HMM or some other model
 The tandem features are usually
used in combination w/ a
standard feature, e.g. PLP
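A minimal sketch of these processing steps; sklearn's PCA is an illustrative stand-in for whatever the workshop actually used, and the PCA should be fitted on training data only:

import numpy as np
from sklearn.decomposition import PCA

def tandem_features(mlp_outputs, pca=None, floor=1e-8):
    """mlp_outputs: list of (T, K_i) posterior arrays from the 8
    articulatory MLPs (sum of K_i = 64). Returns (T, 26) tandem
    features plus the fitted PCA."""
    x = np.log(np.maximum(np.hstack(mlp_outputs), floor))   # logarithm
    if pca is None:
        pca = PCA(n_components=26).fit(x)    # 26 dims ~ 95% of variance
    x = pca.transform(x)
    # per-speaker mean/variance normalization (one speaker assumed here)
    x = (x - x.mean(axis=0)) / x.std(axis=0)
    return x, pca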
Tandem Observation Models
 Feature concatenation: Simply append tandems to PLPs
- All of the standard modeling methods are applicable to this meta-observation vector (e.g., MLLR, MMIE, and HLDA)
 Factored models: Tandem and PLP distributions are
factored at the HMM state output distributions
- Potentially more efficient use of free parameters, especially if
streams are conditionally independent
- Can use e.g., separate triphone clusters for each observation
Concatenated observations: [Figure: HMM state generating a single concatenated PLP+Tandem observation vector]
Factored observations: [Figure: HMM state generating PLP and Tandem as separate observation children], i.e. p(X, Y|Q) = p(X|Q) p(Y|Q)
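A minimal sketch of scoring the factored state-output distribution: each stream gets its own GMM per state, and the log-likelihoods simply add. scipy's multivariate_normal is an illustrative choice; all names are ours:

import numpy as np
from scipy.stats import multivariate_normal

def gmm_logpdf(x, weights, means, covs):
    """Log of a Gaussian-mixture density at observation x."""
    comps = [np.log(w) + multivariate_normal.logpdf(x, m, c)
             for w, m, c in zip(weights, means, covs)]
    return np.logaddexp.reduce(comps)

def factored_log_likelihood(plp, tandem, plp_gmm, tandem_gmm):
    """log p(plp, tandem | q) = log p(plp|q) + log p(tandem|q);
    plp_gmm and tandem_gmm are (weights, means, covs) for state q."""
    return gmm_logpdf(plp, *plp_gmm) + gmm_logpdf(tandem, *tandem_gmm)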
Articulatory vs. Phone Tandems
Model                              Test WER (%)
PLP                                67.7
PLP/Phone Tandem (SVB)             63.0
PLP/Articulatory Tandem (SVB)      62.3
PLP/Articulatory Tandem (Fisher)   59.7
 Monophone models on the 500 word task w/o word alignments; feature-concatenated PLP/tandem models
 All tandem systems are significantly better than PLP
alone
 Articulatory tandems are as good as phone tandems
 Articulatory tandems from Fisher (1776 hrs) trained
MLPs outperform those from SVB (3 hrs) trained MLPs
Concatenation vs. Factoring
Task   Model                        Test WER (%)
10     PLP                          24.5
10     PLP / Tandem Concatenation   21.1
10     PLP x Tandem Factoring       19.7
500    PLP                          67.7
500    PLP / Tandem Concatenation   59.7
500    PLP x Tandem Factoring       59.1
 Monophone models w/o alignments
 All tandem results are significantly better than the PLP baseline
 Consistent improvements from factoring; statistically
significant on the 500 task
Triphone Experiments
Model                        # of Clusters   Test WER (%)
PLP                          477             59.2
PLP / Tandem Concatenation   880             55.0
PLP x Tandem Factoring       467 x 641       53.8
 500 vocabulary task w/o alignments
 PLP x Tandem factoring uses separate decision trees
for PLP and Tandem, as well as factored pdf’s
 A significant improvement from factoring over the
feature concatenation approach
 All pairwise differences are statistically significant
Summary
 Tandem features w/ PLPs outperform PLPs alone for both
monophones and triphones
 8-13 % relative improvements (statistically significant)
 Articulatory tandems are as good as phone tandems
- Further comparisons w/ phone MLPs trained on Fisher
 Factored models look promising (significant results on
the 500 vocabulary task)
- Further experiments w/ tying, initialization
- Judiciously selected dependencies between the factored
vectors, instead of complete independence
Manual feature transcriptions
 Main transcription guideline: The output should correspond to
what we would like our AF classifiers to detect
 Details
 2 transcribers: phonetician (Lisa Lavoie), PhD student in speech
group (Xuemin Chi)
 78 SVitchboard utterances
 9 utterances from Switchboard Transcription Project for comparison
 Multipass transcription using WaveSurfer (KTH)
 1st pass: Phone-feature hybrid
 2nd pass: All-feature
 3rd pass: Discussion, error-correction
 Some basic statistics
 Overall speed ~1000 x real-time
 High inter-transcriber agreement (93% avg. agreement, 85% avg.
string accuracy)
 First use to date of human-labeled articulatory feature data for
classifier/recognizer testing
GMTKtoWavesurfer Debugging/Visualization Tool
 Input
   Per-utterance files containing Viterbi-decoded variables
   List of variables
   Optional map between integer values and labels
   Optional reference transcriptions for comparison
 Output
   Per-utterance, per-feature WaveSurfer (KTH) transcription files
   WaveSurfer configuration for viewing the decoded variables, and optionally comparing to a reference transcription
 General debugging/visualization for any GMTK model
Summary
 Analysis
 Improved forced AF alignments obtained using MSState word
alignments combined with new AF-based models
 MLP performance analysis shows that retrained classifiers move
closer to human alignments, farther from forced phonetic alignments
 Data
 Manual transcriptions
 PLPs and MLP outputs for all of SVitchboard
 New, improved SVitchboard baselines (monophone & triphone)
 Tools
 gmtkTie
 Viterbi path analysis tool
 Site-independent parallel GMTK training and decoding scripts
Acknowledgments
Jeff Bilmes (UW), Nancy Chen (MIT), Xuemin Chi (MIT), Trevor Darrell
(MIT), Edward Flemming (MIT), Eric Fosler-Lussier (OSU), Joe Frankel
(Edinburgh/ICSI), Jim Glass (MIT), Katrin Kirchhoff (UW), Lisa Lavoie
(Elizacorp, Emerson), Mathew Magimai (ICSI), Daryush Mehta (MIT),
Florian Metze (Deutsche Telekom), Kate Saenko (MIT), Janet Slifka (MIT),
Stefanie Shattuck-Hufnagel (MIT), Amar Subramanya (UW)
Support staff at
SSLI Lab, U. Washington
IFP, U. Illinois, Urbana-Champaign
CSTR, U. Edinburgh
ICSI
SRI
NSF
DARPA
DoD
Fred Jelinek
Laura Graham
Sanjeev Khudanpur
Jason Eisner
CLSP