Deep Neural Network Acoustic Models

Deep Neural Network Acoustic Models
Steve Renals
Automatic Speech Recognition – ASR Lecture 12
25 February 2016
Recap
Hybrid NN/HMM
[Figure: hybrid NN/HMM system. The utterance "Don't Ask" is decomposed into words (DON'T, ASK) and subword phones (d oh n t, ah s k), each modelled by an HMM acoustic model. A network with 1 hidden layer (~1000 hidden units) maps 9x39 MFCC inputs (frames x(t-4), x(t-3), …, x(t), …, x(t+3), x(t+4)) to 3x39 = 117 phone-state outputs; below, a spectrogram of the speech acoustics (freq 0-8000 Hz, time 0-1400 ms).]
HMM/NN vs HMM/GMM
Advantages of NN:
Can easily model correlated features
Correlated feature vector components (eg spectral features)
Input context – multiple frames of data at input
More flexible than GMMs – not made of (nearly) local
components; GMMs inefficient for non-linear class boundaries
NNs can model multiple events in the input simultaneously –
different sets of hidden units modelling each event; GMMs
assume each frame generated by a single mixture component.
NNs can learn richer representations and learn ‘higher-level’
features (tandem, posteriorgrams, bottleneck features)
Disadvantages of NN:
Until ∼ 2012:
Context-independent (monophone) models, weak speaker
adaptation algorithms
NN systems less complex than GMMs (fewer parameters):
RNN – < 100k parameters, MLP – ∼ 1M parameters
Computationally expensive - more difficult to parallelise
training than GMM systems
Deep Neural Network Acoustic Models
Deep neural networks (DNNs) — Hybrid system
[Figure: hybrid DNN. MFCC inputs (39*9 = 351) feed 3–8 hidden layers of ~2000 hidden units each, with ~12000 context-dependent (CD) phone outputs.]
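To make the architecture concrete, here is a minimal sketch of such a hybrid DNN in PyTorch. This is an illustration, not the lecture's implementation: the layer sizes follow the figure, while the sigmoid activations, the dummy minibatch, and the frame-level cross-entropy training are assumptions typical of systems of this period.

```python
import torch
import torch.nn as nn

N_INPUT = 39 * 9    # 9 frames of 39-dimensional MFCCs
N_HIDDEN = 2000     # hidden units per layer
N_LAYERS = 5        # somewhere in the 3-8 range
N_OUTPUT = 12000    # context-dependent tied HMM states

layers = [nn.Linear(N_INPUT, N_HIDDEN), nn.Sigmoid()]
for _ in range(N_LAYERS - 1):
    layers += [nn.Linear(N_HIDDEN, N_HIDDEN), nn.Sigmoid()]
layers.append(nn.Linear(N_HIDDEN, N_OUTPUT))   # logits; softmax is inside the loss
dnn = nn.Sequential(*layers)

# Frame-level training against forced-alignment state targets.
loss_fn = nn.CrossEntropyLoss()
x = torch.randn(32, N_INPUT)                   # dummy minibatch of input frames
targets = torch.randint(0, N_OUTPUT, (32,))    # dummy CD-state targets
loss = loss_fn(dnn(x), targets)
loss.backward()
```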
DNNs — what’s new?
Training networks with multiple hidden layers directly with gradient
descent is difficult — sensitive to initialisation, and gradients can be
very small after propagating back through several layers.
Unsupervised pretraining
Train a stacked restricted Boltzmann machine generative
model (unsupervised), then fine-tune with backprop
Contrastive divergence training
Layer-by-layer training
Successively train deeper networks, each time replacing the output
layer with a hidden layer and a new output layer (see the sketch below)
Many hidden layers
GPUs provide the computational power
Wide output layer (context-dependent phone classes)
GPUs provide the computational power
(Hinton et al 2012)
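A minimal sketch of the layer-by-layer idea, under the same assumptions as the previous snippet (PyTorch; `train` is a hypothetical routine standing in for ordinary backprop training): grow the stack one hidden layer at a time, discarding the old output layer and attaching a fresh one at each round.

```python
import torch.nn as nn

def deepen(hidden_stack, n_in, n_hidden, n_out):
    """Add one hidden layer to the stack and attach a new output layer."""
    n_prev = n_in if not hidden_stack else n_hidden
    hidden_stack.append(nn.Sequential(nn.Linear(n_prev, n_hidden), nn.Sigmoid()))
    return nn.Sequential(*hidden_stack, nn.Linear(n_hidden, n_out))

hidden_stack = []        # trained hidden layers are kept across rounds
for _ in range(5):       # successively train nets with 1, 2, ..., 5 hidden layers
    model = deepen(hidden_stack, n_in=351, n_hidden=2000, n_out=12000)
    # train(model, data) # hypothetical: ordinary backprop; the output layer is
                         # then discarded and the deeper hidden stack is reused
```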
Unsupervised pretraining
[Figure: GRBM → RBM → RBM, trained in sequence, with weights W1, W2, W3 (and their transposes) copied at each stage; the stack is converted to a DBN, and a new output layer (W4 = 0) is added to give the pretrained DBN-DNN.]
[FIG1] The sequence of operations used to create a DBN with three hidden layers and to convert it to a pretrained DBN-DNN. First, a
GRBM is trained to model a window of frames of real-valued acoustic coefficients. Then the states of the binary hidden units of the
GRBM are used as data for training an RBM. This is repeated to create as many hidden layers as desired. Then the stack of RBMs is
converted to a single generative model, a DBN, by replacing the undirected connections of the lower level RBMs by top-down, directed
connections. Finally, a pretrained DBN-DNN is created by adding a “softmax” output layer that contains one unit for each possible state
of each HMM. The DBN-DNN is then discriminatively trained to predict the HMM state corresponding to the central frame of the input
window in a forced alignment.
Hinton et al (2012)
Interfacing a DNN with an HMM (Hinton et al 2012): after it has been discriminatively fine-tuned, a DNN outputs probabilities of the form p(HMMstate | AcousticInput). But to compute a Viterbi alignment or to run the forward-backward algorithm within the HMM framework, we require the likelihood p(AcousticInput | HMMstate); the posteriors are converted to scaled likelihoods by dividing each one by the frequency of its state in the forced alignment.
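In symbols (the standard hybrid-system conversion, stated here for completeness rather than copied from the slide): for acoustic frame x(t) and HMM state s, Bayes' rule gives

    p(x(t) | s) = p(s | x(t)) p(x(t)) / P(s)  ∝  p(s | x(t)) / P(s)

since p(x(t)) is the same for every state at a given frame; the prior P(s) is estimated from the relative frequency of state s in the forced alignment.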
Example: hybrid HMM/DNN phone recognition (TIMIT)
Train a ‘baseline’ three-state monophone HMM/GMM system
(61 phones, 3-state HMMs) and Viterbi align to provide DNN
training targets (a time-state alignment)
The HMM/DNN system uses the same set of states as the
HMM/GMM system — the DNN has 183 (61*3) outputs
Hidden layers — many experiments, exact sizes not highly
critical
3–8 hidden layers
1024–3072 units per hidden layer
Multiple hidden layers always work better than one hidden
layer
Pretraining always results in lower error rates
Best systems have lower phone error rate than best
HMM/GMM systems (using state-of-the-art techniques such
as discriminative training, speaker adaptive training)
Acoustic features for NN acoustic models
GMMs: filter bank features (spectral domain) not used, as they
are strongly correlated with each other – would require either
full covariance matrix Gaussians or
many diagonal covariance Gaussians
DNNs do not require the components of the feature vector to
be uncorrelated
Can directly use multiple frames of input context (this has
been done in NN/HMM systems since 1990!)
Can potentially use feature vectors with correlated components
(e.g. filter banks)
Experiments indicate that filter bank features result in greater
accuracy than MFCCs
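As a sketch of the distinction, here is how both feature types can be computed with the librosa library (the library choice, the file name, and the analysis parameters are assumptions for illustration); MFCCs are essentially a truncated DCT of the log filter bank energies, which is exactly the decorrelating step.

```python
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical audio file

# Log mel filter bank (FBANK) features: components are correlated across
# neighbouring channels -- fine for a DNN, awkward for diagonal GMMs.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                     hop_length=160, n_mels=40)
fbank = np.log(mel + 1e-10)

# MFCCs: a DCT of the log filter bank, keeping the low-order
# coefficients, which decorrelates (and compresses) the features.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_fft=400,
                            hop_length=160, n_mfcc=13)
```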
TIMIT phone error rates: effect of depth and feature type
[Figure: PER as a function of the number of hidden layers (1–8), for fbank inputs (2048 hidden units, 15 frames) and MFCC inputs (3072 hidden units, 16 frames), on the TIMIT core test and dev sets (Mohamed et al 2012, Fig. 2).]
Visualising neural networks
How to visualise NN layers? “t-SNE” (stochastic neighbour
embedding using a t-distribution) projects high-dimensional
vectors (e.g. the values of all the units in a layer) into 2
dimensions
The t-SNE projection aims to keep points that are close in high
dimensions close in 2 dimensions, by comparing distributions
over pairwise distances between the high-dimensional and
2-d spaces – the optimisation is over the positions of
points in the 2-d space
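A sketch of the procedure using scikit-learn's t-SNE implementation (an assumption; `hidden_activations` is a stand-in for unit values collected at some layer over a set of frames):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# (n_frames, n_units): values of all the units in one layer, per frame
hidden_activations = np.random.randn(500, 2000)   # placeholder data

# Optimise 2-D point positions so that the distributions over pairwise
# distances match those in the high-dimensional space.
points_2d = TSNE(n_components=2, perplexity=30).fit_transform(hidden_activations)

plt.scatter(points_2d[:, 0], points_2d[:, 1], s=4)
plt.show()
```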
Feature vector (input layer): t-SNE visualisation
[Figure: t-SNE 2-D maps of MFCC feature vectors (left, Fig. 4) and fbank feature vectors (right, Fig. 3). Crosses refer to one utterance and circles to the other, while different colours refer to different speakers; data points of the other 18 speakers were removed to make the maps less cluttered (Mohamed et al 2012).]
(Mohamed et al (2012))
Visualisation of 2 utterances (cross and circle) spoken by 6
speakers (colours)
MFCCs are more scattered than FBANK
FBANK has more local structure than MFCCs
First hidden layer: t-SNE visualisation
[Figure: t-SNE 2-D maps of the 1st-layer hidden activity vectors of the fine-tuned network, using MFCC inputs (left, Fig. 7) and fbank inputs (right, Fig. 5) (Mohamed et al 2012).]
(Mohamed et al (2012))
Visualisation of 2 utterances (cross and circle) spoken by 6
speakers (colours)
Hidden layer vectors start to align more between speakers for
FBANK

Table 2 (Mohamed et al 2012): PER of deep nets using different features.

  Feature     | Dim | 1 layer | 2 layers | 3 layers
  fbank       | 123 | 23.5%   | 22.6%    | 22.7%
  dct         | 123 | 26.0%   | 23.8%    | 24.6%
  MFCC        |  39 | 24.3%   | 23.7%    | 23.8%
  MFCC+noise  | 123 | 26.3%   | 24.3%    | 25.1%

(Padding the MFCC vector with random noise up to the dimensionality of the dct features degraded performance: it is not the higher dimensionality that matters, but how the discriminant information is distributed over the dimensions.)
Eighth hidden layer: t-SNE visualisation
[Figure: t-SNE 2-D maps of the 8th-layer hidden activity vectors of the fine-tuned network, using MFCC inputs (left, Fig. 8) and fbank inputs (right, Fig. 6) (Mohamed et al 2012).]
(Mohamed et al (2012))
Visualisation of 2 utterances (cross and circle) spoken by 6
speakers (colours)
In the final hidden layer, the hidden layer outputs for the same
phone are well-aligned across speakers for both MFCC and FBANK
– but stronger for FBANK
Visualising neural networks
Are the differences due to FBANK being higher-dimensional
(41 × 3 = 123) than MFCC (13 × 3 = 39)?
NO! Using higher-dimensional MFCCs, or just adding noise to
MFCCs, results in a higher error rate (see Table 2 earlier)
Why? In FBANK the useful information is distributed over
all the features; in MFCC it is concentrated in the first few.
Example: hybrid HMM/DNN large vocabulary
conversational speech recognition (Switchboard)
Recognition of American English conversational telephone
speech (Switchboard)
Baseline context-dependent HMM/GMM system
9,304 tied states
Discriminatively trained (BMMI — similar to MPE)
39-dimension PLP (+ derivatives) features
Trained on 309 hours of speech
Hybrid HMM/DNN system
Context-dependent — 9304 output units obtained from Viterbi
alignment of HMM/GMM system
7 hidden layers, 2048 units per layer
DNN-based system results in significant word error rate
reduction compared with GMM-based system
Pretraining not necessary on larger tasks (empirical result)
DNN vs GMM on large vocabulary tasks (experiments from 2012)

Table 3 (Hinton et al 2012): comparison of the percentage WERs using DNN-HMMs and GMM-HMMs on five different large vocabulary tasks.

  Task                                     | Training data | DNN-HMM | GMM-HMM, same data | GMM-HMM, more data
  Switchboard (test set 1)                 | 309 h         | 18.5    | 27.4               | 18.6 (2,000 h)
  Switchboard (test set 2)                 | 309 h         | 16.1    | 23.6               | 17.1 (2,000 h)
  English Broadcast News                   | 50 h          | 17.5    | 18.8               |
  Bing Voice Search (sentence error rates) | 24 h          | 30.4    | 36.2               |
  Google Voice Input                       | 5,870 h       | 12.3    |                    | 16.0 (>>5,870 h)
  YouTube                                  | 1,400 h       | 47.6    | 52.3               |

“Table 3 summarizes the acoustic modeling results described above. It shows that DNN-HMMs consistently outperform GMM-HMMs that are trained on the same amount of data, sometimes by a large margin. For some tasks, DNN-HMMs also outperform GMM-HMMs that are trained on much more data.” (Hinton et al 2012)
Neural Network Features
Tandem features (posteriorgrams)
Use NN probability estimates as an additional input feature
stream in an HMM/GMM system (tandem features, i.e.
NN + acoustics; also called posteriorgrams)
Advantages of tandem features:
can be estimated using a large amount of temporal context (eg
up to ±25 frames)
encode phone discrimination information
only weakly correlated with PLP or MFCC features
Tandem features: reduce the dimensionality of the NN outputs using
PCA, then concatenate with acoustic features (e.g. MFCCs) – see
the sketch below
PCA also decorrelates the feature vector components – important
for GMM-based systems
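A sketch of the tandem pipeline (illustrative assumptions: the posteriors and MFCCs are already computed, the sizes are arbitrary, and scikit-learn's PCA stands in for whatever decorrelating transform is used):

```python
import numpy as np
from sklearn.decomposition import PCA

# (n_frames, n_phones) log phone posteriors from the NN, and
# (n_frames, 39) conventional acoustic features
nn_log_posteriors = np.random.randn(1000, 61)   # placeholder data
mfcc = np.random.randn(1000, 39)                # placeholder data

pca = PCA(n_components=25)                      # reduce dimensionality and decorrelate
nn_feats = pca.fit_transform(nn_log_posteriors)

tandem = np.hstack([mfcc, nn_feats])            # (n_frames, 64) tandem features
# 'tandem' becomes the observation stream for the HMM/GMM system
```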
Tandem features
[Figure: tandem feature generation. Nine frames of PLP features feed the “PLP Net”; 51 frames of log-critical band energies feed HATs processing; the two posterior streams are combined frame by frame, log-transformed, PCA-reduced, and concatenated with one frame of PLP features to give one frame of augmented features for the GMHMM back end.]
[FIG1] Posterior-based feature generation system. Each posterior stream is created by feeding a trained multilayer perceptron (MLP)
with features that have different temporal and spectral extent. The “PLP Net” is trained to generate phone posterior estimates given
roughly 100 ms of telephone bandwidth speech after being processed by PLP analysis over nine frames. HATs processing is trained for
the same goal given 500 ms of log-critical band energies. The two streams of posteriors are combined (in a weighted sum where each
weight is a scaled version of local stream entropy) and transformed as shown to augment the more traditional PLP features. The
augmented feature vector is used as an observation by the Gaussian mixture hidden Markov model (GMHMM) system.
Morgan et al (2005)
Bottleneck features
[Fig. 1 (Grezl and Fousek 2008): block diagram of Bottle-Neck feature extraction with TRAP-DCT raw features at the MLP input. Speech → |FFT|² → log → critical bands (+VTLN), 10 ms step / 25 ms window; per-band Hamming window and DCT give the raw features; speaker-based mean and variance normalisation; a 5-layer MLP; PCA/HLDA gives the BN features.]
Use a “bottleneck” hidden layer to provide features for an
HMM/GMM system
Decorrelate the bottleneck layer outputs using PCA (or similar)
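A minimal PyTorch sketch of a five-layer bottleneck MLP and of extracting BN features from it (the layer sizes are illustrative assumptions, not Grezl and Fousek's exact configuration):

```python
import torch
import torch.nn as nn

class BottleneckMLP(nn.Module):
    """5-layer MLP: the narrow middle layer supplies the BN features."""
    def __init__(self, n_in=351, n_hidden=1500, n_bottleneck=39, n_out=135):
        super().__init__()
        self.to_bottleneck = nn.Sequential(
            nn.Linear(n_in, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_bottleneck))      # narrow bottleneck layer
        self.from_bottleneck = nn.Sequential(
            nn.Sigmoid(),
            nn.Linear(n_bottleneck, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_out))             # phone-state logits

    def forward(self, x):
        return self.from_bottleneck(self.to_bottleneck(x))

net = BottleneckMLP()
# ... train 'net' as a phone-state classifier, then extract features:
x = torch.randn(100, 351)                           # dummy input frames
bn_features = net.to_bottleneck(x).detach()         # (100, 39) BN features
# decorrelate with PCA/HLDA before handing them to the HMM/GMM back end
```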
Experimental comparison of tandem and bottleneck
features
[Figure 2 (Valente et al 2011): character error rate (CER) of various speech signal representations (TANDEM-PLP, MRASTA, dct-TRAPS, wLP-TRAPS, and their augmented versions) used as input to three-layer MLP, bottleneck, hierarchical and multi-stream architectures. Top plot, “MLP without MFCC”: stand-alone feature performance (CER roughly 20–35%). Bottom plot, “MLP with MFCC”: the same features in concatenation with MFCC (CER roughly 20–25%).]
Results on a Mandarin broadcast news transcription task, using
an HMM/GMM system
Explores many different acoustic features for the NN
Posteriorgram/bottleneck features alone (top plot);
concatenating NN features with MFCCs (bottom plot)

Figure 2 (top plot) reveals that, when a three-layer MLP is used, none of the long temporal inputs (MRASTA, DCT-TRAPS, wLP-TRAPS, and their augmented versions) outperform the conventional TANDEM-PLP or the MFCC baseline. On the other hand, replacing the three-layer MLP with a bottleneck or hierarchical architecture (while keeping the total number of parameters constant) considerably reduces the error, achieving a CER lower than the MFCC baseline. The lowest CER is obtained by the multi-stream architecture, which combines outputs of MLPs trained on long and short temporal contexts, improving by 10% relative over the MFCC baseline. (Valente et al 2011)

Table 5 (Valente et al 2011): summary of CER and relative improvements over the MFCC baseline.

             | TANDEM      | Hier/Bottleneck + Multistream
  MLP        | 25.5 (+1%)  | 23.1 (+10%)
  MLP+MFCC   | 22.2 (+14%) | 21.2 (+18%)
Autoencoders
An autoencoder is a neural network trained to map its input
into a distributed representation from which the input can be
reconstructed
Example: a single-hidden-layer network, with an output of the
same dimension as the input, trained to reproduce the input
using a squared error cost function
E = (1/2) ||y − x||²

[Figure: autoencoder. x: d-dimension inputs → hidden layer (the learned representation) → y: d-dimension outputs.]
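A sketch of this single-hidden-layer autoencoder in PyTorch, with the sizes chosen only for illustration:

```python
import torch
import torch.nn as nn

d, h = 39, 20                                  # input and hidden sizes (assumed)
autoencoder = nn.Sequential(
    nn.Linear(d, h), nn.Sigmoid(),             # encoder: the learned representation
    nn.Linear(h, d))                           # decoder: reconstruct the input

x = torch.randn(64, d)                         # dummy minibatch of inputs
y = autoencoder(x)
loss = 0.5 * ((y - x) ** 2).sum(dim=1).mean()  # squared-error cost E = 1/2 ||y - x||^2
loss.backward()
```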
Autoencoder Bottleneck (AE-BN) Features
[Fig. 1: structure of the (1) Deep Belief Network (sigmoid hidden layers, 384 softmax outputs, trained with cross-entropy) and the (2) Autoencoder (softsign layers with a 40-unit bottleneck) that maps the DBN's 384 outputs; the bottleneck supplies the AE-BN features, and the dotted boxes indicate modules that are trained separately.]
First train a “usual” DNN classifying acoustic input into 384
HMM states
Then train an autoencoder that maps the predicted output
vector to the target output vector
Use the bottleneck hidden layer in the autoencoder as features
for a GMM/HMM system (see the sketch below)
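A sketch of the two-stage AE-BN idea in PyTorch (all layer sizes other than the 384 outputs and the 40-unit bottleneck are assumptions, and the classifier is taken as already trained):

```python
import torch
import torch.nn as nn

# Stage 1 (assumed trained): a DNN classifying frames into 384 HMM states.
classifier = nn.Sequential(nn.Linear(351, 1024), nn.Sigmoid(),
                           nn.Linear(1024, 384), nn.Softmax(dim=1))

# Stage 2: an autoencoder with a 40-unit bottleneck, trained to map the
# predicted output vector to the target (1-of-384 alignment) vector.
encoder = nn.Sequential(nn.Linear(384, 1024), nn.Sigmoid(),
                        nn.Linear(1024, 40))
decoder = nn.Sequential(nn.Sigmoid(), nn.Linear(40, 384))

x = torch.randn(32, 351)                        # dummy acoustic frames
pred = classifier(x).detach()                   # predicted output vectors
target = torch.zeros(32, 384)                   # dummy 1-of-384 targets
target[torch.arange(32), torch.randint(0, 384, (32,))] = 1.0

loss = ((decoder(encoder(pred)) - target) ** 2).mean()
loss.backward()

ae_bn_features = encoder(pred)                  # 40-dim AE-BN features per frame
```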
Results using Autoencoder Bottleneck (AE-BN) Features
Table 4: WER in % on English Broadcast News.

                      |        50 h        |       430 h
  LVCSR stage         | GMM-HMM  | AE-BN   | GMM-HMM  | AE-BN
                      | baseline |         | baseline |
  FSA                 | 24.8     | 20.6    | 20.2     | 17.6
  +fBMMI              | 20.7     | 19.0    | 17.7     | 16.6
  +BMMI               | 19.6     | 18.1    | 16.5     | 15.8
  +MLLR               | 18.8     | 17.5    | 16.0     | 15.5
  Model combination   |       16.4         |       15.0
Hinton et al (2012)
Summary
DNN/HMM systems (hybrid systems) give a significant
improvement over GMM/HMM systems
Compared with 1990s NN/HMM systems, DNN/HMM
systems
model context-dependent tied states with a much wider output
layer
are deeper – more hidden layers
can use correlated features (e.g. FBANK)
DNN features obtained from output layer (posteriorgram) or
hidden layer (bottleneck features) give a significant reduction
in WER when appended to acoustic features (e.g. MFCCs)
Reading
G Hinton et al (Nov 2012). “Deep neural networks for acoustic modeling
in speech recognition”, IEEE Signal Processing Magazine, 29(6), 82–97.
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6296526
A Mohamed et al (2012). “Understanding how deep belief networks
perform acoustic modelling”, Proc ICASSP-2012.
http://www.cs.toronto.edu/~asamir/papers/icassp12_dbn.pdf
N Morgan et al (Sep 2005). “Pushing the envelope – aside”, IEEE Signal
Processing Magazine, 22(5), 81–88.
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=1511826
F Grezl and P Fousek (2008). “Optimizing bottleneck features for
LVCSR”, Proc ICASSP-2008.
http://noel.feld.cvut.cz/speechlab/publications/068_icassp08.pdf
F Valente et al (2011). “Analysis and Comparison of Recent MLP
Features for LVCSR Systems”, Proc Interspeech-2011.
https://www.sri.com/sites/default/files/publications/analysis_and_comparison_of_recent_mlp_features_for_lvcsr_systems.pdf