Deep Neural Network Acoustic Models
Steve Renals
Automatic Speech Recognition – ASR Lecture 12
25 February 2016

Recap

Hybrid NN/HMM

[Figure: hybrid NN/HMM system for the utterance "Don't Ask". The words (DON'T, ASK) are decomposed into subword (phone) units (d oh n t, ah s k), each modelled by an HMM. A neural network with one hidden layer of ~1000 hidden units takes 9 x 39 MFCC inputs (frames x(t-4) ... x(t+4)) and estimates probabilities for the 3 x 39 = 117 phone states.]

HMM/NN vs HMM/GMM

Advantages of NN:
- Can easily model correlated features
  - Correlated feature vector components (eg spectral features)
  - Input context – multiple frames of data at input
- More flexible than GMMs – not made of (nearly) local components; GMMs are inefficient for non-linear class boundaries
- NNs can model multiple events in the input simultaneously – different sets of hidden units modelling each event; GMMs assume each frame is generated by a single mixture component
- NNs can learn richer representations and learn 'higher-level' features (tandem, posteriorgrams, bottleneck features)

Disadvantages of NN (until ~2012):
- Context-independent (monophone) models, weak speaker adaptation algorithms
- NN systems less complex than GMMs (fewer parameters): RNN – <100k parameters, MLP – ~1M parameters
- Computationally expensive – more difficult to parallelise training than GMM systems

Deep Neural Network Acoustic Models

Deep neural networks (DNNs) — Hybrid system

[Figure: DNN hybrid architecture: MFCC inputs (39 x 9 = 351 dimensions), 3–8 hidden layers of ~2000 hidden units each, and a wide output layer of ~12000 context-dependent (CD) phone outputs.]

DNNs — what's new?

- Training networks with multiple hidden layers directly with gradient descent is difficult — sensitive to initialisation, and gradients can become very small after propagating back through several layers
- Unsupervised pretraining
  - Train a stacked restricted Boltzmann machine (RBM) generative model (unsupervised), then fine-tune with backprop
  - Contrastive divergence training
- Layer-by-layer training
  - Successively train deeper networks, each time replacing the output layer with a new hidden layer and a new output layer
- Many hidden layers: GPUs provide the computational power
- Wide output layer (context-dependent phone classes): GPUs provide the computational power

(Hinton et al 2012)

Unsupervised pretraining

[Figure (Hinton et al 2012): the sequence of operations used to create a DBN with three hidden layers and to convert it to a pretrained DBN-DNN. First, a GRBM is trained to model a window of frames of real-valued acoustic coefficients. Then the states of the binary hidden units of the GRBM are used as data for training an RBM. This is repeated to create as many hidden layers as desired. The stack of RBMs is then converted to a single generative model, a DBN, by replacing the undirected connections of the lower-level RBMs with top-down directed connections. Finally, a pretrained DBN-DNN is created by adding a "softmax" output layer that contains one unit for each possible state of each HMM. The DBN-DNN is then discriminatively trained to predict the HMM state corresponding to the central frame of the input window in a forced alignment.]
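To make the RBM pretraining step shown above more concrete, here is a toy NumPy sketch of one contrastive divergence (CD-1) update for a single Bernoulli-Bernoulli RBM, the kind used for the upper layers of the stack (the first layer would be a Gaussian-Bernoulli RBM). This is an illustrative sketch, not the recipe from Hinton et al (2012): the layer sizes, learning rate and random placeholder data are assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Toy dimensions: the 'visible' units would be the hidden activities of the
    # layer below (or the acoustic input for the first, Gaussian-Bernoulli RBM)
    n_visible, n_hidden, lr = 100, 50, 0.01
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_v = np.zeros(n_visible)   # visible biases
    b_h = np.zeros(n_hidden)    # hidden biases

    def cd1_update(v0):
        """One contrastive divergence (CD-1) update for a minibatch v0 of
        binary visible vectors, shape (batch, n_visible)."""
        global W, b_v, b_h
        # Positive phase: infer and sample hidden units given the data
        p_h0 = sigmoid(v0 @ W + b_h)
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        # Negative phase: one step of Gibbs sampling (reconstruct, then re-infer)
        p_v1 = sigmoid(h0 @ W.T + b_v)
        p_h1 = sigmoid(p_v1 @ W + b_h)
        # Approximate gradient of the log-likelihood
        batch = v0.shape[0]
        W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / batch
        b_v += lr * (v0 - p_v1).mean(axis=0)
        b_h += lr * (p_h0 - p_h1).mean(axis=0)

    # Example: a few updates on random binary 'data'
    for _ in range(10):
        cd1_update((rng.random((32, n_visible)) < 0.5).astype(float))

Each update nudges the weights so that the model's one-step reconstructions look more like the data; stacking RBMs trained this way gives the DBN used to initialise the DNN before discriminative fine-tuning.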
Example: hybrid HMM/DNN phone recognition (TIMIT)

- Train a 'baseline' three-state monophone HMM/GMM system (61 phones, 3-state HMMs) and Viterbi align to provide DNN training targets (time-state alignment)
- The HMM/DNN system uses the same set of states as the HMM/GMM system — the DNN has 183 (61 x 3) outputs
- Hidden layers — many experiments, exact sizes not highly critical
  - 3–8 hidden layers
  - 1024–3072 units per hidden layer
- Multiple hidden layers always work better than one hidden layer
- Pretraining always results in lower error rates
- Best systems have a lower phone error rate than the best HMM/GMM systems (which use state-of-the-art techniques such as discriminative training and speaker adaptive training)

Acoustic features for NN acoustic models

- GMMs: filter bank features (spectral domain) are not used, as they are strongly correlated with each other – they would require either full-covariance Gaussians or many diagonal-covariance Gaussians
- DNNs do not require the components of the feature vector to be uncorrelated
  - Can directly use multiple frames of input context (this has been done in NN/HMM systems since 1990!)
  - Can potentially use feature vectors with correlated components (e.g. filter banks)
- Experiments indicate that filter bank features result in greater accuracy than MFCCs

TIMIT phone error rates: effect of depth and feature type

[Figure (Mohamed et al 2012, Fig. 2): phone error rate (PER) as a function of the number of hidden layers (1–8), for FBANK inputs (2048 hidden units per layer, 15 frames) and MFCC inputs (3072 hidden units per layer, 16 frames), on the TIMIT core test and development sets.]

Visualising neural networks

- How to visualise NN layers?
- "t-SNE" (stochastic neighbour embedding using a t-distribution) projects high-dimensional vectors (e.g. the values of all the units in a layer) into 2 dimensions
- The t-SNE projection aims to keep points that are close in high dimensions close in 2 dimensions, by comparing distributions over pairwise distances in the high-dimensional and 2-dimensional spaces – the optimisation is over the positions of the points in the 2-d space (see the sketch below)
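A rough sketch of how such a projection can be produced with scikit-learn's t-SNE implementation, assuming a matrix of hidden-unit activation vectors and per-frame speaker labels are already available. The variable names, placeholder data and perplexity setting are illustrative choices, not taken from Mohamed et al.

    import numpy as np
    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt

    # Assumed inputs, one row per frame:
    # hidden_acts : (n_frames, n_units) activations of the layer being visualised
    # speaker_ids : (n_frames,) integer speaker label per frame (used for colour)
    rng = np.random.default_rng(0)
    hidden_acts = rng.standard_normal((500, 1024))   # placeholder data
    speaker_ids = rng.integers(0, 6, size=500)       # placeholder labels

    # Project the high-dimensional activation vectors into 2-D
    tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
    points_2d = tsne.fit_transform(hidden_acts)

    # One colour per speaker, as in the figures that follow
    plt.scatter(points_2d[:, 0], points_2d[:, 1], c=speaker_ids, cmap="tab10", s=5)
    plt.title("t-SNE projection of hidden layer activations")
    plt.show()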
Feature vector (input layer): t-SNE visualisation

[Figure (Mohamed et al 2012): t-SNE 2-D maps of MFCC and FBANK feature vectors.]

- Visualisation of 2 utterances (cross and circle) spoken by 6 speakers (colours)
- MFCCs are more scattered than FBANK
- FBANK has more local structure than MFCCs

First hidden layer: t-SNE visualisation

[Figure (Mohamed et al 2012): t-SNE 2-D maps of the first-layer hidden activity vectors of the fine-tuned networks, for MFCC and FBANK inputs.]
- Visualisation of 2 utterances (cross and circle) spoken by 6 speakers (colours)
- Hidden layer vectors start to align more between speakers for FBANK

Eighth hidden layer: t-SNE visualisation

[Figure (Mohamed et al 2012): t-SNE 2-D maps of the eighth-layer hidden activity vectors of the fine-tuned networks, for MFCC and FBANK inputs.]

- Visualisation of 2 utterances (cross and circle) spoken by 6 speakers (colours)
- In the final hidden layer, the hidden layer outputs for the same phone are well aligned across speakers for both MFCC and FBANK – but the alignment is stronger for FBANK

Visualising neural networks (continued)

- Are the differences due to FBANK being higher dimensional (41 x 3 = 123) than MFCC (13 x 3 = 39)? NO! Using higher-dimensional MFCCs, or just adding noise to MFCCs, results in a higher error rate
- Why? – In FBANK the useful information is distributed over all the features; in MFCC it is concentrated in the first few (see the sketch below)
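This last point can be made concrete: MFCCs are essentially a truncated DCT of the log filter bank outputs, and the DCT packs most of the variance of the smooth, correlated filter bank channels into the first few coefficients. A small NumPy/SciPy sketch, using smoothed random data as a stand-in for real FBANK features; the 40-channel dimension and the smoothing are assumptions made purely for illustration.

    import numpy as np
    from scipy.fftpack import dct

    rng = np.random.default_rng(0)

    # Stand-in for real log mel filter bank (FBANK) features, one row per frame:
    # random values smoothed across the channel axis so that neighbouring
    # channels are correlated, as they are for real spectral features
    raw = rng.standard_normal((1000, 40))
    kernel = np.ones(7) / 7.0
    fbank = np.apply_along_axis(lambda row: np.convolve(row, kernel, mode="same"), 1, raw)

    # MFCCs are (roughly) a DCT of the log filter bank outputs,
    # truncated to the first few (e.g. 13) coefficients
    cepstra = dct(fbank, type=2, axis=1, norm="ortho")
    mfcc = cepstra[:, :13]

    # Variance per dimension, as a fraction of the total: spread across all
    # FBANK channels, but concentrated in the low-order DCT coefficients
    # (which are what MFCC keeps)
    var_fbank = fbank.var(axis=0) / fbank.var(axis=0).sum()
    var_dct = cepstra.var(axis=0) / cepstra.var(axis=0).sum()
    print("FBANK: variance in first 13 of 40 dims: %.2f" % var_fbank[:13].sum())
    print("DCT  : variance in first 13 of 40 dims: %.2f" % var_dct[:13].sum())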
Example: hybrid HMM/DNN large vocabulary conversational speech recognition (Switchboard)

- Recognition of American English conversational telephone speech (Switchboard)
- Baseline context-dependent HMM/GMM system
  - 9,304 tied states
  - Discriminatively trained (BMMI — similar to MPE)
  - 39-dimensional PLP (+ derivatives) features
  - Trained on 309 hours of speech
- Hybrid HMM/DNN system (see the sketch below)
  - Context dependent — 9,304 output units, with training targets obtained from a Viterbi alignment of the HMM/GMM system
  - 7 hidden layers, 2048 units per layer
- The DNN-based system gives a significant word error rate reduction compared with the GMM-based system
- Pretraining not necessary on larger tasks (empirical result)
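For illustration, a minimal PyTorch sketch of a DNN acoustic model of roughly this shape (7 hidden layers of 2048 units, 9,304 context-dependent state outputs), together with the usual hybrid-system step of converting the softmax state posteriors into scaled likelihoods by dividing by the state priors before HMM decoding. The input dimension, sigmoid activations and uniform priors are placeholder assumptions; this is not the exact recipe used in these experiments.

    import torch
    import torch.nn as nn

    def make_dnn(input_dim=429, hidden_dim=2048, n_hidden=7, n_states=9304):
        """DNN acoustic model: stacked fully connected sigmoid layers, with a
        softmax over tied context-dependent HMM states applied below."""
        layers, d = [], input_dim
        for _ in range(n_hidden):
            layers += [nn.Linear(d, hidden_dim), nn.Sigmoid()]
            d = hidden_dim
        layers.append(nn.Linear(d, n_states))        # output logits
        return nn.Sequential(*layers)

    dnn = make_dnn()                                 # 429 = e.g. 11 frames of 39-dim features

    # Placeholder minibatch of input windows
    x = torch.randn(32, 429)
    log_post = torch.log_softmax(dnn(x), dim=-1)     # log P(state | acoustics)

    # Viterbi decoding / forward-backward in the HMM needs (scaled) likelihoods:
    # P(acoustics | state) is proportional to P(state | acoustics) / P(state),
    # so subtract the log state priors (uniform here; in practice estimated from
    # the state frequencies in the forced alignment used for training)
    log_prior = torch.log(torch.ones(9304) / 9304)
    scaled_log_lik = log_post - log_prior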
DNN vs GMM on large vocabulary tasks (experiments from 2012)

Word error rates (%) for DNN-HMMs and GMM-HMMs on five large vocabulary tasks (Hinton et al 2012, Table 3):

Task                                     | Hours of training data | DNN-HMM | GMM-HMM (same data) | GMM-HMM (more data)
Switchboard (test set 1)                 | 309                    | 18.5    | 27.4                | 18.6 (2,000 h)
Switchboard (test set 2)                 | 309                    | 16.1    | 23.6                | 17.1 (2,000 h)
English Broadcast News                   | 50                     | 17.5    | 18.8                | –
Bing Voice Search (sentence error rates) | 24                     | 30.4    | 36.2                | –
Google Voice Input                       | 5,870                  | 12.3    | –                   | 16.0 (>>5,870 h)
YouTube                                  | 1,400                  | 47.6    | 52.3                | –

DNN-HMMs consistently outperform GMM-HMMs trained on the same amount of data, sometimes by a large margin; for some tasks they also outperform GMM-HMMs trained on much more data.

Neural Network Features

Tandem features (posteriorgrams)

- Use NN probability estimates as an additional input feature stream in an HMM/GMM system — tandem features (i.e. NN + acoustics), also called posteriorgrams
- Advantages of tandem features
  - can be estimated using a large amount of temporal context (eg up to ±25 frames)
  - encode phone discrimination information
  - only weakly correlated with PLP or MFCC features
- Tandem features: reduce the dimensionality of the NN outputs using PCA, then concatenate with the acoustic features (e.g. MFCCs), as in the sketch below
- PCA also decorrelates the feature vector components – important for GMM-based systems
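A rough sketch of the tandem feature pipeline, assuming per-frame NN phone posteriors and MFCC vectors are already available. The dimensions, the log transform before PCA and the choice of 25 retained components are illustrative assumptions rather than a fixed recipe.

    import numpy as np
    from sklearn.decomposition import PCA

    # Assumed inputs, one row per frame
    rng = np.random.default_rng(0)
    nn_post = rng.dirichlet(np.ones(61), size=1000)   # placeholder NN phone posteriors
    mfcc = rng.standard_normal((1000, 39))            # placeholder MFCC (+ delta) features

    # Take logs (common for posterior features), then reduce dimensionality with
    # PCA; this also decorrelates the components, which suits diagonal-covariance GMMs
    log_post = np.log(nn_post + 1e-10)
    pca = PCA(n_components=25)
    post_feats = pca.fit_transform(log_post)

    # Tandem feature vector: NN-derived features appended to the acoustic features
    tandem = np.concatenate([mfcc, post_feats], axis=1)
    print(tandem.shape)   # (1000, 64)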
Tandem features

[Figure (Morgan et al 2005, Fig. 1): posterior-based feature generation system. Each posterior stream is created by feeding a trained multilayer perceptron (MLP) with features that have a different temporal and spectral extent. The "PLP Net" is trained to generate phone posterior estimates given roughly 100 ms of telephone-bandwidth speech processed by PLP analysis over nine frames; HATs processing is trained for the same goal given 500 ms (51 frames) of log critical band energies. The two streams of posteriors are combined in a weighted sum (each weight a scaled version of the local stream entropy), then passed through a log transform and PCA dimensionality reduction, and concatenated with one frame of PLP features. The augmented feature vector is used as an observation by the Gaussian mixture hidden Markov model (GMHMM) back-end.]

Bottleneck features

[Figure (Grezl and Fousek 2008, Fig. 1): block diagram of bottleneck feature extraction with TRAP-DCT raw features at the MLP input: a log critical band spectrogram (+VTLN, 10 ms step, 25 ms window), speaker-based mean and variance normalisation, Hamming windowing and DCT to form the raw features, then a five-layer MLP with a narrow middle layer whose outputs are decorrelated by PCA/HLDA to give the bottleneck (BN) features.]

- Use a "bottleneck" hidden layer in the MLP to provide features for an HMM/GMM system (see the sketch below)
- Decorrelate the bottleneck layer outputs using PCA (or similar)
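A minimal sketch of bottleneck feature extraction in PyTorch: a five-layer MLP with a narrow middle layer is trained as a phone-state classifier, then truncated at the bottleneck so that the bottleneck activations can be used (after PCA/HLDA decorrelation, as above) as features for the GMM/HMM system. The layer sizes, input dimension and number of targets are placeholders, not those of Grezl and Fousek.

    import torch
    import torch.nn as nn

    # Five-layer MLP (input, three hidden layers, output) with a narrow 'bottleneck'
    bottleneck_dim = 39
    mlp = nn.Sequential(
        nn.Linear(513, 1500), nn.Sigmoid(),             # input -> first hidden layer
        nn.Linear(1500, bottleneck_dim), nn.Sigmoid(),  # bottleneck layer
        nn.Linear(bottleneck_dim, 1500), nn.Sigmoid(),
        nn.Linear(1500, 135),                           # phone-state classification outputs
    )
    # ... train `mlp` as a phone-state classifier with cross-entropy here ...

    # After training, keep only the layers up to (and including) the bottleneck
    bottleneck_extractor = mlp[:4]

    with torch.no_grad():
        x = torch.randn(32, 513)                        # placeholder batch of input frames
        bn_feats = bottleneck_extractor(x)              # (32, 39) bottleneck features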
Experimental comparison of tandem and bottleneck features

[Figure (Valente et al 2011, Fig. 2): character error rate (CER) for various speech signal representations used as input to three-layer MLP, bottleneck, hierarchical and multi-stream architectures, both as stand-alone features (top plot) and in concatenation with MFCCs (bottom plot).]

- Results on a Mandarin broadcast news transcription task, using an HMM/GMM system
- Explores many different acoustic features for the NN
- Posteriorgram/bottleneck features alone (top plot)
- Concatenating NN features with MFCCs (bottom plot)

Autoencoders

- An autoencoder is a neural network trained to map its input into a distributed representation from which the input can be reconstructed
- Example: a single hidden layer network, with an output of the same dimension as the input, trained to reproduce the input using the squared error cost function
  E = 1/2 ||y - x||^2
  where x is the d-dimensional input, y is the d-dimensional output, and the hidden layer is the learned representation (see the sketch below)
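A minimal sketch of this single-hidden-layer autoencoder, trained with the squared error cost E = 1/2 ||y - x||^2; the data, dimensions and optimiser settings here are placeholder assumptions.

    import torch
    import torch.nn as nn

    d, h = 39, 20                                    # input/output dim, hidden (code) dim
    autoencoder = nn.Sequential(
        nn.Linear(d, h), nn.Sigmoid(),               # encoder: learned representation
        nn.Linear(h, d),                             # decoder: reconstruction y
    )
    opt = torch.optim.SGD(autoencoder.parameters(), lr=0.1)

    x = torch.randn(1000, d)                         # placeholder training data
    for epoch in range(50):
        y = autoencoder(x)                           # reconstruction of the input
        loss = 0.5 * ((y - x) ** 2).sum(dim=1).mean()  # E = 1/2 ||y - x||^2 per example
        opt.zero_grad()
        loss.backward()
        opt.step()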
Autoencoder bottleneck (AE-BN) features

[Figure (Sainath et al): structure of the DBN and the bottleneck auto-encoder; the dotted boxes indicate modules that are trained separately.]

- First train a "usual" DNN classifying the acoustic input into 384 HMM states
- Then train an autoencoder that maps the predicted output vector to the target output vector
- Use the bottleneck hidden layer of the autoencoder as features for a GMM/HMM system

Results using autoencoder bottleneck (AE-BN) features

WER (%) on English Broadcast News (Hinton et al 2012, Table 4):

LVCSR stage       | 50 h GMM-HMM baseline | 50 h AE-BN | 430 h GMM-HMM baseline | 430 h AE-BN
FSA               | 24.8                  | 20.6       | 20.2                   | 17.6
+fBMMI            | 20.7                  | 19.0       | 17.7                   | 16.6
+BMMI             | 19.6                  | 18.1       | 16.5                   | 15.8
+MLLR             | 18.8                  | 17.5       | 16.0                   | 15.5
Model combination | –                     | 16.4       | –                      | 15.0

Summary

- DNN/HMM systems (hybrid systems) give a significant improvement over GMM/HMM systems
- Compared with 1990s NN/HMM systems, DNN/HMM systems:
  - model context-dependent tied states with a much wider output layer
  - are deeper – more hidden layers
  - can use correlated features (e.g. FBANK)
- DNN features obtained from the output layer (posteriorgrams) or from a hidden layer (bottleneck features) give a significant reduction in WER when appended to acoustic features (e.g. MFCCs)

Reading

G Hinton et al (Nov 2012). "Deep neural networks for acoustic modeling in speech recognition", IEEE Signal Processing Magazine, 29(6), 82–97. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6296526

A Mohamed et al (2012). "Understanding how deep belief networks perform acoustic modelling", Proc ICASSP-2012. http://www.cs.toronto.edu/~asamir/papers/icassp12_dbn.pdf

N Morgan et al (Sep 2005). "Pushing the envelope – aside", IEEE Signal Processing Magazine, 22(5), 81–88. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=1511826

F Grezl and P Fousek (2008). "Optimizing bottleneck features for LVCSR", Proc ICASSP-2008. http://noel.feld.cvut.cz/speechlab/publications/068_icassp08.pdf

F Valente et al (2011). "Analysis and Comparison of Recent MLP Features for LVCSR Systems", Proc Interspeech-2011. https://www.sri.com/sites/default/files/publications/analysis_and_comparison_of_recent_mlp_features_for_lvcsr_systems.pdf