Application of RNNs to Language Processing
Andrey Malinin, Shixiang Gu
CUED Division F Speech Group

Overview
• Language Modelling
• Machine Translation

Language Modelling Problem
• Aim is to calculate the probability of a sequence (sentence), P(X)
• Can be decomposed into a product of conditional probabilities of tokens (words):
  P(X) = P(x_1, …, x_N) = ∏_i P(x_i | x_1, …, x_{i-1})
• In practice, only a finite context is used

N-Gram Language Model
• N-grams estimate word conditional probabilities via counting:
  P(x_i | x_{i-n+1}, …, x_{i-1}) ≈ count(x_{i-n+1}, …, x_i) / count(x_{i-n+1}, …, x_{i-1})
• Sparse (alleviated by back-off, but not entirely)
• Doesn't exploit word similarity
• Finite context

Neural Network Language Model
Y. Bengio et al., JMLR'03

Limitations of the Neural Network Language Model
• Sparsity – solved
• Word similarity – solved
• Finite context – not solved
• Computational complexity – the softmax over the vocabulary

Recurrent Neural Network Language Model
[X. Liu, et al.]

Wall Street Journal Results – T. Mikolov, Google, 2010

Limitations of the RNN Language Model
• Sparsity – solved!
• Word similarity -> sentence similarity – solved!
• Finite context – solved? Not quite…
• Still computationally complex: the softmax over the vocabulary

Lattice Rescoring with RNNs
• Applying RNNs to lattices expands the search space
• The lattice is expanded to a prefix tree or an N-best list
• Impractical to apply to large lattices
• Approximate lattice expansion – expand only if:
  • the N-gram history is different, or
  • the distance between RNN history vectors exceeds a threshold

Overview
• Language Modelling
• Machine Translation

Machine Translation Task
• Translate a source sentence E into a target sentence F
• Can be formulated in the noisy-channel framework:
  F' = argmax_F P(F|E) = argmax_F P(E|F) ∙ P(F)
• P(F) is just a language model – we still need to estimate P(E|F).

Previous Approaches: Word Alignment (W. Byrne, 4F11)
• Use IBM Models 1–5 to create initial word alignments of increasing complexity and accuracy from sentence pairs.
• Make conditional independence assumptions to separate out sentence length, alignment and translation models.
• Bootstrap: use simpler models to initialise more complex models.

Previous Approaches: Phrase-Based SMT (W. Byrne, 4F11)
• Using the IBM word alignments, create phrase alignments and a phrase translation model.
• Parameters are estimated by maximum likelihood or EM.
• Apply a synchronous context-free grammar to learn hierarchical rules over phrases.

Problems with Previous Approaches
• Highly memory intensive
• The initial alignment makes conditional independence assumptions
• Word and phrase translation models only count co-occurrences of surface forms – they don't take word similarity into account
• Highly non-trivial to decode hierarchical phrase-based translation:
  • word alignments + lexical reordering model
  • language model
  • phrase translations
  • parsing a synchronous context-free grammar over the text
  and these components are very different from one another.

Neural Machine Translation
• The translation problem is expressed as a probability P(F|E)
• Equivalent to P(f_M, f_{M-1}, …, f_0 | e_N, e_{N-1}, …, e_0) -> a sequence conditioned on another sequence.
• Create an RNN architecture where the output of one RNN (the decoder) is conditioned on another RNN (the encoder).
• The two are connected using a joint alignment and translation mechanism.
• Results in a single, gestalt machine translation model which can generate candidate translations.
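Both the recurrent language model above and the encoder/decoder networks on the following slides are built around the same basic recurrent step. Below is a minimal sketch of one such step in plain numpy; the layer sizes, initialisation and variable names are illustrative assumptions for this sketch, not those of the cited systems.

# Minimal sketch of one step of a vanilla recurrent language model.
# All dimensions and names are assumptions made for illustration.
import numpy as np

V, H, E = 10000, 256, 128   # vocabulary, hidden and embedding sizes (assumed)

rng = np.random.default_rng(0)
W_emb = rng.normal(scale=0.1, size=(V, E))   # word embeddings
W_hh  = rng.normal(scale=0.1, size=(H, H))   # recurrent (history) weights
W_xh  = rng.normal(scale=0.1, size=(H, E))   # input-to-hidden weights
W_hy  = rng.normal(scale=0.1, size=(V, H))   # hidden-to-output weights

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def rnn_lm_step(word_id, h_prev):
    """Consume word x_i, return P(x_{i+1} | x_1 .. x_i) and the new state."""
    x = W_emb[word_id]                        # embed the current word
    h = np.tanh(W_xh @ x + W_hh @ h_prev)     # update the unbounded history vector
    p_next = softmax(W_hy @ h)                # distribution over the whole vocabulary
    return p_next, h

# Scoring a (toy) sentence: accumulate the log-probabilities of its words.
h = np.zeros(H)
log_prob = 0.0
for prev_word, next_word in zip([0, 17, 42], [17, 42, 3]):   # toy word ids
    p, h = rnn_lm_step(prev_word, h)
    log_prob += np.log(p[next_word])

The softmax over the full vocabulary at every step is the computational bottleneck referred to on the slides above.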
Bi-Directional RNNs

Neural Machine Translation: Encoder
[Figure: bi-directional encoder RNN computing hidden states h_0, h_1, …, h_j, …, h_N over the source words e_0, e_1, …, e_j, …, e_N]
• Can be pre-trained as a bi-directional RNN language model

Neural Machine Translation: Decoder
[Figure: decoder RNN with states s_0, s_1, …, s_t, …, s_M generating target words f_0, f_1, …, f_t, …, f_M = </S>, starting from <S>]
• f_t is produced by sampling from the discrete distribution given by the softmax output layer.
• Can be pre-trained as an RNN language model

Neural Machine Translation: Joint Alignment
[Figure: at decoder step t, alignment scores z_0, z_1, …, z_j, …, z_N are computed between the previous decoder state s_{t-1} and the encoder states h_0, h_1, …, h_j, …, h_N]
• z_{t,j} = W ∙ tanh(V ∙ s_{t-1} + U ∙ h_j)
• a_{t,1:N} = softmax(z_{t,1:N})
• c_t = ∑_j a_{t,j} ∙ h_j

Neural Machine Translation: Features
• End-to-end differentiable, trained using SGD with a cross-entropy error function.
• The encoder and decoder learn to represent source and target sentences in a compact, distributed manner.
• Does not make conditional independence assumptions to separate out a translation model, alignment model, re-ordering model, etc.
• Does not pre-align words by bootstrapping from simpler models.
• Learns translation and joint alignment in a semantic space, not over surface forms.
• Conceptually easy to decode – the complexity is similar to speech processing, not SMT.
• Fewer parameters – more memory efficient.

NMT BLEU Results on English-to-French Translation
D. Bahdanau, K. Cho, Y. Bengio. Neural Machine Translation by Jointly Learning to Align and Translate.

Conclusion
• RNNs and LSTM RNNs have been widely applied to a large range of language processing tasks.
• State of the art in language modelling.
• Competitive performance on new tasks.
• A quickly evolving area.

Bibliography
• W. Byrne. Engineering Part IIB: Module 4F11 Speech and Language Processing, Lecture 12. http://mi.eng.cam.ac.uk/~pcw/local/4F11/4F11_2014_lect12.pdf
• D. Bahdanau, K. Cho, Y. Bengio. "Neural Machine Translation by Jointly Learning to Align and Translate". 2014.
• Y. Bengio, et al. "A Neural Probabilistic Language Model". Journal of Machine Learning Research, No. 3 (2003).
• X. Liu, et al. "Efficient Lattice Rescoring Using Recurrent Neural Network Language Models". In: Proceedings of IEEE ICASSP 2014.
• T. Mikolov. "Statistical Language Models Based on Neural Networks". PhD Thesis, Brno University of Technology, Faculty of Information Technology, Department of Computer Graphics and Multimedia, 2012.
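As a closing illustration, here is a minimal numpy sketch of the joint alignment (attention) step from the "Neural Machine Translation: Joint Alignment" slide above. The dimensions, initialisation and helper names are assumptions made for this sketch, not taken from the Bahdanau et al. implementation.

# Minimal sketch of the joint alignment step:
#   z_{t,j} = W . tanh(V . s_{t-1} + U . h_j)
#   a_{t,1:N} = softmax(z_{t,1:N})
#   c_t = sum_j a_{t,j} . h_j
# All sizes and names below are illustrative assumptions.
import numpy as np

H_enc, H_dec, A = 512, 512, 256              # encoder/decoder state and alignment sizes (assumed)
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(A, H_enc))   # projects encoder states h_j
V = rng.normal(scale=0.1, size=(A, H_dec))   # projects previous decoder state s_{t-1}
W = rng.normal(scale=0.1, size=(A,))         # scores each alignment candidate

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attend(s_prev, enc_states):
    """Return the context vector c_t and the soft alignment a_t over source positions."""
    # enc_states: array of shape (N+1, H_enc) holding the encoder states h_0 .. h_N
    z = np.array([W @ np.tanh(V @ s_prev + U @ h_j) for h_j in enc_states])
    a = softmax(z)                           # a_{t,1:N}: how much step t attends to each source word
    c = a @ enc_states                       # c_t = sum_j a_{t,j} h_j
    return c, a

# Usage: one decoder step over a toy source sentence of 7 positions.
enc_states = rng.normal(size=(7, H_enc))
s_prev = np.zeros(H_dec)
c_t, a_t = attend(s_prev, enc_states)

At decoding time, c_t is fed together with the previous target word into the decoder update that produces s_t, from whose softmax output layer the next word f_t is sampled, as described on the decoder slide.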