
Application of RNNs to Language Processing
Andrey Malinin, Shixiang Gu
CUED Division F Speech Group
Overview
• Language Modelling
• Machine Translation
Language Modelling Problem
• Aim is to calculate the probability of a word sequence (sentence) P(X)
• This can be decomposed into a product of conditional probabilities of tokens (words):
P(X) = P(x_1) * P(x_2 | x_1) * … * P(x_T | x_1, …, x_T-1)
• In practice, only a finite context is used
N-Gram Language Model
• N-Grams estimate word conditional probabilities via counting (see the sketch below):
P(x_t | x_t-N+1, …, x_t-1) ≈ Count(x_t-N+1, …, x_t) / Count(x_t-N+1, …, x_t-1)
• Sparse (alleviated by back-off, but not entirely)
• Doesn’t exploit word similarity
• Finite Context
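To make the counting estimate concrete, here is a minimal Python sketch of a bigram (N = 2) model with a crude back-off to unigram counts. The function names and the back-off rule are illustrative only; a real system would use proper discounting (e.g. Katz or Kneser-Ney) so the back-off distribution is correctly normalised.

```python
from collections import Counter

def train_bigram_lm(corpus):
    """Count-based bigram LM. `corpus` is a list of tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())

    def prob(word, prev):
        # P(word | prev) by counting; fall back to the unigram estimate
        # when the bigram was never seen (the sparsity problem in action).
        if bigrams[(prev, word)] > 0:
            return bigrams[(prev, word)] / unigrams[prev]
        return unigrams[word] / total

    return prob

# Example: prob = train_bigram_lm([["the", "cat", "sat"]]); prob("cat", "the")
```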
Neural Network Language Model
Y. Bengio et al., JMLR’03
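As a rough illustration of the feed-forward architecture described in the cited Bengio et al. (2003) paper, the PyTorch sketch below embeds the previous n-1 words, concatenates the embeddings, applies a tanh hidden layer, and predicts the next word with a softmax. The class name and dimensions are placeholders, and the optional direct word-feature-to-output connections of the original model are omitted.

```python
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    """Feed-forward NNLM: P(w_t | w_{t-n+1}, ..., w_{t-1}) via shared word
    embeddings, a tanh hidden layer and a softmax over the vocabulary."""
    def __init__(self, vocab_size, context_size=3, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids):              # (batch, context_size)
        e = self.embed(context_ids)              # (batch, context_size, embed_dim)
        h = torch.tanh(self.hidden(e.flatten(1)))
        return torch.log_softmax(self.out(h), dim=-1)
```

The shared embedding matrix is what lets the model exploit word similarity, while the softmax over the full vocabulary is the computational bottleneck noted below.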
Limitation of Neural Network Language Model
• Sparsity – Solved
• Word Similarity – Solved
• Finite Context – Not solved
• Computational Complexity – the softmax over the full vocabulary is expensive
Recurrent Neural Network Language Model
[X. Liu, et al.]
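A minimal sketch of a recurrent LM in the same spirit (an Elman-style simple RNN, as in Mikolov's work cited in the bibliography); the hidden state summarises the entire history, so no fixed n-gram context is needed. Names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """Recurrent LM: the hidden state carries the (in principle unbounded)
    history, replacing the fixed context window of the feed-forward model."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, nonlinearity="tanh",
                          batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_ids, h0=None):        # word_ids: (batch, seq_len)
        states, h_last = self.rnn(self.embed(word_ids), h0)
        return torch.log_softmax(self.out(states), dim=-1), h_last
```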
Wall Street Journal Results – T. Mikolov Google 2010
Limitation of RNN Language Model
• Sparsity – Solved!
• Word Similarity -> Sentence Similarity – Solved!
• Finite Context – Solved? Not quite…
• The softmax is still computationally complex
Lattice Rescoring with RNNs
• Applying an RNN LM to a lattice expands the search space, since each distinct path carries its own unbounded history
• The lattice is expanded into a prefix tree or an N-best list
• Impractical to apply to large lattices
• Approximate lattice expansion (sketched below) – expand only if:
• the N-gram history is different, or
• the RNN history vector distance exceeds a threshold
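A hedged sketch of the expansion test described above. The data layout, field names and threshold value are assumptions for illustration, not the exact implementation of the cited Liu et al. paper.

```python
import numpy as np

def needs_expansion(hyp_a, hyp_b, ngram_order=4, tau=0.1):
    """A lattice node reached by two partial hypotheses is expanded only if
    their truncated n-gram histories differ or their RNN history vectors are
    far apart; otherwise the hypotheses can share a single RNN state.
    Each `hyp` is assumed to be a dict with 'words' (word-id list) and
    'h' (the RNN hidden-state vector)."""
    diff_ngram = hyp_a["words"][-(ngram_order - 1):] != hyp_b["words"][-(ngram_order - 1):]
    far_state = np.linalg.norm(np.asarray(hyp_a["h"]) - np.asarray(hyp_b["h"])) > tau
    return diff_ngram or far_state
```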
Overview
• Language Modelling
• Machine Translation
Machine Translation Task
• Translate a source sentence E into a target sentence F
• Can be formulated in the Noisy-Channel framework:
F' = argmax_F P(F|E) = argmax_F [P(E|F) * P(F)]
• P(F) is just a language model – we need to estimate the translation model P(E|F) (see the toy example below).
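A toy illustration of the decision rule above, working in log space over an assumed list of candidate translations scored separately by a translation model and a language model (names are illustrative).

```python
def choose_translation(candidates):
    """candidates: list of (translation, log P(E|F), log P(F)) tuples.
    Implements F' = argmax_F P(E|F) * P(F) in log space."""
    return max(candidates, key=lambda c: c[1] + c[2])[0]
```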
Previous Approaches: Word Alignment
W. Byrne, 4F11
• Use IBM Models 1–5, of increasing complexity and accuracy, to create initial word alignments from sentence pairs.
• Make conditional independence assumptions to separate out sentence
length, alignment and translation models.
• Bootstrap using simpler models to initialize more complex models (a sketch of Model 1 training follows below).
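As an illustration of the bootstrapping idea, here is a minimal sketch of EM training for IBM Model 1 only, the simplest model used to initialise the later ones. NULL alignments, the length and distortion models of Models 2–5, and smoothing are all omitted, and the function name is an assumption. Following the noisy-channel convention above, it estimates t(e|f), the probability that target word f generates source word e.

```python
from collections import defaultdict

def train_ibm_model1(sentence_pairs, iterations=5):
    """EM for IBM Model 1. `sentence_pairs` is a list of (e_words, f_words)
    token lists; returns the translation table t(e|f) as a dict."""
    t = defaultdict(lambda: 1.0)                 # flat initialisation
    for _ in range(iterations):
        count = defaultdict(float)               # expected counts c(e, f)
        total = defaultdict(float)               # expected counts c(f)
        for e_sent, f_sent in sentence_pairs:
            for e in e_sent:                     # E-step: soft word alignments
                norm = sum(t[(e, f)] for f in f_sent)
                for f in f_sent:
                    delta = t[(e, f)] / norm
                    count[(e, f)] += delta
                    total[f] += delta
        for (e, f) in count:                     # M-step: t(e|f) = c(e,f) / c(f)
            t[(e, f)] = count[(e, f)] / total[f]
    return dict(t)
```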
Previous Approaches: Phrase Based SMT
W. Byrne, 4F11
• Using the IBM word alignments, create phrase alignments and a phrase translation model.
• Parameters estimated by Maximum Likelihood or EM.
• Apply Synchronous Context Free Grammar to learn hierarchical rules
over phrases.
Problems with Previous Approaches
• Highly Memory Intensive
• Initial alignment makes conditional independence assumption
• Word and Phrase translation models only count co-occurrences of
surface form – don’t take word similarity into account
• Highly non-trivial to decode hierarchical phrase-based translation; it combines:
• word alignments + a lexical reordering model
• a language model
• phrase translations
• parsing a synchronous context-free grammar over the text
and these components are very different from one another.
Neural Machine Translation
• The translation problem is expressed as a probability P(F|E)
• Equivalent to P(f_n, f_n-1, …, f_0 | e_m, e_m-1, …, e_0) -> a sequence conditioned on another sequence.
• Create an RNN architecture where the output of one RNN (the decoder) is conditioned on another RNN (the encoder).
• We can connect them using a joint alignment and translation
mechanism.
• Results in a single gestalt Machine Translation model which can
generate candidate translations.
Bi-Directional RNNs
Neural Machine Translation: Encoder
[Figure: bi-directional RNN encoder producing annotation vectors h_0 … h_N, one for each source word e_0 … e_N]
• Can be pre-trained as a bi-directional RNN language model (the encoder is sketched below)
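A minimal PyTorch sketch of such an encoder, assuming GRU units as in the cited Bahdanau et al. paper; class name and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Bi-directional encoder: forward and backward hidden states are
    concatenated into one annotation vector h_j per source word e_j."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.birnn = nn.GRU(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)

    def forward(self, src_ids):                  # (batch, src_len)
        h, _ = self.birnn(self.embed(src_ids))   # (batch, src_len, 2*hidden_dim)
        return h                                 # annotations h_0 ... h_N
```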
Neural Machine Translation: Decoder
[Figure: decoder RNN with states s_0 … s_M generating target words f_0 … f_M = </S>, starting from <S>]
• f_t is produced by sampling from the discrete distribution given by the softmax output layer (see the sketch below).
• Can be pre-trained as an RNN language model
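A sketch of that sampling step (greedy or beam-search decoding would instead take the most likely words; the function name is illustrative).

```python
import torch

def sample_next_word(logits):
    """Draw f_t from the discrete distribution defined by the decoder's
    softmax output layer. `logits` is a (vocab_size,) tensor of scores."""
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```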
Neural Machine Translation: Joint Alignment
[Figure: at step t the decoder state s_t-1 is compared with every encoder annotation h_0 … h_N to give alignment weights a_t,1:N and a context vector c_t]
z_j = W · tanh(V · s_t-1 + U · h_j)
c_t = ∑_j a_t,j · h_j
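A sketch of the joint alignment (attention) computation above, with the weights a_t,1:N obtained by a softmax over the scores z_j as in the cited Bahdanau et al. paper; the class name and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class JointAlignment(nn.Module):
    """z_j = W · tanh(V · s_{t-1} + U · h_j);  a_t = softmax(z);
    c_t = sum_j a_{t,j} h_j  (the context vector fed to the decoder)."""
    def __init__(self, state_dim=128, annot_dim=256, align_dim=64):
        super().__init__()
        self.V = nn.Linear(state_dim, align_dim, bias=False)
        self.U = nn.Linear(annot_dim, align_dim, bias=False)
        self.W = nn.Linear(align_dim, 1, bias=False)

    def forward(self, s_prev, h):
        # s_prev: (batch, state_dim); h: (batch, src_len, annot_dim)
        z = self.W(torch.tanh(self.V(s_prev).unsqueeze(1) + self.U(h)))
        a = torch.softmax(z, dim=1)              # alignment weights a_{t,1:N}
        c = (a * h).sum(dim=1)                   # context vector c_t
        return c, a.squeeze(-1)
```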
Neural Machine Translation: Features
• End-to-end differentiable, trained using SGD with cross-entropy error function.
• Encoder and Decoder learn to represent source and target sentences in a
compact, distributed manner
• Does not make conditional independence assumptions to separate out
translation model, alignment model, re-ordering model, etc…
• Does not pre-align words by bootstrapping from simpler models.
• Learns translation and joint alignment in a semantic space, not over surface
forms.
• Conceptually easy to decode – complexity similar to speech processing, not
SMT.
• Fewer Parameters – more memory efficient.
NMT BLEU results on English to French Translation
D. Bahdanau, K. Cho, Y. Bengio. Neural Machine Translation by Jointly Learning to Align and Translate.
Conclusion
• RNNs and LSTM RNNs have been widely applied to a large range of language processing tasks.
• State of the art in language modelling
• Competitive performance on new tasks.
• Quickly evolving.
Bibliography
• W. Byrne, Engineering Part IIB: Module 4F11 Speech and Language
Processing. Lecture 12.
http://mi.eng.cam.ac.uk/~pcw/local/4F11/4F11_2014_lect12.pdf
• D. Bahdanau, K. Cho, Y. Bengio. Neural Machine Translation by Jointly
Learning to Align and Translate. 2014.
• Y. Bengio, et al., “A neural probabilistic language model”. Journal of Machine
Learning Research, No. 3 (2003)
• X. Liu, et al. “Efficient Lattice Rescoring using Recurrent Neural Network
Language Models”. In: Proceedings of IEEE ICASSP 2014.
• T. Mikolov. “Statistical Language Models Based on Neural Networks” (2012)
PhD Thesis. Brno University of Technology, Faculty of Information Technology,
Department Of Computer Graphics and Multimedia.