Discriminative and Neural Translation Models
July 2, 2015

Noisy Channel MT

[Figure: an English source sentence, modeled by p(e), passes through a noisy channel p(g \mid e) and is observed as German; a decoder inverts the channel.]

e^* = \arg\max_e p(e \mid g)
    = \arg\max_e \frac{p(g \mid e)\, p(e)}{p(g)}
    = \arg\max_e p(g \mid e)\, p(e)

And Beyond…

e^* = \arg\max_e p(e \mid g)
    = \arg\max_e \frac{p(g \mid e)\, p(e)}{p(g)}
    = \arg\max_e p(g \mid e)\, p(e)
    = \arg\max_e \log p(g \mid e) + \log p(e)
    = \arg\max_e \underbrace{\begin{bmatrix} 1 & 1 \end{bmatrix}}_{w^\top} \underbrace{\begin{bmatrix} \log p(g \mid e) \\ \log p(e) \end{bmatrix}}_{h(g,e)}

Does this look familiar?

MT Linear Model

[Figure: translation hypotheses plotted in a two-dimensional space with axes -\log p(g \mid e) and -\log p(e); the weight vector w points toward better translations.]

• Improvement 1: change w to find better translations
• Improvement 2: add dimensions to make points separable

e^* = \arg\max_e w^\top h(g, e)

• Improve the modeling capacity of the noisy channel in two ways
  • Reorient the weight vector (parameter optimization)
  • Add new dimensions (new features)
• Questions
  • What features? h(g, e)
  • How do we set the weights? w

What Features?

• Two solutions
  • Engineer features: as a language expert, what do I think is likely to be useful for distinguishing good from bad translations?
  • Learn features: let data decide what features should be used to solve the problem.

What are Features?

• A feature function is a function from objects to real values
• A "good" feature function will map inputs that behave similarly (from the task perspective) into similar values
• Example feature functions:
  • the number of verbs in the translation
  • 1 if the first word is capitalized, 0 otherwise
  • the log language model probability of a sequence

Words as Vectors

• A word such as "horse" can be encoded as a one-hot vector v_{horse} \in \{0,1\}^{|V|}
• A word embedding w_{horse} = P v_{horse}, where P is a d \times |V| matrix, maps it to a short dense vector whose dimensions can capture properties such as animate, size, used for transportation

[Figure: embeddings w_{horse}, w_{dog}, w_{cat} cluster together, as do w_{train}, w_{truck}, w_{car}; the angle θ between vectors measures similarity, with w_{translation} elsewhere in the space.]
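To make the linear model concrete, here is a minimal Python sketch of a feature function h(g, e) built from the example features above and of the score w·h(g, e) that the decoder maximizes. The scorers log_p_tm and log_p_lm are hypothetical stand-ins for a translation model and a language model, not any particular system's interface.

```python
def h(g, e, log_p_tm, log_p_lm):
    """Map a (source, translation) pair to a feature vector.

    log_p_tm(g, e) and log_p_lm(e) are assumed to be externally supplied
    channel and language model scorers (hypothetical stand-ins here).
    """
    words = e.split()
    return [
        log_p_tm(g, e),                                    # log p(g | e)
        log_p_lm(e),                                       # log p(e)
        float(len(words)),                                 # number of words in hypothesis
        1.0 if words and words[0][0].isupper() else 0.0,   # first word capitalized?
    ]

def score(w, features):
    """Linear model score w . h(g, e)."""
    return sum(wi * fi for wi, fi in zip(w, features))

# With w = [1, 1, 0, 0] the model reduces to the noisy channel; reorienting w
# (improvement 1) or adding feature dimensions (improvement 2) goes beyond it.
toy_tm = lambda g, e: -4.2   # made-up scores, just to run the example
toy_lm = lambda e: -7.9
feats = h("ich trank ein Bier", "I drank a beer", toy_tm, toy_lm)
print(score([1.0, 1.0, 0.0, 0.0], feats))
```

Decoding then amounts to searching for the hypothesis with the highest score, which is what the rest of this section builds on.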
Features for MT

• For MT we want to represent translations (not words) as vectors
• The space of translations (not just good ones, but bad ones too!) is huge, so designing this vector space is hard
• The noisy channel says: "let's represent translations as two-dimensional vectors"
• Until a year ago, MT was pretty close to that…

Features for MT

• Target side features
  • log p(e) [ n-gram language model ]
  • Number of words in hypothesis
  • Non-English character count
• Source + target features
  • log relative frequency e|f of each rule [ log #(e,f) − log #(f) ]
  • log relative frequency f|e of each rule [ log #(e,f) − log #(e) ]
  • "lexical translation" log probability e|f of each rule [ ≈ log p_Model1(e|f) ]
  • "lexical translation" log probability f|e of each rule [ ≈ log p_Model1(f|e) ]
• Other features
  • Count of rules/phrases used
  • Reordering pattern probabilities

Computing Features: Locality

• Local features: feature values only depend on the rule
• Source-contextual features: feature values depend on the rule's target side and the entire source document
• Non-local features: feature values depend on the other translation decisions used
  • Example: (n-gram) language model

Why do we care?

• Searching for the best solution with local and source-contextual features is inexpensive
  • Use dynamic programming / shortest path
• Non-local features effectively mean that decisions you make "early" in the search affect how much each decision later will cost
  • NP-hard maximization problem in general
  • You will get to (attempt to) solve this in the lab today

Parameter Learning

Hypothesis Space

[Figure: translation hypotheses and reference translations plotted as points in a two-dimensional feature space with axes h1 and h2.]

Preliminaries

We assume a decoder that computes:

\langle e^*, a^* \rangle = \arg\max_{\langle e, a \rangle} w^\top h(g, e, a)

And K-best lists of that, that is:

\{\langle e_i^*, a_i^* \rangle\}_{i=1}^{i=K} = \arg i\text{-th}\max_{\langle e, a \rangle} w^\top h(g, e, a)

Standard algorithms exist for both of these.

Learning Weights

• Try to match the reference translation exactly
• Conditional random field
  • Maximize the score of the reference translations
  • "Average" over the different ways of producing the same translation (latent variables)
• Max-margin
  • Find the weight vector that separates the reference translation from the others by the maximal margin (latent variables setting)

[Figure: two different derivations that both translate "ich trank ein Bier" as "I drank a beer", illustrating the latent variables.]

Problems

• These methods give "full credit" when the model exactly produces the reference and no credit otherwise
• What is the problem with this?
  • There are many ways to translate a sentence
  • What if we have multiple reference translations?
  • What about partial credit?

Cost-Sensitive Training

• Assume we have a cost function that gives a score for how good/bad a translation is: Δ(ê, E) ∈ [0, 1]
• Optimize the weight vector by making reference to this function

K-Best List Example

[Figure: a 10-best list (#1–#10) plotted in the h1–h2 feature space together with the current weight vector w; shading bins each hypothesis by its cost, from 0.0–0.2 (best) up to 0.8–1.0 (worst).]

Training as Classification

• Pairwise Ranking Optimization
  • Reduce the training problem to binary classification with a linear model
• Algorithm (a code sketch follows the illustration below)
  • For i = 1 to N
    • Pick a random pair of hypotheses (A, B) from the K-best list
    • Use the cost function to determine whether A or B is better
    • Create the i-th training instance
  • Train a binary linear classifier

[Figure sequence: random pairs are drawn from the K-best list and labeled "Better!" or "Worse!" according to their costs; the resulting instances are plotted in h1–h2 space, a linear model is fit to separate them, and the resulting weight vector w re-ranks the K-best list.]
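The pairwise ranking procedure just outlined reduces tuning to binary classification. Below is a minimal Python sketch of that reduction, assuming we already have a K-best list of (feature vector, cost) pairs from the decoder; the sampling scheme, the toy costs, and the perceptron-style update are illustrative simplifications rather than the exact recipe of any toolkit (PRO implementations typically sample many pairs, filter them by cost difference, and train a regularized logistic-regression or SVM classifier).

```python
import random

def pro_training_data(kbest, num_pairs):
    """Build binary classification instances from a K-best list.

    kbest: list of (feature_vector, cost) pairs, cost in [0, 1], lower is better.
    Returns (feature_difference, label) pairs: label +1 if the first hypothesis
    of the pair is the better one, -1 otherwise.
    """
    data = []
    for _ in range(num_pairs):
        (ha, ca), (hb, cb) = random.sample(kbest, 2)
        if ca == cb:
            continue  # no preference between the two hypotheses, skip the pair
        diff = [a - b for a, b in zip(ha, hb)]
        label = 1.0 if ca < cb else -1.0
        data.append((diff, label))
    return data

def train_binary_linear(data, w, epochs=10, lr=0.1):
    """Perceptron-style training of the weight vector on difference vectors."""
    for _ in range(epochs):
        for x, y in data:
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            if margin <= 0:  # misranked pair: nudge w toward the better hypothesis
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w

# Toy usage: three hypotheses with two features each and costs from some metric.
kbest = [([-4.0, -8.0], 0.1), ([-3.5, -9.5], 0.4), ([-6.0, -7.0], 0.9)]
w = train_binary_linear(pro_training_data(kbest, 100), [1.0, 1.0])
print(w)
```

Because only feature differences matter, a weight vector that classifies the sampled pairs correctly also ranks lower-cost hypotheses above higher-cost ones under w·h.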
Why do this?

• Discriminative

Questions?

What Features?

• Two solutions
  • Engineer features: as a language expert, what do I think is likely to be useful for distinguishing good from bad translations?
  • Learn features: let data decide what features should be used to solve the problem. We will use neural networks for this.

Learning Features with Neural Networks

• Neural networks take very "small" features (e.g., one word, one pixel, one slice of spectrogram) and compose them into representations of larger, more abstract units
  • Compute a representation of a sentence from words
  • Compute a representation of the contents of an image from pixels

Bengio et al. (2003)

p(e) = \prod_{i=1}^{|e|} p(e_i \mid e_{i-n+1}, \ldots, e_{i-1})

[Figure: a feed-forward network computes p(e_i \mid e_{i-n+1}, \ldots, e_{i-1}); each context word e_{i-1}, e_{i-2}, e_{i-3} is looked up in the embedding matrix C to give its "word embedding" c_{e_{i-1}}, c_{e_{i-2}}, c_{e_{i-3}}, the embeddings are concatenated into x, passed through a tanh hidden layer, projected by V, and normalized with a softmax over the vocabulary to predict e_i.]

x = [c_{e_{i-1}}; c_{e_{i-2}}; c_{e_{i-3}}]

h = \tanh(Wx + b)

p(e_i \mid e_{i-1}, e_{i-2}, e_{i-3}) = \frac{\exp v_{e_i}^\top h(e_{i-1}, e_{i-2}, e_{i-3})}{\sum_{e' \in V} \exp v_{e'}^\top h(e_{i-1}, e_{i-2}, e_{i-3})}

h is a feature function that extracts a vector of features from a trigram of words.
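A minimal NumPy sketch of the forward pass just described: look up the three context embeddings in C, concatenate them, apply a tanh layer, and take a softmax over the vocabulary. The vocabulary size, layer widths, and random initialization are made-up toy values, and training by backpropagation is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
V_SIZE, D, H = 1000, 32, 64           # toy vocabulary, embedding, and hidden sizes

C = rng.normal(0, 0.1, (V_SIZE, D))   # word embeddings: row j is c_j
W = rng.normal(0, 0.1, (H, 3 * D))    # hidden layer weights
b = np.zeros(H)
V = rng.normal(0, 0.1, (V_SIZE, H))   # output projection: row j is v_j

def nlm_prob(context_ids):
    """Distribution over e_i given the word ids of (e_{i-1}, e_{i-2}, e_{i-3})."""
    x = np.concatenate([C[j] for j in context_ids])  # x = [c_{e_{i-1}}; c_{e_{i-2}}; c_{e_{i-3}}]
    h = np.tanh(W @ x + b)                           # h = tanh(Wx + b)
    logits = V @ h                                   # v_e . h for every word e in the vocabulary
    logits -= logits.max()                           # for numerical stability
    p = np.exp(logits)
    return p / p.sum()                               # softmax

p = nlm_prob([17, 4, 256])
print(p.shape, p.sum())   # (1000,) 1.0
```

In Bengio et al. (2003) the parameters C, W, b, and V are trained jointly by maximizing the log-likelihood of a corpus, and the learned rows of C are exactly the word embeddings discussed earlier.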
What Features?

• Two solutions
  • Engineer features: as a language expert, what do I think is likely to be useful for distinguishing good from bad translations?
  • Learn features (induce features): let data decide what features should be used to solve the problem. We will use neural networks for this.

Learning Features

• Neural networks are, of course, nonlinear models with particular parametric forms
• But the view that they induce features is important because, while tasks do different things, we often use the same features in many tasks
  • Key opportunity: some tasks have a lot more data to learn features from
  • Multitask learning: learn the parameters for many tasks jointly
  • The go-to "auxiliary" task for language is: language modeling

Devlin et al. (2014)

• Turn Bengio et al. (2003) into a translation model
• Conditional model; generate the next English word conditioned on
  • the previous n English words you generated
  • the aligned source word, and its m neighbors

p(e \mid f, a) = \prod_{i=1}^{|e|} p(e_i \mid e_{i-2}, e_{i-1}, f_{a_i-1}, f_{a_i}, f_{a_i+1})

[Figure: the same feed-forward architecture as Bengio et al. (2003), with the source alignment and its context (m = 4) embedded via D and the target context (n = 4) embedded via C, concatenated into x and followed by the tanh hidden layer, the projection V, and a softmax over e_i.]

Neural Translation

• We know how to use neural networks to condition on unbounded history (e.g., RNNs, LSTMs) without engineering features (e.g., n-grams)
• Translation models can be thought of as a "conditional" language model, i.e., a sequence model that conditions on its past + a source sentence

[Example: the model predicts the next word of "I drank a …" (here "Coke") while conditioning on the source sentence "ich trank eine Cola".]

Neural MT

• Two big questions:
  • How do we represent the sentence / pieces of the sentence?
  • How do we influence the prediction of the language model component?

Encoding Sentences

• "Additive" architecture: the sentence vector is the sum of the word vectors (The + sun + is + shining)
• "Recurrent" architecture: an RNN reads the words in order (Sutskever et al.)
• "Recursive" architecture: vectors are composed bottom-up over a syntactic parse (S, NP, VP, DT, NN, VBZ, JJ) (Socher et al.)
• "Convolutional" architecture: vectors for adjacent spans are composed by convolutions (Blunsom et al.)

Decoding Vectors

• High-level decision
  • The input is represented as a single vector (see the sketch after this list), or
  • the representation of the input evolves over time (usually computed via some "attention" mechanism)
• Many low-level decisions
  • Do we just have two inputs to the RNN at each time step?
  • Do we use the input to "gate" the RNN recurrences?
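To make the single-vector encoder/decoder concrete, here is a minimal NumPy sketch of the "conditional language model" view: a plain (non-gated) RNN encodes the source sentence into its final hidden state, and a second RNN initialized with that state predicts target words one at a time. The dimensions, random initialization, and simple tanh recurrence are toy assumptions; systems of this period used LSTM/GRU recurrences with parameters learned from parallel text.

```python
import numpy as np

rng = np.random.default_rng(1)
V_SRC, V_TGT, D, H = 500, 500, 32, 64     # toy vocabulary and layer sizes

E_src = rng.normal(0, 0.1, (V_SRC, D))    # source word embeddings
E_tgt = rng.normal(0, 0.1, (V_TGT, D))    # target word embeddings
W_enc = rng.normal(0, 0.1, (H, D + H))    # encoder recurrence weights
W_dec = rng.normal(0, 0.1, (H, D + H))    # decoder recurrence weights
W_out = rng.normal(0, 0.1, (V_TGT, H))    # output projection

def encode(src_ids):
    """Run a simple RNN over the source sentence; return the final hidden state."""
    h = np.zeros(H)
    for j in src_ids:
        h = np.tanh(W_enc @ np.concatenate([E_src[j], h]))
    return h

def decode_step(prev_tgt_id, h):
    """One decoder step: consume the previous output word, update the state,
    and return a distribution over the next target word."""
    h = np.tanh(W_dec @ np.concatenate([E_tgt[prev_tgt_id], h]))
    logits = W_out @ h
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum(), h

# Greedy decoding of a few words from a toy source sentence.
h = encode([3, 41, 7, 99])   # e.g., ids standing in for "ich trank eine Cola"
word = 0                      # assume id 0 is a start-of-sentence marker
for _ in range(4):
    p, h = decode_step(word, h)
    word = int(np.argmax(p))
    print(word, p[word])
```

Attention-based decoders replace the single vector returned by encode with a weighted combination of all encoder states, recomputed at every step, which is the "representation evolves over time" option above.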
[Figures from Mei, Bansal, and Walter (2015): at each decoder time step the RNN receives the previous output word and the current input word.]

[Figures from Kalchbrenner & Blunsom (2013): recurrent continuous translation models.]

Encoder/Decoder model, English-German, WMT 2015

Encoder/Decoder model, German-English, WMT 2015

Questions?