>>Chris Quirk: Welcome, everybody. It's my pleasure to welcome Chris Dyer to come give a talk today. So I first knew Chris when he was a graduate student at the University of Maryland, where he became sort of the heart and the soul of the machine translation team, as Philip Resnik put it. He's done a variety of interesting work there, and since then he's gone on to do a post-doc at CMU, that's where he is right now, and so today he'll be talking about some of his recent research in machine translation. Thanks so much. >>Chris Dyer: Thanks, Chris. It's really nice to be back in Seattle too. Actually, before I went to grad school, I worked at Amazon.com, and I haven't been out here as often as I would have liked since then, so you guys got the weather perfect. Okay. So we're going to talk about machine translation today, and so consider the problem of translating this German sentence with the following cartoon semantics. Everybody's interested in semantics; mine involves cartoons. And the framework that I'm going to propose is one where we generate a bunch of possible translations for the sentence. In fact, we might even generate all possible English sentences, and then we're going to look at which ones violate certain well-formedness constraints. So there are going to be basically three kinds of errors that we might be interested in identifying. The first are lexical errors, where we've incorrectly translated the lexical semantics of the source language into the target language; here we've mistranslated hund as cat. A second class of errors is what I'm referring to generally as configurational errors, and this is basically that we've screwed up the word order of the translation. So here all the words are right, but still the semantics is wrong. And then finally, we want to have well-formed hypotheses in the target language, so they should be fluent. This is a standard language-modeling problem. And so if we exclude all of these in which we find errors, we're left with a good, adequate translation of the source sentence. And so, in general, I argue that there are three classes of constraints that we need to ensure adherence to for good translation. So, to recap: lexical, so this is the problem of getting the riverbank meaning of bank versus the financial-institution meaning of bank, right? Configurational, which is basically: are the semantic or syntactic relations preserved in the translation into the target language? And then, finally, is the output fluent, and this is the standard language-modeling problem. So as I talk about various models of various problems in translation, I'm going to make reference to these kinds of classes. This is just my claim about what you need to have a good translation. So the outline is: I'm going to talk first about a framework that lets us encode constraints like this and reason about them, and then talk about two specific case studies where I instantiate this model. First, the problem of word alignment, where, given a parallel text, we have to find how things correspond to each other, which is an important sub-problem in modern machine translation. And then a complete end-to-end translation system where we actually produce translation outputs given some foreign-language input. Then I'll conclude with some discussion. So I am working in a discriminative framework, and I'm referring to this as generate and test, purely just for sort of pedantic or pedagogical reasons, which is that there are two parts in a discriminative model.
There is a generation component, which determines the set of candidates that are searched. So what we're writing here is: we've got some input x, and we generate a space of possible labels y for that x, and this may be finite, it may be huge, it may be compact, it may be just a list, whatever; at the high level it doesn't matter. And then we have a test component, and what that's going to do is, for each y produced by our generation component, we're going to evaluate the candidate for goodness, and so this is generate and test, or discriminative modeling. So the particular framework that I'm going to work in is going to assume that we have a test function that compares source sentences, which I'll write as S, and translation candidates T, which are going to be translations of that sentence, and we're going to parameterize this using a linear model with two components. The first is a feature vector function, h, and what this is is a vector of real-valued feature functions that is engineered -- at least for the start of this talk, I assume it's engineered -- to encode domain knowledge about what is likely to distinguish good from bad translations. So these are going to be features that encode our knowledge of lexical semantics, or cross-lingual lexical semantics. You can think about features as hypotheses about what makes a translation good or bad. The second part is a weight vector, w, and this is going to be learned automatically from data -- example pairs of sources and targets -- and each component is roughly going to indicate how predictive the corresponding feature is for the data at hand. So the first part is going to be something we write as engineers based on our knowledge of a problem, and the second part is going to be learned by fitting this function to the corpus of example translations. So what do, for example, lexical features look like? In this input-output pair, we've got man biting a dog incorrectly translated as man biting a cat, and so how would we engineer a feature to detect a violation like this? One possibility might be to say, well, if there's some word in the source sentence that's hund, and some word in the target sentence that's cat, let's fire this feature with value one; otherwise it's going to have a value of zero. And indeed, this will have a value of one in this sentence, and in the correct translation, man bites dog, it will have a value of zero, so, if we learn a negative weight for it, we can say, well, this is a bad translation. Yeah? >>: What is the distribution you use to train the problem? >>Chris Dyer: I'm not saying anything about distributions right now. I'll talk about the specific learning criteria. I'll get to that in a second. The problem, though, is that a feature like this is inadequate, because we might imagine a sentence about dogs chasing cats, and in that case hund would correctly, in a correct translation, co-occur with the word cat; in fact it's probably likely to. So the way we might deal with this is by introducing latent variables, and these are going to be some additional structure, in addition to the output variable and the input, that we can also engineer to have any kind of form that we want. Here, I'm just drawing links between words in the input and words in the output.
They could be trees, they could be semantic structures, they could be anything we want, up to sort of computational considerations, and we're going to update our test function to now evaluate not only the input and output, but also the alignment. And if we have these alignment links from source words to target words, we can now reformulate this feature so that it basically sums over all of these alignment links and says: if the source end of one of these alignment links is the word hund, and the target end of that alignment link is the word cat, then increment the value of the feature. So every time hund aligns to the word cat it will go up, and it will be zero otherwise. And so, in this case, we can fire for this violation, but we can also conceivably have sentences where hunds and cats happily coexist. So, to answer your question about the models that I'm using here: I'm going to be working in a probabilistic framework. Probability provides an elegant account of latent variables; we can exploit general algorithms for inference and learning, things like expectation maximization, very easy stuff; and also in machine translation there are a lot of generative models with a similar structure that are also probabilistic, and so this means we can sort of compare directly to these generative baselines. So, if you have more questions I'll be happy to answer. So the first application of this is the problem of word alignment. The setting is that very often we have large corpora of sentences that we know to be translations of one another, and what we would like to do is determine which word in each sentence means the same thing as which word in the other sentence. So basically we want to infer links between the two like this. This is a problem where we're going to be focusing just on lexical and configurational constraints: since we are observing both the source and the target, we are not going to worry about modeling the grammaticality of the target; we're just going to figure out what's a convincing explanation of the lexical relationships between the two languages and their configuration. So in particular I work on unsupervised alignment, where we don't assume that we have any example alignments between the source and target languages given at training time; we just have big corpora of source and target sentences. And the way this generally works is that we're going to model translations, so we're going to model the sentence pairs, which are observed in our training corpus, and we're going to use a latent alignment variable, which is of course unobserved, but we're still going to be maximizing the probability, or some penalized version of the probability, of the observed training data. In terms of work in machine translation, this has been there since the beginning. However, my work is different in that rather than a generative model, I'm using a discriminative model to tackle this problem. So I'm building on the framework proposed all the way back at the beginning by Brown et al., which assumes a finite-state generation function, and in this case we're going to condition on the bottom string and generate the upper string. And the way this works is we go one at a time: we pick a particular word in the source and then we generate each target word conditionally independently of the others given the alignment. So we just proceed from left to right through things.
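To make the scoring concrete, here is a minimal sketch, in Python, of the kind of global linear model just described: a score over a source sentence, a candidate translation, and a latent alignment, with a single hund-to-cat link-indicator feature. The feature name and the weight value are purely illustrative, not the actual features or weights from the system described in the talk.

```python
# A minimal sketch (not the speaker's actual system) of the "generate and test"
# scoring: a global linear model over a source sentence S, a candidate
# translation T, and a latent alignment A (a set of (i, j) links).

def h(source, target, alignment):
    """Feature vector: here just one lexical link-indicator feature that counts
    how often the (hypothetical) bad pair hund->cat is linked."""
    feats = {}
    for i, j in alignment:
        if source[i] == "hund" and target[j] == "cat":
            feats["link:hund->cat"] = feats.get("link:hund->cat", 0.0) + 1.0
    return feats

def score(w, source, target, alignment):
    """Linear model: dot product of learned weights w with the feature vector."""
    feats = h(source, target, alignment)
    return sum(w.get(name, 0.0) * value for name, value in feats.items())

# Toy usage with a made-up weight: a negative weight penalizes hund->cat links.
w = {"link:hund->cat": -2.5}
print(score(w, ["ein", "hund"], ["a", "cat"], [(0, 0), (1, 1)]))  # -2.5
print(score(w, ["ein", "hund"], ["a", "dog"], [(0, 0), (1, 1)]))  # 0.0
```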
Now, Brown et al. make a particular assumption, which is that the translation distributions for source words are independent of one another, and this is a common assumption we make in generative modeling -- well, at least until these [inaudible] methods have gotten more popular recently. It's quite bad, I argue, for word alignment. In particular, what the model sees, rather than strings of words that have any sort of meaning, is just opaque integers that don't have any content. So in particular we're missing things like, well, look at these two sentences. If we know nothing else but we observe that these two have identical surface forms, it's probably a reasonable hypothesis that they should be aligned, and if it turns out that the data doesn't later bear that out, then maybe we can disregard that hypothesis, but it seems like something we should be sensitive to. We might also have matching prefixes and matching suffixes; because of the common ancestry that German and English share, we actually have a lot of orthographic similarity for a number of common words. So in this case lange and long have a common root and sort of appear similar. And then finally we have also built lexical resources, so we have bilingual dictionaries that have been put together for other reasons for many years; we might just look one of these up and see that abschied can be translated as goodbye. Now, generative models that work just on co-occurrence statistics are quite effective. I'm not saying we disregard those or throw them out, I'm just saying we should have access to this information if it helps us improve our models. That's what I'm advocating in this part of the talk. So what are the features I'm using? Well, basically I'm starting with the word-word indicator feature. That's basically what the generative model is; that's the only thing it's looking at. So I still have that in there, and then I also have various co-occurrence scores that basically say, for any pair of words in the source and the target, how likely are they to co-occur, or how strong is their co-occurrence according to a variety of different measures. A little more linguistically interestingly, I also include things that indicate links between, for example, nouns in the source, or parts of speech in the source, to parts of speech in the target language. So we might, for example, hope to learn that nouns often translate as nouns, or that nouns don't translate as modal verbs. Actually, I should say here that I'm not using supervised part-of-speech tags. Everything I'm doing here is completely unsupervised, but you could include supervised part-of-speech tags as well. There are, of course, identical-word features, identical-prefix features, and a binned orthographic similarity feature that hopefully lets you learn that words with similar surface forms are likely to be good alignments of one another. The second class of features that's useful in modeling is what I'm calling configurational features. So let's consider the problem of translating, or aligning, English to Urdu. These are both Indo-European languages, but they're typologically quite different. The canonical word order in an English sentence is subject, auxiliary verb, object, and then an end-of-sentence period, whereas in Urdu it's subject, object, verb, auxiliary, period.
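Before moving on to the configurational example, here is a rough sketch (my own illustration, not the exact feature set from the talk) of the kinds of lexical features just listed: the word-word indicator, exact-match and shared-prefix indicators, a binned orthographic similarity score, and a part-of-speech pair indicator, all computed for a single alignment link.

```python
# Sketch of lexical feature extraction for one alignment link. The feature
# names, prefix length, and similarity binning are assumptions for illustration.

from difflib import SequenceMatcher

def link_features(src_word, tgt_word, src_pos=None, tgt_pos=None):
    feats = {}
    feats[f"pair:{src_word}|{tgt_word}"] = 1.0          # word-word indicator
    if src_word == tgt_word:
        feats["identical_word"] = 1.0                   # e.g. names, numbers
    if len(src_word) > 3 and len(tgt_word) > 3 and src_word[:4] == tgt_word[:4]:
        feats["identical_prefix"] = 1.0                 # shared 4-character prefix
    sim = SequenceMatcher(None, src_word, tgt_word).ratio()
    feats[f"orth_sim_bin:{int(sim * 4)}"] = 1.0         # binned string similarity
    if src_pos and tgt_pos:
        feats[f"pos_pair:{src_pos}|{tgt_pos}"] = 1.0    # e.g. NOUN aligns to NOUN
    return feats

print(link_features("lange", "long", src_pos="ADJ", tgt_pos="ADJ"))
```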
So if we instantiate this with an example, we'll see that to get an alignment right we're often going to need a very large amount of reordering. We would want our model to expect that, but in many parameterizations of alignment models it's hard to predict that things should be out of order. We typically are only capable of modeling that things should be in order, or maybe of not caring. We'd like a model that could actually capture that we expect things to be out of order. So how are we going to do this? One way of viewing the alignment problem [inaudible]: we can write the source sentence along the top and the target sentence down the side, and these represent alignment links between the two languages. And what we're going to do is look at bigrams of the source words, taken in the order they're visited as the target sentence is generated. And so, for example, if we look at the first two words in the target sentence, these correspond to this [inaudible], and we can extract a feature that is just the bigram [inaudible], and so every time these two words are translated in this order we will fire that bigram feature, and then we just proceed through the sentence as follows, extracting word pairs at each step. And so we end up with things like will and then period as a common bigram, or hopefully a common bigram -- we would expect it to be common for English-Urdu translation. And of course the fact that English translates into Urdu in this order is not a fact about the particular lexical items that are used, it's a fact about the syntax of the languages, so in addition we can have bigram features over the parts of speech, and hopefully this will capture some generalizations. So -- oh, yeah? >>: [inaudible] >>Chris Dyer: No, these are just what's the path taken. In practice I actually have a number of other configurational features, so the model can also learn, for example, that things at the beginning of the sentence tend to align to things at the beginning of the sentence, or not, and those things conjoined with particular parts of speech, so there are many, many overlapping features. Yeah? >>: So, this feature, when it says on the source side [inaudible], you don't care where these two words are? >>Chris Dyer: No, they could be anywhere on the source side. And in practice I actually do have features that also conjoin this with the distance between them. Well, I don't have the lexicalized ones conjoined with the distance, but I have these conjoined with the distance, so -- >>: [inaudible] >>Chris Dyer: Yes. So what the left and right side means here is that the order in which they're being translated to produce this output is [inaudible] first and then dubais. So there is an ordering there. >>: But it means that the translations are adjacent and sequential in the [inaudible]. >>: Right, but on the source there's any number of gapping between -- >>Chris Dyer: Oh! And it could have been dubais sujut [phonetic], so this order isn't the order in which we see dubais visit, for example; we go backward in the sentence, dubais visit. Okay, so I'll say a little bit about learning in this model. As I said, I'm trying to maximize the probability of just the translations, but I don't know what the alignments are.
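Here is a small, self-contained sketch (my own illustration, with a made-up alignment and token list) of the path bigram idea just described: walk through the target positions in order, look up the source word each one is aligned to, and fire a feature on each consecutive pair of source words and of their parts of speech.

```python
# A small sketch (under my own simplifying assumptions) of the "path bigram"
# configurational features: walk through the target sentence in order, look up
# the source word each target word is aligned to, and fire a feature on each
# consecutive pair of source words (and of their parts of speech).

def path_bigram_features(source, src_pos, alignment, target_len):
    """alignment maps each target position j to a source position i (or None)."""
    feats = {}
    path = [alignment.get(j) for j in range(target_len) if alignment.get(j) is not None]
    for prev, cur in zip(path, path[1:]):
        lex = f"path:{source[prev]}_{source[cur]}"
        pos = f"path_pos:{src_pos[prev]}_{src_pos[cur]}"
        feats[lex] = feats.get(lex, 0.0) + 1.0
        feats[pos] = feats.get(pos, 0.0) + 1.0
    return feats

# English source translated into a hypothetical verb-final order: the path
# visits "visit" after "Dubai", then jumps back to "will", so a feature like
# path:will_. fires even though "will ." is not a plausible English bigram.
src = ["John", "will", "visit", "Dubai", "."]
pos = ["NNP", "MD", "VB", "NNP", "."]
align = {0: 0, 1: 3, 2: 2, 3: 1, 4: 4}
print(path_bigram_features(src, pos, align, target_len=5))
```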
So what I'm going to do is sum over all of the possible alignments, and the way I do this is, say we have a given source sentence, [inaudible], I will use this finite-state generation procedure to generate a lattice of all possible translations and their alignments, which can be encoded in a lattice like this. The second step is to intersect this lattice with the target-language string, so small house. What remains here is all of the different alignments that generate this target sentence and only this target sentence, whereas the first lattice contains all possible translations and all possible alignments. >>: [inaudible] >>Chris Dyer: So I'm considering that you can translate either the first word or the second, right. So it's both all alignments and all orders, so these are very, very large lattices the way I'm defining them here. >>: [inaudible] >>Chris Dyer: No, it's going to be very big. It's still polynomial, but these are large lattices, and the reason it is not exponential is that you can translate house twice. I have no constraint on translating words exactly one time, so it's not quite as bad as the traveling salesman problem, where you can only visit each city once. As you point out, yes, it's very, very large and very slow, so the solution I take is very simple: it's just a coarse-to-fine approximation where, in that first lattice, rather than generating all possible translations under the model, I have a simpler model, and if a translation is very low probability under that simpler model I just never construct that edge. The simpler model that I use is just IBM Model 1, and I find that if I use that, the inference time drops down to about one second per sentence, and the oracle error rate that I get is less than 4% doing this, and there are probably ways to make this quite good. And I should say you can't actually get a perfect error rate because of the assumptions of the model, so it's not that I would be at zero without this approximation; I should have the gap there. But, anyway, this is the approach that I take. It gets down to about one second per sentence pair, which is reasonable; it's comparable to, for example, the decoding time of something like IBM Model 4, which is a widely used generative model. So alignment is an interesting problem to evaluate. The reason we study it is that it's an interesting problem: it lets you simplify some of the modeling assumptions that we worry about in translation. Rather than having to predict translations, we just have to explain translations, so it's nice as sort of a test bed, but it's also useful. Typically we use these word alignments to constrain grammar-learning procedures, where we build grammars that are used in more sophisticated models downstream. So if you have bad alignments in that first step, then you're going to learn less good rules in your downstream phrase-based translation models. So in this evaluation we're going to look at two things. We're going to look intrinsically at the quality of the alignments and ask: do the alignments match up with our intuitions about what should be aligned? And then we're going to use these alignments to replace alignments produced by generative models and see how that compares in standard state-of-the-art translation models that build on word alignments. So our baseline is going to be IBM Model 4, which continues to be a state-of-the-art generative model.
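As a toy illustration of the coarse-to-fine pruning just described, the sketch below scores each candidate source-to-target edge with a cheap Model 1 style lookup table and skips edges whose probability falls below a threshold. The probability table, the threshold, and the words are all made up; this is not the actual pruning code.

```python
# Coarse-to-fine sketch: before building the full lattice, score each candidate
# (source word -> target word) edge with a simple Model 1 style table and skip
# edges the coarse model considers implausible.

import math

model1 = {                      # p(target | source), hypothetical values
    ("klein", "small"): 0.6, ("klein", "house"): 0.001,
    ("haus", "house"): 0.7,  ("haus", "small"): 0.002,
}

def candidate_edges(source, target, threshold=0.01):
    """Yield only the translation edges that survive the coarse pruning step."""
    for i, s in enumerate(source):
        for j, t in enumerate(target):
            p = model1.get((s, t), 1e-6)
            if p >= threshold:
                yield (i, j, math.log(p))

print(list(candidate_edges(["klein", "haus"], ["small", "house"])))
```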
It has a very similar structure, although it is parameterized quite differently. And we're going to look at a bunch of different language pairs. So the first thing we're going to do is look at a comparison of what the models produce. This is, of course, a cherry-picked example, but it does illustrate some differences in behavior between the two that are important and interesting to note. The first thing to see here is that -- so on the left is the IBM Model 4 alignment, and you'll notice that the word dislike here has been aligned to sort of half of this Urdu sentence, and this is a well-known problem in word alignment. One of your departed colleagues, in fact, wrote a paper about this a number of years ago. And what happens is, the baseline is a conditional generative model that's trained to maximize likelihood, and you can think about this as basically every source word -- so we're translating from English into Urdu now -- every source word has one unit of probability mass to give away. And when you're maximizing the probability of the target string, one way to game that objective is to take rare words, like dislike, and just use them to explain a whole bunch of the sentence. Even though dislike isn't a very good translation of the rest of the sentence, it's a good way to improve the likelihood of the training data. We see this happening in generative models. And on the right, my alignment doesn't have this problem. >>: Are you going to say a little bit more about your parameter estimation techniques? >>Chris Dyer: Not really, no. >>: This underlying IBM model has normalized emission and transition distributions, so -- >>Chris Dyer: Right. So -- >>: Assuming yours does not. You just have a series of features that fire and you get a distribution over an alignment and source sentences. >>Chris Dyer: Right. So everything is a globally normalized model; there are just features. I do have an L1 prior on the parameters, so it encourages some sparsity. For some of the larger models there's a potential feature space of upwards of 80,000,000 features, and the active number of features usually ended up somewhere around 10 to 20 thousand. So there are some dense, highly informative features in the model that explain a lot of the data. And there's no way for this model to get away with the same abuse, because there is no probability of dislike translating into any other word. There are just features that have to come together. So it's a global score rather than a bunch of local products that -- right. >>: So we avoid some of these label biases. >>Chris Dyer: So the second thing to notice is that we manage, in our model, to have this big jump from the end of the Urdu sentence up to the middle of the English. This is exactly what we expect when aligning an SOV language, like Urdu, to an SVO language like English. And this is actually something that the baseline model is very poor at modeling. It basically just models the probability of jumping a certain distance, and if you look at enough sentences, you basically always conclude that it's a very bad idea to jump very far away in an alignment. That's the best decision the model can make; there are probably not enough parameters to model this effectively. Whereas in our model -- I'll show some evidence in a second -- we are managing to learn that this kind of jump is actually expected. So that's good to see, even though this is just a single example. So, to recall, let's think a little bit more about these features.
So I introduced these path bigram features that capture the order in which we move through the English sentence when we're translating into Urdu. And so we can look at the features and see what's been learned. If we look at the top ten most highly weighted bigram features, we see that a verb followed by an end-of-sentence punctuation character is highly weighted. So these are just the top ten most highly weighted features. And these are not highly weighted English bigrams: if we look at the English corpus, we're not going to see "will" followed by an end-of-sentence period very often. So these features are being learned not because they're present in the English, but because they are good explanations of the order of the Urdu output. So this is some good evidence that the model is doing something sensible and not just reproducing what's in the input. The second intrinsic evaluation we can do is to ask humans to annotate alignments: given a pair of sentences that are translations of each other, draw lines between the words that are translations of each other. And this is a surprisingly difficult problem even for native speakers, because it turns out that words don't typically translate exactly between languages; there's a lot of fuzziness around the edges. So typically you can get about 80% that's really easy to do, and then for the last 20% you have to write a 300-page style guide to get agreement on. But we can still do this. For one of the language pairs I was looking at, Czech-to-English, I did have some data, about 500 sentences, and then we can evaluate the performance of our model by looking at the precision and recall of its predictions versus the gold-standard alignment points. So if we look at our baseline, the error rate -- this is one minus the F-measure -- the baseline is 23.4% and we drop to 20.5%. So it's a reasonable improvement. The second number that I'm reporting here gets at the issue of garbage collection a little bit. The garbage collection phenomenon occurs with rare words in the source language: words that occur one time are likely to be victims of this garbage collection problem. So what I did here was just compute the average number of words that each singleton type aligned to. In the generative baseline we see that it's 2.7 words, and in the discriminative model it drops to 1.6. So I'm not saying that the alignments are necessarily better, just that there are fewer of them to these rare words, which is probably a good sign. Yeah? >>: [inaudible] >>Chris Dyer: So garbage collection is this phenomenon here. So dislike turns out in this corpus of English to be very rare, it only occurs once, and we see that it's been aligned in the baseline model to a lot of words. And this is a particularly well-known problem with these generative models. >>: [inaudible] >>Chris Dyer: No, this is the model prediction. This is the output. This pair of sentences was in the training [inaudible]. >>: So [inaudible] >>Chris Dyer: So there are various kinds of priors and regularizers you can put on these generative models, which will also help address this issue. I'm just using the standard formulation of them. So at any rate, this is weak evidence that the model is doing something a little more sensible. The gold standard in alignment evaluation, though, is looking at the performance of the alignments when they are used to construct a state-of-the-art translation system. What I've done here was compare three conditions.
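A quick sketch of the two intrinsic measures just mentioned: an error rate taken as one minus the F-measure against gold alignment points, and the average fan-out of singleton source words used as a garbage-collection diagnostic. This treats every gold point as a sure link, which is a simplification of how such evaluations are usually run; the toy data is made up.

```python
# Intrinsic alignment evaluation sketch: F-based error rate against gold links,
# plus the average number of target words each singleton source word aligns to.

from collections import Counter

def alignment_f_error(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return 1.0 - f

def singleton_fanout(corpus):
    """corpus: list of (source_tokens, alignment_links) pairs; links are (i, j)."""
    counts = Counter(w for src, _ in corpus for w in src)
    fanouts = []
    for src, links in corpus:
        for i, w in enumerate(src):
            if counts[w] == 1:                       # singleton source type
                fanouts.append(sum(1 for (a, _) in links if a == i))
    return sum(fanouts) / len(fanouts) if fanouts else 0.0

corpus = [(["ein", "hund"], [(0, 0), (1, 1)]), (["der", "hund"], [(1, 1)])]
print(alignment_f_error({(0, 0), (1, 2)}, {(0, 0), (1, 1)}))  # 0.5
print(singleton_fanout(corpus))                               # 0.5
```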
One, the baseline alignments; two, the discriminative alignments; and a third condition where we combine the alignments from both the generative baseline and our discriminative model and use both to extract translation rules in a hierarchical phrase-based translation model that's currently widely used. And we evaluate them using three different standard MT evaluation metrics; the arrow indicates whether up is better or down is better. And we see that over the generative baseline we see relatively small but reliable improvements. And then, interestingly, when we combine the alignments from the generative and discriminative models, we see fairly substantial improvements. This is on a Czech-English translation set with one reference, and one BLEU point there, for example, is quite good. >>: So for the baseline [inaudible] >>: Oh, yeah. So I didn't say this at the beginning, but yeah. So in the methodology, what we're learning are directional models. There is a source language and there is a target language in the model; we condition on the source and predict the target. So we train one model in the forward direction, one in the reverse direction, and then we symmetrize the alignments produced by the two models. We just used a fairly standard technique to do this; there are more interesting approaches that have been proposed recently. Model 4 is the same. It's also a conditional model like this. We are doing the same kind of symmetrization. >>: [inaudible] is also directional and -- >>Chris Dyer: Yes. So it's got the exact same kind of latent output space. It predicts word-to-word, or a single word on the source side aligned to potentially multiple words on the target side. >>: I understand why IBM Model 4 is directional, but [inaudible], why does it need to be directional? >>Chris Dyer: So in a discriminative model you can almost think of drawing an arrow from what you're conditioning on to what you're predicting, and it's normalized with regard to that. So often in discriminative models you -- I mean, I could've jointly modeled it. It would have been -- the inference problem was hard enough conditioning on the source sentence; trying to predict both sides and sum over all possible English and French sentences, for example, would've been fun but a little terrifying. So just to recap here, I'm just showing one slide of numbers on these translation results. We've seen the exact same pattern of results in a number of diverse languages with different data sizes and different language typologies. We've done some fairly elaborate significance testing of these various outcomes; I think we can say at this point that these results are reliable. >>: [inaudible] >>: So actually we put the two -- we just made the corpus twice as big and used one set of alignments for the first half and the other set of alignments for the second half. >>: [inaudible] >>Chris Dyer: So this is a trick I found is good, but one interesting thing is that when you combine multiple generative models -- if you combine Model 4 with [inaudible], for example -- you don't see these improvements. You need some kind of diversity. This is an open problem, really: what is it? These rule [inaudible] -- I don't know how productive it would be to work on this problem of saying what is the right kind of alignment set to extract rules from, but there is sort of a sense that having a diverse set of word alignments is good.
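The talk says a fairly standard symmetrization technique was used without spelling it out; the sketch below shows only the two simplest options, intersection and union of the two directional link sets, as a stand-in for the actual heuristic (which in practice is often a grow-diag-style procedure).

```python
# Simplified symmetrization of two directional alignments. Not the exact
# heuristic used in the experiments; intersection and union only.

def symmetrize(src_to_tgt, tgt_to_src, how="union"):
    """src_to_tgt: set of (i, j) links from the forward model.
    tgt_to_src: set of (j, i) links from the reverse model, flipped to (i, j)."""
    flipped = {(i, j) for (j, i) in tgt_to_src}
    if how == "intersection":
        return src_to_tgt & flipped      # higher precision, lower recall
    return src_to_tgt | flipped          # higher recall, often used before rule extraction

forward = {(0, 0), (1, 2)}               # forward-model links (i, j)
reverse = {(0, 0), (1, 2)}               # reverse-model links stored as (j, i)
print(symmetrize(forward, reverse, "intersection"))   # {(0, 0)}
print(symmetrize(forward, reverse, "union"))          # three links
```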
So it seems like what we're getting are different kinds of word alignments. And indeed, when we look at the kinds of improvements that we see here, we're actually doing better at translating fairly rare entities. The generative baseline, it appears, is systematically not having translations for a lot of rare words, because every time a rare word occurs in the source language it tends to garbage collect, and because of the standard rule extraction procedures used, that basically blocks extraction from that sentence. So it can't learn anything when you have that garbage collection. And what we're seeing here, when I did some manual error analysis, is that fairly rare words that were out of vocabulary previously are now getting translated. So not only are these slight improvements, these are good improvements: we're translating high-value words. I mean, all words are high-value, but most of these evaluation metrics penalize getting an article incorrect and getting a content word incorrect to the same degree. There is still of course an intuitive sense that you want to have at least the content words correct in your translation. >>: So there was a step from just a simple multinomial distribution to a feature distribution, another step from a set of feature distributions to a globally normalized distribution over [inaudible] alignments, right? >>Chris Dyer: So I didn't go through that middle step. I just went straight from -- my sort of straw-man foil is this Model 4, which is a bunch of multinomial distributions. You can reparameterize those -- >>: An MEMM instead of a CRF. >>Chris Dyer: As an MEMM, or even just an HMM with a log-linear parameterization of each of the multinomials. So the Berkeley guys did this. I do have -- >>: [inaudible] intuition about what gain comes from which? >>Chris Dyer: No, and this is something I'd like to explore a little bit. I include features that can't be formulated in the directed models because they would induce cycles, so the global normalization at least gives me freedom. Part of the strength of this model is it lets you formulate features based on sort of your intuitions; you don't have to change the structure of the models. So, if you look at what IBM had to do to go from Model 2 to Model 3 when they introduced fertility -- although I think fertility actually came first, they developed Model 1 and 2 later and pretended they came up with Model 2 first -- they had to completely change the structure of the models just to get a new kind of feature. Here, while features may still change how difficult your inference is, the undirected, globally normalized nature of the model means that you can potentially include any features you want. So the idea here is to decouple your creativity from the structure of the model. Whether that's good or not, I don't know. >>: I understand the [inaudible] between Model 4 and your model. What about Excel, what about previous [inaudible] -- >>Chris Dyer: Yeah, so the big difference between my work and the previous work is they assume that there are alignment points given in the training data. This makes no assumption that such data exists, and that's the main difference. >>: [inaudible] >>Chris Dyer: Exactly. So in standard discriminative frameworks, they're discriminating the alignment grid; I'm discriminating the target words and happen to have a latent variable. Okay. All right.
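As a toy illustration of the globally normalized scoring being contrasted with locally normalized MEMM-style alternatives in the exchange above, here is a sketch in which the probability of a whole alignment is its exponentiated score divided by a partition function over a candidate set. The explicit candidate list and the scores are made up; in the real model the candidates live in a packed lattice and expectations are computed by dynamic programming.

```python
# Globally normalized (CRF-style) probability over whole alignments:
# p(A) = exp(score(A)) / sum over candidate alignments of exp(score(.)).

import math

def global_log_prob(score_fn, candidate, candidates):
    """log p(candidate) under a globally normalized model over `candidates`."""
    scores = [score_fn(c) for c in candidates]
    log_z = math.log(sum(math.exp(s) for s in scores))   # partition function
    return score_fn(candidate) - log_z

# Hypothetical global scores for three alignments of one sentence pair.
toy_scores = {"A1": 2.0, "A2": 0.5, "A3": -1.0}
print(global_log_prob(toy_scores.get, "A1", list(toy_scores)))
```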
So just a little bit of ongoing work in this area. I've been relying completely on lexical features that I've written down by hand -- I know about language, I actually have a Ph.D. in linguistics, so I like to write down my own features -- but I'd like to know: can we learn representations of words that work well, for example? So one very promising approach, I think, is to make use of vector-space lexical semantics. The basic idea is that each word is embedded as a point in some high-dimensional -- or some relatively low-dimensional, actually -- vector space, and points that are close to each other in this space have similar meanings. And so then basically what we've got is the problem: can we translate the vector representation of the word hund into the vector representation of the word dog? And how might this look? The answer is yes, there is at least a way to set the problem up that works quite nicely. The basic idea is that we assume some embedding of the source and target words that comes from somewhere -- this has been an active area of research for many years, so we'll just reuse those -- and then we're going to formulate translation as just matrix multiplication. So we're going to take the transpose of the source-language word vector and multiply it by this interaction matrix, which will produce a vector in the target language. And we will just compare this target-language vector to the words in our target-language vocabulary, with a dot product for example, and this will give us some score, which we can exponentiate and normalize in the standard way. So if we do this, we end up with what I call a log-bilinear translation model, and basically what we've done here is, instead of taking a dot product between a learned weight vector and a feature vector, we have this term, which is the bilinear product of the source-language vector, this translation matrix, and the target-language vector. And the interesting thing about this model is that we can train these vector embeddings from large, unlabeled, monolingual data in the source and target language, which we often have copious amounts of. The hope is that we can learn this interaction, this translation-model matrix, from bilingual data, which is typically small and difficult to construct. Since we only learn d-by-d parameters, it's typically a much, much smaller space of variables that need to be learned than we normally have in translation. So the hope is that we can learn to reliably translate words in the source language that were never observed in the parallel corpus we're learning from, because we know something about their semantics based on their distributional profile or whatever construction we've used. Yeah? >>: I think this is really pretty, but what happens when we have sense issues, right? The English word [inaudible] can be translated in German as either [inaudible], depending on whether you know somebody or something, right? And it seems like if I were to learn a monolingual distribution for [inaudible], those might land in different places, where the English word is going to land in a single place. >>Chris Dyer: Right. So -- >>: How do you deal with these issues? >>Chris Dyer: So the translation model is working in that low-dimensional space. So there would be a part where "know" would have a single representation, and there would be some kind of activation in basically both parts of the target-language vector. So there would be some ambiguity.
So basically you would still want to use a language model on the target side to distinguish between things that are ambiguous, as we always do in translation. Yeah? >>: How do you determine the dimension of the vectors, and why [inaudible] language? >>Chris Dyer: They don't have to be. This is just where I started. I basically determined the dimension because there was a data release a couple years ago from somebody who compared a whole bunch of these and I could just download them, and they were in 50 and 100 dimensions, so I used 50 and 100 dimensions. But you're right. So I actually think -- I think this model isn't going to work particularly well as is. I think we need slightly richer representations over here, so I may add this as a term to the current formulation that I've got, so you can have both the energy term and the standard one. And also there are several different ways to construct these low-dimensional embeddings of words, and I think, in addition to learning this interaction matrix, we're probably going to want to take, say -- okay, here are the [inaudible] embeddings, here are the [inaudible] embeddings -- and learn another matrix that projects those down into a d-dimensional space that's actually used for translation. So here are some of these embeddings, or here are some SVD embeddings; we're going to get several low-dimensional things and then we're going to basically have something -- it'll look a little bit more like a multilayer neural network or one of these deep architectures. But I don't know how to set the structure. There are ways to learn the structure of these things. Again, inference is really scary even when you don't have the latent variables that I have. So I don't know how well that would work; I have, as of yet, very few intuitions about what the right architecture is, but I think it sounds promising. Yes? >>: [inaudible] >>Chris Dyer: So right now I'm assuming these come from somewhere else. I use a standard off-the-shelf technique for learning low-dimensional embeddings of word meanings. So you basically often do something like build a very large vector based on the contexts that a word occurs in. So you get every occurrence of the word hund in a large German corpus and ask what occurs to the left, to the right, and within five words, and you run an SVD on that, and that's one way of learning a low-dimensional embedding. And then I just define that to be phi(hund). But I don't, at that point, [inaudible] fixed when I learn W. And the reason is I want to be able to learn phi from a large amount of data -- I don't want to just be restricted to what I've got in the parallel corpus to learn those meanings. >>: So the way you generate this phi function, I assume, is a kind of generative model rather than a [inaudible]. >>Chris Dyer: No. >>: [inaudible] >>Chris Dyer: I haven't thought too much about how to deal with OOVs here. The idea is that you're going to be able to learn phi from such a large amount of data that it will be a much rarer problem than when you learn translation models from just a small amount of parallel data. So the idea is you'll get some embedding for a very, very large number of words -- everything on the Internet, say. >>: Do you have any idea how this framework would handle [inaudible]? >>Chris Dyer: Oh, that's a good question. I haven't thought about that yet. That's a really interesting idea.
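Here is a tiny numpy sketch of the log-bilinear scoring described above: the score of a target word is the source embedding times an interaction matrix times the target embedding, exponentiated and normalized over the target vocabulary. The embeddings, dimensions, and random matrix are made up; in the actual setup the embeddings would be pre-trained monolingually and only W learned from parallel data.

```python
# Log-bilinear translation sketch: p(t | s) proportional to
# exp(phi_src(s)^T W phi_tgt(t)), with made-up toy embeddings.

import numpy as np

d = 4
rng = np.random.default_rng(0)
src_emb = {"hund": rng.normal(size=d), "katze": rng.normal(size=d)}
tgt_emb = {"dog": rng.normal(size=d), "cat": rng.normal(size=d)}
W = rng.normal(size=(d, d))          # the only parameters learned bilingually

def translation_probs(src_word):
    projected = src_emb[src_word] @ W                 # map into target space
    scores = {t: float(projected @ v) for t, v in tgt_emb.items()}
    z = sum(np.exp(s) for s in scores.values())       # softmax normalization
    return {t: float(np.exp(s) / z) for t, s in scores.items()}

print(translation_probs("hund"))
```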
So the question was: this is for lexical semantics; how would you do this to model something like [inaudible]? There's some really cool work on -- yeah, I don't know. That's an interesting idea. These have been applied very successfully to build very good language models, and for translation modeling we're just conditioning on something else; it's the same problem. You had a question? >>: [inaudible] >>Chris Dyer: It's hard to say. Since I'm evaluating this as an alignment model, I don't ever really do predictive inference in the model. I mean, I have run it. It's not a very good translation model. It's not as bad as you might think it would be, perhaps, but it's still eight BLEU points off of a phrase-based system. >>: [inaudible] >>Chris Dyer: Yeah, it's hard to say. There are no constraints on the features, though. So you can definitely include context when you're making a prediction, so you can conjoin a word with a part of speech, or the word to the left or right, or a topic or something like that. And I've tried those things. You can definitely improve the likelihood. Alignment quality plateaus fairly quickly, unfortunately. Yeah, yeah, yeah. Sorry. Okay. So quickly now -- this part of the talk is much quicker. Alignment is nice, but it's not the whole enchilada. So now we're going to talk about actually doing translation, and we're going to focus on these configurational constraints. So one of the reasons why translation works as well as it does today is that we don't have to translate things word-by-word like we were doing in the first part of the talk. We can, for example, translate this sentence from Japanese, John apple eat, into John ate an apple, by translating apple eat as a unit. And this is a great model, this gives us state-of-the-art performance, and we got a good translation here, but if we rely just on this to get the outputs, the generalization is very poor. So what we want is the ability to do some kind of reordering, of course, in cases when we don't have memorized phrases that tell us how to put things together in the target language. So what we want is something like this: in some cases we may have a phrase that does the reordering for us, in some cases we may need to do the reordering ourselves. So this is why reordering is a challenge. It's also a challenge because it's a really, really hard problem. Searching over permutations is about the hardest thing you can do in computer science; it's pretty easy to show that it reduces to a traveling salesman problem. If you want to compute expectations over a bunch of permutations, as you probably do if you want to learn, it's even worse: you get into all these horrible counting problems. So basically what this means is that when you're doing inference in models of permutations, like we have here, you have to work with some kind of approximation. And a popular one in MT is to search an exponential number of permutations using a context-free structure. The basic idea is that you have some kind of tree structure and you consider local swaps within that tree structure. And so my contribution here is to constrain the reordering that's considered by a source-language parse tree. The intuition is to think about parse trees as these [inaudible] that let you permute the daughters of individual nodes. Here we've got this sentence, a dog chased John, which is, of course, an SVO sentence. We'd like to translate this into Japanese.
So what we'd like to be able to do is know that we need to switch John and chased. So we need to do something like this. Now, as I said, in the phrase-based world, where we might have learned dogs chased John and can just use that whole unit all at once, we actually don't know if we want to reorder this; we don't know until we try to translate it, in fact. So what we're going to do is expand this tree. In my approach to reordering, we take all of the non-terminal productions -- all of the rules in the tree that have multiple nodes on the right-hand side -- and we consider all permutations of them locally. In this case we now have a forest, which I've encoded up here in this picture, that considers both orderings of the verb phrase. And now we're going to translate the sentences: we're going to allow the model to translate any of these sentences as the possible source. So which sentence in this grammar that we've just constructed has the most target-like word order? If we had examples where somebody had written down the sort of most English-like reordering of a Japanese sentence, we might be able to train a model like this directly. In fact, some people have done things like this; there was a nice paper from Google last year that did this. But, as I tend to do, I like to learn things like this as a latent variable. So rather than assuming that I can observe this sort of English-prime order when I'm translating into Japanese, I assume that I have a model that says how likely this English sentence is to be reordered into this English prime before it is translated. Then I can model this just by saying: how likely is the observed target string, given some source sentence, summing over all possible reorderings? So because of the assumptions I've made, I'm going to encode the space of reorderings of the source language in this context-free structure, which is derived from a source-language parse -- I'm assuming the existence of a source-language parser. And it turns out that you can intersect this context-free space with a translation model represented as a finite-state transducer using a variant of Earley's algorithm; it's just a slight generalization. The exhaustive computation of this intersection takes place in polynomial time, and it turns out that, unlike the previous problem, where even though it was polynomial time it was too expensive, exact inference is possible here. It's very, very fast. You don't have to do any pruning, which is nice. So this lets us construct a context-free forest containing all possible translations and all possible reorderings of a particular source sentence. So the model I'm using consists of just six coarse phrase-based features -- I'm not making any particular innovation there -- a standard trigram language model, and then very simple configurational features, where I just fire a particular indicator feature each time that just says what rule was used. So in the forest there might be a rule that says VP rewrites as V NP, or its permutation, VP rewrites as NP V. There's nothing there except the source-language word order; there's no information that comes from a target-language parse tree. We're just going to have features that say what it looks like when a VP rewrites, typically. So obviously we can imagine constructing a lot more informative and interesting features here. In the experiments I'm going to look at three different cases. So the first is Arabic-English.
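To make the daughter-permutation construction concrete, here is a toy sketch that, for a tiny hand-built parse tree, enumerates every source word order reachable by permuting the children of each node. The real system packs these orderings into a forest and intersects it with a finite-state translation model rather than enumerating strings; the tree and labels below are made up for illustration.

```python
# Enumerate the reorderings licensed by permuting the daughters of every node.
# This is an explicit enumeration for a tiny example, not the packed forest.

from itertools import permutations, product

def reorderings(tree):
    """tree: either a word (str) or a (label, [children]) pair."""
    if isinstance(tree, str):
        return [tree]
    _, children = tree
    results = set()
    for order in permutations(children):                       # permute daughters
        for parts in product(*(reorderings(c) for c in order)):
            results.add(" ".join(parts))
    return sorted(results)

# (S (NP a dog) (VP chased (NP John))): an SVO source whose verb-final
# reordering, "a dog John chased", becomes reachable for translation.
tree = ("S", [("NP", ["a", "dog"]), ("VP", ["chased", ("NP", ["John"])])])
for s in reorderings(tree):
    print(s)
```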
And Arabic and English are actually not a very difficult translation pair when it comes to reordering. There are some local differences: adjectives follow the nouns they modify. And the one sort of large-scale difference is that verbs are typically the first thing in the sentence, so to translate into English you need to move that verb from the beginning of the sentence to somewhere in the middle of the English sentence. The second set of experiments looks at Chinese and English, and we're going to look at a very small corpus of training data and then a much larger one. Chinese and English is a language pair where word order matters a lot. The local word order tends to be rather similar to English -- adjectives come before the nouns they modify -- but some larger structural units are in different places. Relative clauses and prepositional phrases come to the left of the things they modify in Chinese, whereas in English they come after. And this is very important, because we typically would have to memorize phrasal units like a noun together with the CP that modifies it, and that's just too much information -- these are too many words to have memorized. So a good translation model for Chinese needs to be able to move these large structures around as abstractions, not just by memorizing them. Here, first, is a list of what happens when we train this model to explain the English output of a Chinese input using the configurational features I talked about, on top of this daughter-node permutation procedure that I introduced earlier. The two things that I've got bolded here indicate that the model has learned to put prepositional phrases after the VPs they modify and to put CPs, which are what the Chinese Treebank calls relative clauses, after the NPs they modify. And, importantly, this is the English word order, not the Chinese word order. So, again, we're seeing that these latent-variable models can be used not just to reconstruct what's in some sense going on on the source side, but to explain what's observed as the target language. And the rest of the features are sensible, and are cases where Chinese and English do agree. Okay. So we're going to look at two evaluation criteria for the translation model: first, the model size, in terms of the number of rules, and second, the translation quality, just in terms of BLEU this time. So, on translation model size, compared to synchronous context-free grammar baselines, which are the current sort of state of the art in translation, the finite-state translation model that I'm using in my proposed approach is about an order of magnitude smaller, and that can obviously be very important for a variety of reasons. And the way to think about this is that the SCFG model in some sense is making the translation model do two things at once: it's making it come up with a model of reordering, and it's making it, of course, deal with lexical translation. So in a sense it is the product of the FST and the CFG that we're using for reordering, and it's just, you might say, over-lexicalized. Now, does this benefit in model size come at a cost? We see a couple of things. For Arabic and English, the first line here, the model that I've proposed doesn't work quite as well as the synchronous context-free grammar baseline; it's down a little bit.
But for Chinese and English, in both the low-data condition and the larger-scale data condition, we're already seeing that this model outperforms the current state-of-the-art synchronous context-free grammar models, which are shown here, and just with these very, very simple features that decide what order to put the Chinese in before translation. And so this is very promising: we don't really need these massive models to do as well at translation as we often have been building; we can still rely on things that look more like standard phrase-based translation models. So that's nice to see. So for languages where you have large-scale reordering, this model seems to work well. Okay. I know this has gone long, but if you like any of this stuff, all of it is available in some software I've released. It's all open source; you can try it today. And I'm going to conclude by saying: okay, generative models are great. They've really dominated unsupervised learning, but the assumption that the generative process and its parameterization have to be isomorphic is, I think, holding us back. In particular, we often end up coming up with computationally tractable approximations of what we think the phenomenon is that we're trying to model. So instead of generative stories, we have creation myths. And generate-and-test models let us recover from that a little bit. This basically just says that we can separate the process of generating a set of candidates from the evaluation of which of those are good. And, importantly, in models like this, when we're constructing a hypothesis about what is a good output, we don't have to have a whole generative or complete account of how some data came to be. Our model doesn't have to be true. We just have to declaratively state which variables correlate and how; we don't have to say why, and we don't have to have a complete account. If we want to model English morphology's relationship to tense, we don't have to have a complete account of temporal semantics and morphology to put this into a model. We just need to say, here are some features that look at tense, here are some features that look at English morphology, and then if this is a good hypothesis, you'll get a better fit to the data, and if not, the feature will be regularized out. So we don't have to know why things correlate. Basically, the theories that we rely on can be partial and descriptive, and that's really where we are in language these days: we don't have complete accounts of the phenomena we're interested in modeling. Here we don't need to. So when we're doing learning, especially unsupervised learning, we need to be able to build in all the knowledge that we can, and feature engineering is a really, really powerful tool to do that. So I'll conclude with that and thank my many collaborators in this work. So thanks. >>: Thank you. [applause] >>: We had a long discussion section in the middle, but if anybody has remaining questions. >>: So how much of the feature engineering do you think has to be language specific? I was noticing you had a constraint blocking nouns from [inaudible] -- >>Chris Dyer: Oh, no, no, no. That was an example. So I just had indicators that fired when a noun, or any part of speech, was hypothesized to link to any other part of speech; I just said I think part-of-speech alignments are useful signals for alignment.
And those were examples that I thought might be reasonable to assume. >>: [inaudible] >>Chris Dyer: Right. So the great thing about these models is that the features are just hypotheses about correlations, and the data has the ultimate say in whether they're useful or not. So in different language pairs features will have very different weights. A good example of that were those path bigram features, where in English-to-Urdu translation we see auxiliaries followed by a period as a high-weight feature; that's very different from what you'd see in Chinese. But, more broadly, to your question about being language-pair specific: you can be surprisingly agnostic about this stuff. I did have a few things that assumed that the two languages shared a common script, so that things like edit distance were meaningful, and since I was working on things like Urdu, I did some kind of fake Romanization to make that assumption a little more reasonable. You don't have to have those features, though; you can throw other things in. And to some extent that's really what motivated this completely unsupervised learning of the representations: can we learn everything and not do much engineering at all? >>: [inaudible] so we're talking about situating this in the space of reordering -- Google's paper? >>Chris Dyer: Yes, yes, yes. >>: That's not what you're doing, actually. >>Chris Dyer: No, it's not. Rather than making a hard reordering decision, I'm saying that we don't actually necessarily want to make a hard reordering decision, because we may have already memorized how to do the translation. And in some cases, for some noun-adjective pairs, we may have it memorized; for others, we might not. And so rather than trying to make a decision about whether to pre-reorder or not, let's just let the model make a decision when it comes to it -- let's let the whole pipeline do the inference rather than making hard decisions at each point. And then further, rather than saying which reorderings are good for a particular language pair, let's just consider all of them within a certain class, sort of this ITG-style reordering where you can permute the daughters of particular nodes but nothing else. So this doesn't capture all kinds of reorderings. Certain processes do extraction: for example, Chinese has wh-words in situ, whereas in English we move things like what and why to the front. This model can't capture that, because that's more than just a transformation of a node's daughters. >>: Then you are learning the weights for these reorderings? >>Chris Dyer: So each rule here that we see -- so you can think about this tree, which is an input, getting transformed so that each of these binary productions gets transformed into two different nodes with two different orders. Each order -- so VP now can be rewritten as NP followed by VBD, or VBD followed by NP -- each of these will have a different weight, and right now the weight is just determined by the order of the rule and what its left-hand side is. So that is what I'm learning as a latent variable. So each S prime is a string corresponding to a particular reordering hypothesis, and I'm training the model to maximize this probability -- the target sentence given the source sentence, summing over all possible reorderings. So this is very similar to what I did in the first part of the talk, where I'm modeling just the probability of the observed string pair, summing out this latent structure.
So if we find that, for example, switching verbs and their objects gives you a good account of the data and improves the probability of the corpus, then we learn -- >>: So [inaudible] equation, then the left side is just based on those -- what is on the right side? >>Chris Dyer: This is just a standard translation model. >>: So, which is phrases, or? >>Chris Dyer: Phrases. Phrases are the standard; I just threw the standard Philipp Koehn phrasal features in. And I should say, when I trained these features I didn't have a language model. I added the language model in sort of a second phase and then trained the whole thing with MERT so that I got good -- >>: [inaudible] to the right has a bunch of [inaudible] and the part to the left is just like [inaudible]-level features? >>Chris Dyer: Exactly. Yeah. So I collapsed all of this into a single top-level feature. Yeah, it was a single top-level feature. So the basic idea here is I didn't want to have to assume that somebody had annotated reordered sentences, for a couple of reasons. One is that's a really artificial task, and two, you know, in some cases we don't want to reorder, because the translation model knows how to translate something as a unit. So I did both. >>: But the people who are doing that, they actually like [inaudible]. >>Chris Dyer: Yeah, I guess so. I could've iterated it and then relearned the translation model. It all seemed kind of artificial. >>: But you haven't actually been able to compare against those systems -- >>Chris Dyer: You know, that's actually a good question. Nobody's ever brought that up. So I have run an SCFG model with Philipp's Chinese pre-reordering rules, and I did this and I know [inaudible] did this, and neither of us found that it helped on top of an SCFG system. So it helps with Moses, but basically the results are the same. And from what I remember, I don't think there was ever any comparison of the pre-reordering systems, or of a phrase-based system, to Hiero, so I don't know actually where that would slide in, whether it would be here, here, or here in terms of the scores. >>: Any hypothesis as to why [inaudible]? >>Chris Dyer: Yeah. First of all, Arabic parsing is terrible, and especially the span of the subject, which is exactly what you need to get in order to get the reordering of the verb right, is just off. And I actually don't quite understand why it's as hard as it is. Some people who really, really care about Arabic NLP have been working on it for a very long time, and the results are just, like, they're bad -- much worse than something like determining prepositional phrases in Chinese, which is what you have to target in Chinese. The other thing is that the trees are really, really flat, and so for this permutation-based model -- I also introduced some constraints so that if there are too many daughters of a node you only do some local permutations rather than the whole factorial number, just for tractability reasons -- and that may be hurting. But it's probably not that I'm leaving out good things; it probably means I'm just not getting good estimates of the reordering parameters. >>: You could, for example, consider other -- you don't have to constrain yourself, because the training doesn't care what permutations -- >>Chris Dyer: No. So the permutations -- so I do need to be able to compute.
I do need to be able to sum out the latent structure and compute certain expected values efficiently, and the dynamic programming algorithm that I was able to use assumed that there was a context-free structure on the input that could be intersected with a finite-state representation of the translation model. I think many transformations on trees still result in a context-free structure, so you could do more things, but as you add more and more to that source grammar, the inference problem of computing those expectations becomes harder. >>: [inaudible] [laughter] >>Chris Dyer: All right. Thanks. My apologies for going long, but I think the discussion was good. [applause]