>> Will Lewis: That's great. So it's clearly -- everybody really wants to be getting into the talks. It's been great. We've seen people already going around the posters there. So I'd like to welcome you, though, to the first -- what we've called the extended abstract session for SMT. And probably the best person to have been hosting this would have been Anoop Sarkar, my colleague from Simon Fraser. But since he's not here, you get stuck with me again. I know Anoop and I always get mistaken for each other, so my name tag is here so you just don't get too confused. So for the two presentations, the first one is going to be by Baskaran Sankaran, and he's going to be presenting a paper on incremental decoding for phrase-based statistical machine translation. So, Baskaran, it's all yours.
>> Baskaran Sankaran: Thank you.
[applause]
>> Baskaran Sankaran: Good morning, everyone. Thanks for coming to this talk. This is joint work with Ajeet Grewal and Anoop Sarkar, and I'm going to talk about the idea of incrementally decoding a sentence in phrase-based statistical machine translation. Just to give an outline of the talk: I'll first motivate the problem before going into the algorithmic details of incremental decoding, including some of the details about future cost computation and delay pruning. Then I'll present the evaluation results and give some concluding remarks.
So, the problem can be motivated easily. It's what I call TAYT, or Translations As You Type. What I essentially mean is that I want to get translations as the sentence is being typed by the user, for every word, instead of waiting for the entire sentence to be input before the translation is done. So I want translations to be generated for every word. Thinking about this, we can consider two approaches, which are quite obvious. One is to re-decode every time a new word is added: the first new word comes in and you decode for that word, and when you get the second word, you decode again from scratch. A slightly trickier thing would be to reuse the partial translations that were generated earlier and just incrementally decode for the new word. My talk is going to focus on this; this is what we call incremental decoding. If we think about applications, it could be used in realtime chat or language learning. And interestingly it's already available in Google Translate and possibly some other systems as well. However, I want to mention that though it's available in Google, it's still not clear what type of algorithm they use, and to the best of our knowledge, this is the first work that presents this idea.
So this is our objective. We want to get this incremental decoder implemented, and it uses the partial decoder states, which we hope will be faster than re-decoding from scratch every time you get a new word. And one very significant problem is search error, because we haven't seen the entire sentence; we're just dealing with a partial sentence at every step, and there are issues in computing the cost. So how do we reduce or minimize the search error? That's one interesting research question here. So before going into incremental decoding, this is a very brief overview of regular beam search decoding as it is done by Moses.
So given a Spanish sentence like this, the beam search decoder translates an increasing number of source words at every step. At the first step it translates one word -- I mean, different single words in the source language -- and at the second step it extends the translations that were generated earlier and translates additional words. So this is how it is done. And at every step it keeps track of two costs: one is the current cost and the other is the future cost. The current cost is the cost of the translation generated so far; let's say for the first one, the cost of translating the word Maria as Mary, that's going to be the current cost. And the future cost is going to be the cost of translating the rest of the words in the sentence.
A bit more detail about the future cost. The future cost is computed for all the phrases in the sentence, and then it's aggregated over all possible sequences in the sentence using dynamic programming. Typically people do it as a preprocessing step, even before the decoding is started.
So these are some of the challenges in incremental decoding: how to reuse the partial decoder states efficiently. And the next question is that the sentence is not complete yet, so how do we compute the future costs? Then we started posing the question of whether we need the future cost at all in order to decode efficiently. And the other problem, as I mentioned, is the search errors, which of course are due to pruning. Because in a practical setup like realtime chat, you really want to keep a smaller beam size, and in addition to that you only have partial future costs. So some of the candidate hypotheses that are generated by the translator could be pruned much earlier, leading to search errors. So this is, again, an important problem.
So this is roughly the outline of the incremental decoding algorithm that we use. For every new word w_i that's added, we get the new phrases that could possibly be generated for this word and get the corresponding translation options. It should be noted that we are not trying to get the phrases and translation options for the words that were already typed in; those are already there as part of the partial decoder state. Then, for this set of new phrases, we compute the partial future costs and do a limited update of the future costs in the partial decoder state. And then we trace through all the bins, beginning from the first one until the current one, and in this process we create new hypotheses and extend hypotheses from the previous bins. This is just like the regular beam search, so I'm not going to go through the details there. When it reaches the current bin, it just returns the best translation for that bin.
So in action it works this way. It gets the word "having" first, which is the first word, gets the corresponding translation options, and creates hypotheses in the first bin covering one word. Then it gets the second word and its translation options, creates a new bin for phrases of length two, and updates the earlier bins. The process goes on like this until it reaches the last bin, at which point it will have generated the target sentence. Here we are not computing the full future cost; instead we are just computing the partial future cost.
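A toy sketch of this translate-as-you-type loop, simplified to monotone, word-by-word translation over a made-up phrase table; every name and data structure here is illustrative only and is not the authors' actual Java decoder, which works over coverage-vector bins with reordering and cube pruning.

```python
# Toy phrase table: source word -> list of (target, cost), costs as -log probs.
PHRASE_TABLE = {
    "maria": [("mary", 0.1), ("maria", 0.5)],
    "no":    [("not", 0.2), ("no", 0.3), ("did not", 0.4)],
    "daba":  [("give", 0.6), ("gave", 0.3)],
}

class PartialState:
    """Partial decoder state kept between keystrokes."""
    def __init__(self):
        self.words = []             # source words typed so far
        self.future = []            # cheapest-option cost per source word
        self.bins = [[("", 0.0)]]   # bins[i] = hypotheses covering i words

def add_word(state, word, beam_size=3):
    """Incrementally extend the saved state after one more word is typed."""
    options = PHRASE_TABLE.get(word, [(word, 5.0)])   # pass unknowns through
    state.words.append(word)
    # Partial future cost: only the new word's estimate needs to be computed
    # (unused in this monotone toy, but it guides pruning in a real decoder).
    state.future.append(min(cost for _, cost in options))

    # Reuse the previous bin and extend it with the new word's options.
    new_bin = [((prefix + " " + tgt).strip(), score + cost)
               for prefix, score in state.bins[-1]
               for tgt, cost in options]
    new_bin.sort(key=lambda hyp: hyp[1])      # beam pruning: keep the cheapest
    state.bins.append(new_bin[:beam_size])
    return state.bins[-1][0][0]               # best translation so far

state = PartialState()
for w in ["maria", "no", "daba"]:
    print(w, "->", add_word(state, w))
```

In the full system, each new word also triggers the re-traversal of the earlier bins described above, since with reordering the new word's phrases can extend hypotheses anywhere in the search space.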
So we are just re-estimating the partial future cost for all the sequences and updating the decoder state. And we found that just doing this is much faster than computing the future cost again from scratch, as we'll show later in the results.
So that's the interesting problem I mentioned earlier: pruning. We have cube pruning implemented as part of the decoder, so the actual pruning happens in the cube pruning step. We have set the beam size and the cube pruning pop limit to be the same, so essentially the cube pruning pop limit will just pop out as many translation hypotheses as specified, which is going to be a small number. And when this happens, all the other hypotheses that have poorer scores than the popped-out hypotheses are going to be pruned or just ignored. As we'll see later, this pruning is risky, and it will get even worse with a smaller beam size; the issue is partially due to the lack of the full future cost. We make this key observation: the best candidate could have been pruned already. This is a key observation for us.
So when we started looking for solutions to this and started scratching our heads, we came across this. It's a story: once upon a time there was a race between a tortoise and a hare, and we all know the rest of the story. Looking at this, we realized, okay, possibly we can draw some analogy from this. And when we tried to do so, we found that we have a few hares running with a large number of tortoises. What are hares? Hares are the hypotheses that are passed by the cube pruning step. And what are tortoises? Those are the poorer-scoring hypotheses -- I mean, poor scores compared to the hares, not necessarily very poor scores. And these poor scores could just be at a particular decoding state; things could improve as new words are added at a later time. I would like to stress this point: when we compare the tortoises and hares, the comparison is in terms of the translation score and not the speed.
So every story has a moral, including ours. Here we think that a candidate translation with a poor score at this stage might beat a better candidate at a later stage, when more words are added to the sentence. Or, to put it differently, we don't want to discard the candidates, at least not yet. With this understanding, we introduce the notion of delay pruning. As the name suggests, the goal is to delay the pruning and give the candidate hypotheses a chance to improve. This is our goal, and at the same time we have the constraint not to let the search space grow to an explosive size. We achieve this by a two-level strategy. At the first level we delay the pruning and selectively retain some tortoise hypotheses for every bin; I'll explain how we do that on the next slide. And at the second stage we prune the tortoises that do not improve beyond a set limit.
So, the delay phase. In every bin we have different sets of hypotheses that cover different sets of source words, which we call coverage vectors. We first identify all the coverage vectors that are not represented in the bin at the end of the cube pruning step, and for every such unrepresented coverage vector we generate a small number of hypotheses, the tortoises. Typically we generate up to three tortoises for every coverage vector.
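A rough sketch of how this two-level delay pruning might look in code, covering the delay phase just described and the race-course prune step that comes next in the talk. The Hypothesis fields, thresholds, and helper names are assumptions, and the tortoises are modeled here as candidates rescued from the cube-pruning leftovers rather than freshly generated hypotheses.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    coverage: frozenset       # indices of the source words covered so far
    score: float              # overall model score (log scale)
    lm_score: float           # language model component (log scale)
    length: int               # target words generated, for LM normalization
    is_tortoise: bool = False
    tortoise_since: int = -1  # bin index at which it was flagged

def delay_phase(bin_survivors, pruned_pool, bin_index,
                max_per_coverage=3, lm_threshold=-2.0):
    """After cube pruning fills a bin, rescue a few 'tortoises' for coverage
    vectors that are no longer represented among the survivors."""
    represented = {h.coverage for h in bin_survivors}
    rescued = []
    for hyp in sorted(pruned_pool, key=lambda h: -h.score):
        if hyp.coverage in represented:
            continue
        # Keep only tortoises whose normalized (per-word) LM score clears the
        # threshold (around -2 to -2.5 in log scale according to the talk).
        if hyp.lm_score / max(hyp.length, 1) < lm_threshold:
            continue
        if sum(t.coverage == hyp.coverage for t in rescued) < max_per_coverage:
            hyp.is_tortoise, hyp.tortoise_since = True, bin_index
            rescued.append(hyp)
    return bin_survivors + rescued

def prune_phase(hypotheses, bin_index, beam_worst_score, race_course_limit=5):
    """Unflag tortoises that now beat the worst hare in the beam; drop the
    ones that have not broken in within the race-course limit."""
    kept = []
    for h in hypotheses:
        if h.is_tortoise and h.score > beam_worst_score:
            h.is_tortoise = False                       # merged into the beam
        if h.is_tortoise and bin_index - h.tortoise_since > race_course_limit:
            continue                                    # out of race course
        kept.append(h)
    return kept
```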
And then we compute the normalized language model score, use a threshold value for it, and only retain those hypotheses that have a better normalized LM score than the threshold. We typically found values of minus 2 to minus 2.5 in log scale to work best, based on a small dev set. Once we have identified all these tortoises, they are flagged as tortoises, and at each subsequent bin we expand them minimally.
Then, coming to the prune step: giving the candidate hypotheses a chance to improve is a good thing, but at the same time we don't want to let it go on forever, in order to contain the search space. So we prune the tortoises beyond what we call a race course limit. The race course limit is nothing but the chance given to a tortoise hypothesis to improve, beat some of the hares, and break into the cube pruning step. We define this in terms of the number of bins that a tortoise has in order to break into the beam. We actually experimented with different values of the race course limit, from 3 to 7, and I'll show some interesting results later. One important observation is that downstream some of the tortoises might beat the hares, and when that happens, these tortoises are unflagged and merged into the beam.
This slide highlights the differences between the incremental decoder and the regular beam search decoder. Number one, we use partial future costs: the future costs are computed only for the new phrases and are updated in the decoder state. Number two, existing hypotheses are reused. And the third important thing is the introduction of delay pruning.
Now, moving on to the evaluation, we implemented an in-house decoder in Java, which includes cube pruning and lazy cube pruning. We use seven standard phrase-based features. Our decoder supports both regular beam search and incremental decoding, so we can run it in either mode. We used French-English and English-French for evaluating this, trained on WMT-07 data, using Moses for training and MERT for optimizing the weights. We did five different experiments. One is that we wanted to benchmark our decoder against Moses in order to compare them in regular beam search. Secondly, we wanted to use our decoder to compare incremental decoding with the idea of re-decoding from scratch and show that incremental decoding is better in terms of translation quality and speed. And the third thing is that we also experimented with different race course limits.
Here the first two rows give the comparison of our decoder and Moses in the regular beam search setting. This result is for French-English, and the next page has English-French. We found our decoder to have a slightly higher score, which we believe indicates that it compares favorably to Moses. The last three lines correspond to the incremental decoder being compared with re-decoding from scratch. As you can see, when we use the strategy of re-decoding from scratch, it has a Bleu of 26.96, but when we use the incremental decoder, the Bleu actually improves slightly. However, we believe that this still has search errors. Then we tested the incremental decoder with delay pruning, and we found the Bleu to increase significantly, to a level comparable to this.
Just to mention, these three use a beam size of 3, whereas this one has a higher beam size; that's why you see the difference in the scores. The results are almost the same for English-French, except that Moses has a slight advantage here.
And though Bleu indicates the advantage of using incremental decoding, we wanted to quantify the search errors. So we tried to compute the mean squared error between incremental decoding and regular decoding and see whether that indicates something. What we did is this: for the two settings of incremental decoding, without and with delay pruning, we got the scores of the top hypotheses, and we also got the scores of the top hypotheses using regular beam search. Then we computed the mean squared error between these two sets -- beam search versus the first setting and beam search versus the second -- and these are the numbers. As you can see, the search error for the case with delay pruning is significantly lower than for the case without delay pruning.
We also experimented on the speed of this. These are the machine details that we used. We were interested in two things: one is the average time spent on decoding for every additional word, and the second factor is the average number of hypotheses that were generated for every bin. When we did this, we found that re-decoding from scratch is much slower, by about a factor of nine, compared with incremental decoding. However, when you compare the average number of hypotheses generated per bin, delay pruning resulted in a larger number of hypotheses being generated. But despite having a larger number of hypotheses, it could still achieve a higher speed than re-decoding.
We also experimented with different race course limits. For values from 3 to 7, we see slight differences in the Bleu score in both French-English and English-French; they are not statistically significant. However, what we also found is that the average time to decode increases with the race course limit. This is obvious, because as you increase the race course limit, you increase the average number of hypotheses that are retained in every bin. As you can see, it shows an increasing trend here. We were trying to analyze why increasing the race course limit doesn't increase the Bleu, and we believe that this is possibly due to the lack of long-distance reordering between English and French. Between English and French, whatever reordering we have is only local reordering, like the noun and the adjective being swapped into adjective and noun, and we could easily capture that with a race course limit of 3. So a higher race course limit of 7 doesn't actually help. That's what we found.
Just to conclude, we presented an incremental decoding algorithm which we believe is more effective in a realtime scenario than re-decoding from scratch. We introduced the idea of delay pruning, which we believe is a novel method that can adapt to the text while keeping a fixed, small beam size. We also believe that delay pruning could possibly be used in regular Viterbi decoding, in the regular beam search, and this could potentially be helpful when it's integrated with a cube pruning step, though we haven't done that yet. That's it. Thanks.
[applause]
>> Will Lewis: Thanks, Baskaran.
We have ten minutes for questions here. So I see we have a few already. So go for it.
>> So you mentioned you allow limited expansion of the [inaudible]. Can you elaborate [inaudible].
>> Baskaran Sankaran: Okay. As I said, we select up to three tortoise hypotheses for every unrepresented coverage vector. And for those tortoises, we compute the normalized LM score and then do a further selection by removing the tortoises that have a poor normalized LM score. We compute the normalized language model score up to that point and use a threshold of minus 2. I think we used minus 2 in all our experiments to further limit the number of tortoises.
>> Will Lewis: We have microphones going around, so just raise your hand and you will get [inaudible] if you have a question. Hold your hand high.
>> Thanks, Baskaran, for the interesting talk. I wonder -- so are you using a fixed distortion limit or reordering limit during decoding?
>> Baskaran Sankaran: That's right. We are using a fixed distortion limit.
>> A fixed one. Do you think that it's possible to bound the number of bins that you need to throw away based on your distortion limit?
>> Baskaran Sankaran: The number of hypotheses?
>> The number of hypotheses that you need to recompute, given a fixed distortion limit. Do you understand what I'm saying? So if we assume that every hypothesis has to jump to the end of the source sentence at the end, then we can bound the number of bins that need to be recomputed. Do you understand what I'm saying?
>> Baskaran Sankaran: Not so --
>> So if we think about it in a monotone -- have you thought about doing incremental decoding in a monotone scenario where no reordering is allowed?
>> Baskaran Sankaran: Oh, okay, okay.
>> Then basically you can completely reuse all your previous hypotheses, right? Because future cost estimates are actually not necessary. And so there's a generalization. I was just wondering if you thought about trying to exploit your distortion limit to bound the number of hypotheses that would be affected by adding an additional word.
>> Baskaran Sankaran: No. But [inaudible] use of a fixed distortion limit, what we do is, for each of the hypotheses, before they are extended, we look ahead and check whether they can be expanded further, because they may not be able to expand further due to the distortion limit. So we do some pruning -- I wouldn't call it pruning, but checking. And if we find that they cannot be expanded at a later point, we don't expand them; I mean, we don't create new hypotheses out of them.
>> Baskaran, trying to understand the contrast between your method, with delay pruning and partial future costs, versus doing it from scratch every single time: I'm trying to understand if there are factors involved -- a threshold or a tipping point beyond which your method is advantageous over doing everything from scratch, and on the other side of which it's not advantageous yet. I'm suspecting sentence length might be one.
>> Baskaran Sankaran: That's right.
>> Does your race course limit -- does it have anything to do with sentence length?
>> Baskaran Sankaran: No, no. But actually we found that it doesn't help us to store partial decoder states beyond a set endpoint, because they are never going to be reused again.
So we try to limit the partial decoder states to between 3 and 15 words, and beyond that we just decode for the rest of the words. So if you have 16 words, it would just be incrementally decoded for one more word, and if you have 17, the incremental decoding will start from word 16.
>> 16 [inaudible].
>> Baskaran Sankaran: Yeah, that's right. But for these experiments, we stored all the partial decoder states.
>> Okay. And the other factor -- you made a brief allusion to that along the way -- might be the contrasting nature of dependencies in the sentence between the source language and the target language. And on the lighter side, one of the classic examples I came across as a student was dependencies in a complex sentence like "John saw Peter help Mary swim," where, you know, if you're translating it into German, it is one of those push-onto-the-stack and pop-off-the-stack kinds of relations for the verbs. And in Dutch it's even more perverse; it's cross-serial dependencies. So I'm thinking that, you know, as you translate word by word from English to Dutch, let's say, it may be that your approach [inaudible] of retaining partial work as far as possible might prove to be superior.
>> Baskaran Sankaran: Yeah, that's an interesting question. And that's one thing we haven't looked at in this work: what will happen if there's a lot of reordering happening.
>> In the simplest of terms, the ordering of the modifier and the modified, between "mathematics test" and "test of mathematics," is a very, very simple kind of thing.
>> Baskaran Sankaran: That's right. I understand that.
>> But between nouns and verbs, and in complex and embedded sentences and so on, it's a higher order of complexity. So it seemed that the contrast between languages might be relevant.
>> Baskaran Sankaran: That's right. But one thing that we could do then is to play with the distortion limit and maybe have a higher distortion limit. The other thing would be to have reordering models included; right now our model doesn't use any reordering model. So that might be one possible way to handle that.
>> Will Lewis: Great. Thanks. We have another question. We have time for one more question. All the way over here. Josepe [phonetic].
>> So for people like myself who are not expert in machine translation, to understand your results, what is the range for Bleu and how good is .27?
>> Baskaran Sankaran: Okay. So there are different variations of Bleu that you could compute, like with casing and without casing. All the numbers were on detokenized data and then recased; that's why you probably see slightly lower numbers. But they're comparable across our decoder and Moses. And I guess for French-English, the systems at WMT, I think, if I'm right, had about a 30 Bleu score, 30 or 31.
>> [inaudible]
>> Baskaran Sankaran: Yeah. Oh, yeah. That's -- yeah.
>> [inaudible]
>> Baskaran Sankaran: The max is 100.
>> Just because you don't match the human reference translations doesn't mean --
>> Baskaran Sankaran: Yeah, it doesn't mean [inaudible].
>> Will Lewis: So this gives us a discussion point for the break; we can also discuss measurements and evaluation schemes. I'd like to thank Baskaran one more time. Thank you very much.
[applause]
>> Baskaran Sankaran: Thank you.
>> Will Lewis: Okay. Next, my pleasure and with -- oh. Wow. This worked wonderfully.
This is great. Thank you very much to the people that set this up. I'm pleased to be able to introduce Ann Clifton, who's going to be talking with us about morphology generation for statistical machine translation. So thanks very much, Ann.
>> Ann Clifton: Hi. I'm Ann. And I'm going to be talking, as Fred said, about some work that I've been doing with professor Anoop Sarkar at SFU on morphology generation for SMT. And which one am I -- okay.
So to give you an idea of what's motivating this task, I have this little example taken from Turkish, which shows how, by packing a vast amount of morphology onto a single stem, they are able to convey something that takes the better part of an entire sentence to convey in English. As you might imagine, this leads to some pretty interesting challenges for statistical machine translation; in particular, morphologically complex languages tend to exhibit a lot more data sparsity. In parallel texts, say, you can imagine that if you have a lexicon of productive stems and morphology, this is going to give rise to a greater overall vocabulary of surface forms, each of which will appear less frequently in any given text than in, say, English. This also leads to a great deal of source-target asymmetry. An example that a lot of you might be familiar with is how a morphologically complex language with, say, a case system would use that to express something that in English we would use a prepositional phrase for; we'd express it lexically. And then there's also just a general lack of resources for morphologically complex languages, which has to do with the fact that they're relatively unstudied in comparison to, say, English or one of its related languages.
So there are a few major camps of approaches that people have taken to introducing morphological awareness into MT models. One of these has to do with preprocessing and using the morphological information in the translation model in order to increase source-target symmetry. This has been done for a number of language pairs. Some approaches strip the morphologically complex language of some of the morphology in order to increase symmetry, some artificially enrich the poorer language, and then some others retain segmentation aspects in the translation itself. It's worth noting here, though, that a majority of this type of work has been done on translation from the complex language into English, and as you might imagine, that's a substantially different task than going in the opposite direction. Another major approach is to perform the translation on a stemmed version of the original text, which is an easier task, and then to do the morphology generation in post-processing. And people have gotten some good results with that, actually here at MSR, for some language pairs such as English into Arabic, Russian and Japanese, using maximum entropy Markov models. The idea here is that you can capture dependencies between the morphemes, as opposed to just words, in a way that's difficult to do in, say, just a straightforward language model. The last approach that I'd like to mention here is using factored translation models, which in the event of an unseen word are able to back off to a more generalized form, say a stem plus morphological tags or some other piece of morphological information.
So we incorporated elements of these different approaches, and then we took as our jumping-off point the Europarl dataset.
And so what you're seeing here is how the Bleu scores, the translation performance measurement, correlate with vocabulary size. This is all for an approximately 1-million-sentence parallel training set, but the different languages have vastly different vocabulary sizes. So you see over here, French and Spanish, how the Bleu scores are relatively high with relatively small vocabulary sizes, and then way out here we have Finnish, which performs the most poorly of the bunch and has far and away the biggest vocabulary size, at roughly five times that of English for the same parallel text. And to take a closer look at that here, this is split by translation direction. So you can see that on average, translation into English does far better than translation from English into another language, and the worst of these is Finnish. That's with that same 1-million-sentence parallel corpus and a 2,000-sentence development set on which the model weights were tuned, and then another held-out 2,000-sentence test set. So naturally we decided to pick Finnish as our target language.
So a bit about Finnish. Here's an example up here of some of the types of morphological information that you can pack onto a stem; in this case they manage to get in one word something that we need a whole sentential question to express in English. It's a highly synthetic language displaying both fusional and agglutinative properties, and it's got consonant gradation and vowel harmony phenomena. What this adds up to is that while you have a great deal of morphology that you can pack onto a single stem, straightforward segmentation of this morphology is pretty difficult.
So what we started out with was training a baseline word-based system and also doing a segmentation of the whole training and dev corpus and looking at the types of errors that we were getting, the idea being that if we could devise a way of incorporating morphological awareness into our model in a way that could handle this type of complexity, then we could perhaps come up with something that would be robust to a great many different types of morphological complexity while staying minimally language specific. A couple of things that we saw were that words missed by the MT system had available segmentations, and some common error types; I'm going to give a little more detail on this. This table up here shows that about half of the words from the original development set reference text were missed in the translation output, and of the words that were missed, a majority of them had available segmentations, so we wanted to make use of that. And then down here are some examples of the types of mistakes that the MT system was making, wherein specifically it would get the stem right and the suffix wrong.
So a bit about the segmentation that we did. We actually performed two types of segmentation: an unsupervised and a supervised version. For the supervised version we used a finite state transducer-based morphological analyzer that was based on a precompiled lexicon from an annotated corpus. As you can imagine, that's going to have some coverage issues, although high accuracy. The unsupervised segmentation that we used was from Morfessor, which is not at all linguistically motivated in its segmentation. So we took the segmentations, and for each of them we trained two different MT systems.
We did a full segmentation model for each, wherein we retained the entire segmentation and treated all of the morphemes as separate words, and we inserted word-internal morpheme boundary markers so as to be able to restitch the morphemes back together in the output. And then we also did a stems-only model, where we discarded all the suffixes and just did translation on stems.
So here's the general overview of our system. We started out by submitting the input to the morphological analyzer of either type, and then we would take this analysis and train the MT systems on that. For the segmented systems, the output would look just the same as what we saw coming in on the other side, and so it would be quite a straightforward matter to put the morphemes back together again and do evaluation against our original word-based reference text. For the stemmed output, it's a little different, because we then have to figure out how to generate the morphology. So what we did here is we fed the stemmed MT output to a conditional random field inflection tag generation model: for each stem, together with its POS tag, it would output a prediction of a morpheme tag sequence. And then we still aren't home-free yet, because there's still the matter of coming up with the actual corresponding surface form, which is a nontrivial task. For this we used a language model to disambiguate between the possible surface forms that can correspond to a stem and morpheme tag sequence. The idea here was that while the CRF model is trained on supervised data, of which there's less, our language model can take advantage of the relatively large amounts of unannotated training data for the disambiguation purposes. And so, dare to dream, what we get on the other side is the original surface form back.
So a bit more about conditional random fields. The way this works is that it selects the best tag sequence based on probabilities which are just the product of some binary features over the current label, the previous label, and the observation sequence, along with the model weights. And this of course comes with a normalization term, which for the CRF is actually a global normalization term. F here is a vector of these binary features, and this down here is an example of the types of features that would be used in this model; you can see how it can compare the current prediction's case value to the previous one, and that gives you an idea of how this type of model might be able to capture agreement phenomena. The motivation for going this route was that it could capture the long-distance dependencies between morphemes. It's globally normalized rather than normalized on a per-state basis. And it allows us to select our prediction sequence from all of the tag sequences available in the training data as possible outputs. So we had hoped that this would make it more robust towards handling new surface forms that are the result of productive morphology and haven't actually been seen before in the training data.
We used CRFSGD, which is an implementation that's optimized with stochastic gradient descent. We trained it on overlapping lexical and morphological features, such as, for the stems, unigrams and bigrams of the previous and next stem and POS. And then for the prediction features we used case and number features and features of the previous labels.
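For reference, the globally normalized linear-chain CRF being described can be written as below; the notation is generic rather than taken from the slide, with F a vector of binary feature functions over the previous label, the current label, and the observation sequence, the lambdas the model weights, and Z(x) the global normalizer.

```latex
% Globally normalized linear-chain CRF over a tag sequence y given observations x.
p(\mathbf{y} \mid \mathbf{x}) \;=\;
  \frac{1}{Z(\mathbf{x})}
  \exp\!\Bigg(\sum_{t=1}^{T}
      \boldsymbol{\lambda}\cdot\mathbf{F}(y_{t-1}, y_t, \mathbf{x}, t)\Bigg),
\qquad
Z(\mathbf{x}) \;=\;
  \sum_{\mathbf{y}'} \exp\!\Bigg(\sum_{t=1}^{T}
      \boldsymbol{\lambda}\cdot\mathbf{F}(y'_{t-1}, y'_t, \mathbf{x}, t)\Bigg).

% An illustrative binary agreement feature of the kind mentioned on the slide:
% it fires when the predicted case value matches the previous label's case value.
f_k(y_{t-1}, y_t, \mathbf{x}, t) \;=\;
  \mathbb{1}\big[\mathrm{case}(y_t) = \mathrm{case}(y_{t-1})\big].
```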
So the accuracy rate that we got was 79.18 percent, which was definitely on par with similar tasks that had been carried out on different language pairs with the same monolingual features. So we decided that this seemed like a reasonable jumping-off point to introduce into our MT system -- because this was, of course, on clean data -- to see how it interacted with the ostensibly noisier output of an MT system.
So this here compares the output of all of our different systems. The first is just from the original Europarl dataset, and our baseline does significantly better than that, which we just attribute to having a better, newer decoder. But you can see that the best result that we got was for the unsupervised segmentation model which retained the morphemes in decoding, and the supervised segmentation model did a bit better than the baseline, but not to a statistically significant extent. However, our supervised stem model with the CRF for morphology generation did quite poorly. Still, we applied it as well to the supervised segmentation model's output, just to see whether it could go through and catch any hanging stems that had been missed by that model. But that didn't really make any difference, just because that model didn't output enough hanging stems missing morphology to make a difference.
We include this here to evaluate the comparison on segmented data; this is before restitching the morphemes back together. And to make a comparison, we actually took the word-based baseline output and then segmented it. And we include this over here in the last column: that's the Bleu scores without considering unigrams, because we wanted to make sure that any increase that we were seeing in scores wasn't just due to getting more unigrams right, which there would be more of anyway. The other reason that we wanted to include this is that it seems like the Bleu evaluation method -- it's funny, we were just talking about this -- is really not particularly apt for MT with morphologically complex languages in the first place, since you can have basically a sentential word with a whole lot of morphemes on it, and if you get five of them right but one wrong, then you get zero credit from Bleu. So we decided to go ahead and include that there.
So the conclusions that we draw from this are that morphological information does seem to be helpful in improving translation, and that CRFs can be an accurate way of predicting the tag sequences, but we need a better way of constructing the surface form. To that end, we're adding bilingual, syntactic and lexical features of the aligned corpora into the CRF. We're also looking at moving the actual surface form prediction into the CRF, and looking at automatic classification methods to cluster the unsupervised segmentations so that we can use them with the CRF, so as to have a more general model. We're also currently running some factored translation models, to have access simultaneously to both word and morphological knowledge, and using lattice-based models, and also discontinuous phrase-based models, in order to be able to capture the long-distance dependencies between morphemes in the decoder itself and not just in post-processing. So that wraps it up. Thank you.
[applause]
>> Will Lewis: Thank you very much, Ann. Very good presentation. So we have ten minutes for questions.
And you're closest to the mic.
>> Yes. Well, I don't know if I need to stand up. But could you go back to Slide 19, I think. Could you explain again -- I think I just missed how you get from the predicted tags to the surface form.
>> Ann Clifton: Yeah. So basically we gathered all of the surface forms that a stem can occur with in the training data -- which was actually a 1.3-million corpus of unannotated data -- and then we used a five-gram language model from the training data to pick the most likely surface form for that context.
>> So on the second results that you showed, where you leave it segmented and run Bleu, the reference is segmented with a supervised model?
>> Ann Clifton: No, the reference is segmented, in the case of the supervised model, by the supervised model, and in the case of the unsupervised, by the unsupervised.
>> By Morfessor.
>> Ann Clifton: Yes.
>> I see. Okay. The one thing I would say about that is that -- well, I don't know how to do this, but basically because you would presumably have fairly frequent suffix morphemes in there, and Bleu may or may not be penalizing them so much -- you're getting some credit for getting a morpheme right whether or not it's necessarily used with the right root or something like that. I actually don't know how many tokens you have in your utterances or things like that, but I wonder if there would be some way of building string edit distance into that, or maybe Bleu in various forms [inaudible] up to four-grams or something like that would capture enough of that. Do you have a sense of whether that would be worthwhile?
>> Ann Clifton: That was the idea of leaving out the unigrams. So that score was only measuring two-, three-, and four-grams.
>> Yeah, this is a nice talk. First of all, just one side comment: I don't know if you're aware of the work that [inaudible] has done on Turkish speech recognition, which actually has rather similar conclusions to what you're getting. But I do have one question. Maybe I just missed something here. When you talk about stems, I'm not sure exactly what you're referring to and how much you strip off, but obviously in a language like Finnish or Turkish, the affixes can encode information which is to varying degrees more or less contentful. So in the case of a case affix, you might expect to be able to reconstruct that from the fact that you know this is functioning as, say, [inaudible] case, for example, that it's functioning as a direction or something; you could sort of figure that out. In a case where you have some sort of volitional marker on the verb, that's much harder to reconstruct. So I'm just wondering -- it's an informational question -- what do you actually strip off, and, B, if you're stripping off a lot of stuff, maybe there's a difference between the different classes of things you're actually removing and would therefore have to reconstruct, and whether you've thought about that issue.
>> Ann Clifton: Yeah, we definitely thought about that. And the guiding principle that I was keeping in mind is that I was trying to keep from getting too language specific here. So whether it was from the supervised or the unsupervised segmentation model, I was trying to use that as just giving me a place to chop.
Because if I got too much into looking at, say, the syntactic function of any particular morpheme, or its lexical correlate in another language, then I felt like I would be making a more specialized tool than was really my goal.
>> Well, just one suggestion. I mean -- this is on? Yeah. Just one suggestion. You know, this may obviously not work, but you could consider using cross-language information in that case, so which things actually get expressed by a separate word in the English. So you would know, say, that the auxiliary verb "would" would presumably correspond to a particular suffix in Finnish, whereas something like the accusative case marker in Finnish would correspond in general to nothing in English. So you might be able to use cross-language information to see which things might be important.
>> Ann Clifton: So I'm not sure if this is what you're talking about, but I mentioned in the last slide that we're currently adding in syntactic features, syntactic bilingual features, so using it -- yeah, okay.
>> In the case where you're restitching morphemes, since you're treating the morphemes -- the affixes -- as individual words, they can be reordered and moved around, just generally go wandering. So what do you do when you restitch things? Is there something special you're doing in that box there to make sure that you end up with real words and not just [inaudible] together?
>> Ann Clifton: The only morphemes that we're restitching are where there's a context in which you have two adjacent word-internal morpheme boundary markers, and that's the only time that we restitch something into words. Any other time, if we get it wrong, then our score gets penalized.
>> So you just leave the morpheme alone in the output, you don't delete it?
>> Ann Clifton: No, we don't delete it.
>> I noticed that in your morphological tags you have a part of speech marker, which typically doesn't actually give rise to a morpheme. And unless you had a whole lot of part-of-speech class ambiguity in Finnish, that information is going to be very redundant; it's basically a property of the stem. So I'm wondering if that might actually be artificially inflating your morphological tag score without actually helping you in the output.
>> Ann Clifton: Well, that's possible. The reason that we did it that way and left the part of speech tag as a feature of the stem is that we couldn't afford to retain all of the inflectional categories in the prediction. There are certain categories that apply not just to nouns but also to verbs, and so by retaining the part of speech, it was able to form a sort of complementary distribution between the tags that it would predict, where it could have been ambiguous otherwise. So I'm not sure if that's artificially inflating the accuracy scores or whether it's just kind of a shortcut, a means of efficiency.
>> Will Lewis: Okay. I think we have time for one more question at the back here.
>> Well, it was answered by -- so if there's a really -- Bob, you haven't had a chance yet, so...
>> Yeah. I had a little trouble relating your results slide to this slide, and in particular I'm interested in which of the lines in your results tables corresponds to this restitching of the segmented output.
>> Ann Clifton: This one?
>> Yeah. Yeah. Can you show me in the results slide?
>> Ann Clifton: Oh, yeah, yeah. That would be the one in bold.
>> Okay. So that's the one that performed the best?
>> Ann Clifton: Um-hmm.
>> Will Lewis: Quick follow-up here, since you've been very patiently waiting.
>> Thank you. So what we found sometimes when we use unsupervised segmenters like Morfessor, or others like Paramor [phonetic], is that they tend to undersegment relative to supervised models, ones that have actually been built with hand-crafted transducers. Do you have a sense of which one's oversegmenting and which one's undersegmenting, and if you chose a different operating point and asked the unsupervised one to chop a little bit more, do you think that would make a difference? Do you have an intuition about that?
>> Ann Clifton: My intuition about that is that with Morfessor, anyway, by controlling the perplexity threshold you can control the degree to which it segments. And I found actually that when I had it segmenting more, it would overfit to the dev set to the point where it performed far more poorly. So we ended up going with quite a conservative segmentation.
>> Will Lewis: Great. Well, thanks very much. And thanks, everyone, for a very engaging question session there.
[applause]