>> Will Lewis: Well, welcome everyone. I'd like to introduce our guest speaker today, Philipp Koehn from the University of Edinburgh. He is a professor in the School of Informatics there and holds the chair of machine translation. He got his PhD in machine translation from USC; his advisor was Kevin Knight. I consider him a leading light in the MT field. He is best known for the open source decoder Moses, which is widely used within the research community as well as in commercial enterprises. It's basically a go-to for a lot of research and commercial work. Philipp directs a fairly large and productive MT team at the University of Edinburgh and has graduated a number of students over the past few years, two of whom I think are coming today. I don't see either one of them, but they said they were coming [laughter], Michael and Avachek [phonetic]. They'll probably be here shortly. He leads or plays a principal role in a number of EU funded research projects as well as two DARPA funded projects. Actually, he and I were talking at AMTA; I was asking about some of his recent awards. Those of you who have actually applied for grants know that if you get one out of six funded you are doing really well, and this year alone Philipp has managed to get three out of five, which is a phenomenal record. The only drawback of that is you end up having all of the management responsibilities of having three grants funded when you were hoping for one. Philipp also manages the statMT.org site, which a number of people in the MT field have gone to to get data or information about MT. He also runs, with Chris [inaudible] and a number of other people, the Workshop on Machine Translation, which is co-located with ACL or EMNLP every year. Someone actually calls the WMT the Grammys of the MT world. I think you called it that [laughter]. 
But anyway, so we welcome the Grammy winner of the MT world, Philipp Koehn. >> Philipp Koehn: Okay. I don't have as many jokes as at the Grammys [laughter]. I want to give a bit of a talk about work we have done in computer-aided translation. So there's a bit of a shift in our research where we finally look at who's actually using machine translation. I have way too many slides, so I'm not going to get through the talk anyway. Please interrupt me as much as you can. I tried to give a version of this talk at the ISIM [phonetic] and we got about one third through; I would consider that a success. So please ask me any questions about this work. As Will already alluded to, this was also the basis for some of the funding we asked for from the European Union, so we have two research projects on this very topic. That's one of those things: you would've hoped to have one research project, but now I have two. No, no, not just on this particular topic, the other stuff too; this was just the start of it [laughter]. We committed to build an open source toolkit for computer-aided translation, and we work with translation agencies, so this is something I've been pursuing for the last two or three years and now it's definitely going to get going full steam ahead. So I'll talk a little bit about the work that we have already done and a little bit about what the plans are. Okay. Just to set this up in a broader context, why are we doing machine translation? The majority of the research, especially in the States, has been on assimilation: the idea that someone wants some kind of information that's only available in foreign text. So you do a Microsoft Bing search for some technical term and you only get a Chinese webpage back and you want to know what's written in it. And if the translation you get is garbled, it's not necessarily the best translation, but if you can get the information out of it, you are happy. 
So this is the scenario where the user is thought to be tolerant of less-than-perfect quality, and this has been the focus of DARPA funded research, where the scenario is some analyst wants to find out about some document that's written in Arabic or Farsi or [inaudible], and you can kind of see where that is going. Another application is communication, where you maybe have a multilingual chat application where people talk to each other. That has the advantage that if something gets mistranslated you can always ask follow-up questions, so there's also some room for error in terms of machine translation quality. This is somewhat connected with speech recognition research. There's always this idea of a handheld device: if you travel abroad to China or Japan, you have a handheld device and you can actually ask for directions and things like that. And the final application is dissemination, where you actually have a lot of text in your own language and you want to have it published in other languages, and then you are going to basically push these translations onto unsuspecting civilians out there who don't know how it was produced, and they are not going to be happy if there are errors. This is where the action is in terms of money being spent on translation. Money is being spent on high-quality, publication-level translation, and this is currently done pretty much by human translators; it is not done by machine translation. So this is what I'll try to focus on, especially in the European context, where the goal of machine translation is to deal with the situation that the European Union has 23 official languages and is going to have one more; they're going to have 24 official languages because Croatia is going to join next year, and it's just not going to go away. People in Denmark are not going to stop speaking Danish, and they will expect all of the EU level laws that apply to them to be published in Danish and so on. 
If you can't beat human translators at producing publication quality translation, join them. I started this out with the question of how we can actually help human translators, which leads to the question: what do they actually do? How do human translators actually do it? What are the hard problems for them? Why don't they just write the translation? I need to know what stops them and what slows them down. So building an MT tool then kind of has to be geared to what the biggest problems in human translation are. Okay. So I'll talk a bit about human translation, about how we can then assist human translation, so we built a tool for human translators and we're going to talk about this, and about a user study. The last two topics I'm not sure if I'm going to get to. One is the extreme case of a human translator who doesn't know the source language at all, a monolingual translator, and at the end some work on integration with translation memories, but I don't think I'll get to that. I'll start with the study. We wanted to find out how human translators work, so what we did is just get a bunch of human translators and observe them really closely. We did it at the university, and what do we do at a university? We just hire students. We did French to English because it's a language pair where MT is pretty good and we could get access to quite a few people who speak French at the University of Edinburgh. There are French natives who study in Edinburgh, and French is one of the languages that is somewhat popular in England and Scotland to learn, so we also have English speakers who learned some high school French, or at least claim they know French. We offered them money and they said, yeah, I know French, and we said, yeah, you're hired. Okay. And that's actually good: some of them didn't actually know French all that well, but that's an interesting data point, how well do they do. 
So we had each of the students translate news stories from French to English, about 40 sentences. It's a pretty easy task. It's content that they are familiar with; it's just the news of the day. There's no specialized terminology. If you, for instance, translate Microsoft manuals, then you need to know what "mouse click" means in French, and it's not necessarily a literal translation, so you need to know all of this terminology; we didn't have those kinds of problems. And we logged exactly which key they typed at which point and what the translation looked like at that point, so we had a really good view of how they produced the translations. So here's an example of what the data we get out of this looks like. It's actually a bit of a challenge how to visualize it and what information you want to get out of it, so I'll go over it a bit. So this is the keystroke log. The input (I think one character got lost in the logging) is a French sentence, and this was the translation that was produced: "The manufacturer has delivered 97 planes during the first half." And this is the keystroke log. It goes over the time axis here, 0 seconds to 35 seconds. The height of these bars is how long the sentence was at that point, and the color of the bars indicates what kind of keystroke was done. The black ones are just a regular letter character being typed, the purple ones are when the delete key was hit, and the grayish ones, which you may or may not be able to see, are cursor movements. So what happened here? For 3 seconds nothing happened, or at least nothing observable happened. Then the person started to type happily along, with a hesitation here, up to 10 seconds. Then 3 seconds of thinking, some reflection; deleted some characters, typed something, moved the cursor around and then started typing again, deleted something again and then just kept typing for quite a while. 
Then there was a break here, and at the end, when the translation was pretty much done, the cursor was moved around, some characters were deleted and some characters added. So this is one way to visualize the data. You don't know what the translation looked like at any of these points; it's hard to actually show that in this graph. In one of the research projects we are currently doing, we actually have a replay mode where you can run the entire interaction and see exactly what happened at each point, including eye tracking, so you can also see where the person looked on the screen. But at the point of this study we didn't have that. We just have the key log, but we could actually do the replay: we know which character was typed and what the translation looked like at each point, so you could visualize that too. >>: [inaudible] you mean first they didn't [inaudible]? >> Philipp Koehn: Probably. The user could also use the mouse; this is just a web interface, so you can do anything, but yeah. >>: So you don't know the mouse movement here? >> Philipp Koehn: We don't know the mouse movement. >>: But you have it recorded. You just don't have it displayed in this graph. >> Philipp Koehn: We don't have the mouse movement because this is a web interface, and the mouse movements don't really mean all that much. Well, yeah, you could reposition the cursor with it, so we don't know where the cursor position is at each time, but if a keystroke happens we know what the translation looked like before and afterwards, so we can reconstruct where the keystroke happened. So at each event we log what kind of key was pressed and what the translation looked like at that time, but not where the cursor was on the input. Although we could maybe do that; it follows from the keystrokes. 
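The replay just described, reconstructing the translation text at each point from the keystroke log alone, can be sketched in a few lines. This is a minimal illustration: the event format (time, cursor position, key) and the special "DEL" token are hypothetical stand-ins, not the study's actual log format.

```python
# Reconstruct the translation text at each point in time from a keystroke log.
# Each event is (time_in_seconds, cursor_position, key), where key is a
# printable character or the special token "DEL" for the delete/backspace key.
# This event format is a made-up stand-in for the study's actual log format.

def replay(events):
    """Yield (time, text) snapshots after applying each logged keystroke."""
    text = ""
    for time, pos, key in events:
        if key == "DEL":
            # Delete the character just before the cursor position.
            text = text[:pos - 1] + text[pos:]
        else:
            # Insert the typed character at the cursor position.
            text = text[:pos] + key + text[pos:]
        yield time, text

log = [(3.1, 0, "T"), (3.3, 1, "h"), (3.4, 2, "e"), (4.0, 3, "x"), (4.5, 4, "DEL")]
snapshots = list(replay(log))
print(snapshots[-1])  # → (4.5, 'The')
```

Together with the sentence length and key type at each snapshot, this is enough to redraw the bar graph described above.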
And it's a [inaudible], so you could actually do cut and paste, like control-C control-V, but I don't think people have done that. Yeah? >>: Thinking about that, do you think that you would have gotten different results if you had used professional translators? >> Philipp Koehn: Probably, yeah. We looked into that much more, and one group we work with in one of these European projects has been doing translation studies (translation process studies is the official term), so they actually study translators and what they do and whether there is a difference. I think the most striking, intuitive, apparent one is how much of the input text you read before you start translating. If you have a fresh translator, they are all trained to first read the entire sentence and then start translating, and if you actually look at professional translators, they don't do that anymore because they've done it for five years. They've forgotten all of the good lessons they learned in school. They just see the first three words and start typing. There is definitely a difference in speed, obviously, and in the kinds of pauses. I'll have a bit more on all of this because we have different types of users in this study. Do you have another question? >>: If your goal was to look at dissemination and publication quality, then how do you transition [inaudible]? Why did you even bother doing the study with people who weren't professional translators? >> Philipp Koehn: Because we had them available, partly, and because we were basically starting to build a tool. I mean, they are still bilingual speakers who can translate, but it's not to the level of professional… >>: That's where things differ, because bilingual isn't necessarily a good translator. >> Philipp Koehn: No, no, I'm not claiming that at all. 
We also have a bit of a broader focus than just professional translators. The other idea, which I'm not sure if I'll get to, is the extreme case of a monolingual translator who doesn't know the source language and just wants to use a tool like this to maybe better decipher the foreign document. But we also want to use this in volunteer translation communities; there are a bunch of communities on the web who translate stuff for fun to produce content in their own language. There are quite a few groups in China that translate news stories into Chinese from the BBC or the Guardian, which wasn't officially sanctioned by the Guardian, and there is another website that we have started to try to collaborate with, Global Voices, where people are like amateur journalists and translate this type of material. So it's a bit broader than just professional translators. I think the point that is consistent among them is that they all have the goal to produce correct translations that are at least amateur level quality, as opposed to just raw MT. So you don't like the translation [inaudible] either [laughter]. Okay, I'll have another slide about translations and then I am interested in your feedback on that. So the question is who uses this. It's not just professional translators but also the [inaudible]. But where the money is is obviously professional translators and how we can help them, and they have higher standards as to what they expect. Okay. A bit more analysis. What can you get out of this? You can observe that people type maybe slow or fast and that they make pauses, and that's kind of where the action is. So what are the pauses they make and how many pauses do they make? They make a pause at the beginning, when they read the sentence, and then they might make a pause at the end, when they review the sentence and decide whether they like the translation or not. And then there are all the different types of pauses in between, so it's a bit hard to break that down. 
If it's a short pause of maybe 2 to 6 seconds, it's just a hesitation: I'm not entirely sure what the next word is. If it's a medium pause, up to a minute, they are really solving some problem, maybe rereading the source sentence, maybe reading their translation. There is something a bit bigger that is causing them to pause that long. And then there are pauses of longer than 60 seconds. >>: Do you have any information about what they did during the pauses? >> Philipp Koehn: In this study we don't. We were very bothered by that, so the work we are doing now also does eye tracking, so we at least know where they are looking on the screen. That was one of the things that left us puzzled after this: if there is a pause of, like, 2 minutes, what are they doing? >>: [inaudible] [laughter]. >> Philipp Koehn: There's a good chance of that. >>: Looking up a strange word. >> Philipp Koehn: Yeah, looking up a strange word, I don't know; looking in a dictionary. I mean, that's a big question. >>: [inaudible] translation, let me give some feedback on this one. For short pauses, like 2 to 6 seconds: you take the suggestions from the translation memory, which might be fuzzy matches, which means basically that instead of retyping the whole sentence from scratch you can actually recycle that suggestion and… >> Philipp Koehn: We didn't have translation memories here. This is just translation from scratch without any [inaudible], but anyway. >>: Sorry. The second thing is terminology checking: basically there is a need to make sure that a term is translated as in the system glossaries or whatever the requirements are, and also to verify consistency with previous translations of the same term. That can take just a few seconds to check. And for the longer pauses: if you don't know a certain term, it can take minutes to make sure that you are using the right thing. That is from my experience, of course. 
>> Philipp Koehn: Yeah, since you're here, maybe I should ask you, ask a translator. What actually takes the most time? What takes them the most time in translation? >>: I'd say terminology. >> Philipp Koehn: Terminology? >>: When you don't know exactly what term to use, that takes a lot. Sometimes checking grammar, like commas, cases maybe; you know, it depends on the language, but there might be some [inaudible] that require you to go into some resource to find whether you need a comma here or not. So those are probably the two biggest pauses. >> Philipp Koehn: Okay, great. >>: Named entities: checking, getting the name of your customer right. >> Philipp Koehn: Okay. So that all kind of depends on what kind of tools you build, and these are all things that we could build… >>: Do you have any information on where in the sentence they pause? Always at the beginning of a new clause, or… >> Philipp Koehn: We do have that information. I didn't analyze where the pauses happened temporally, besides making a distinction between beginning and final pauses, but we have all of the data. Actually, I've posted all of the data for this on the web, if you want to dig into it deeper. It feels like massive amounts of data: you have every keystroke that happened and when it happened, down to the microsecond. The question is what kind of questions you have and how to get at them. We always want this one-number answer for everything, and you have like millions of numbers, so how you distill it all is not entirely obvious. >>: [inaudible] English speakers don't line up exactly. Sometimes [inaudible] because [inaudible] of the sentence. [inaudible] >> Philipp Koehn: Uh-huh. 
That's also, I guess, like sign language translation, which is very similar to simultaneous translation, where you have to translate speech while it happens, and that's a bit of a different scenario. People do those kinds of translations very differently: they don't change word order much because they have to kind of spit out the words as they come in and they can't wait much. Here they have more time to think. They don't have to do it in real time. They usually take more time to produce the translation than you would spend on just talking. >>: [inaudible]. >> Philipp Koehn: We paid them essentially by the work here; we just gave them a flat amount of money for translating all these sentences. >>: As a professional translator, the more you do the more you get paid, so you have an incentive. >> Philipp Koehn: The incentive here is clearly to be fast also, so that makes sense. We don't pay them by the hour. Okay. Here's a big table of where time was spent. Okay, that's a lot of numbers. I will go gently over those, don't worry [laughter]. You don't have to memorize all of them. Before I get to that: we grouped the translators into two different groups, the ones that are native French and the ones that are native English. In a professional translation scenario, the standard thing is that you translate into your native language, so you need to really know the language you are translating into, and the language you are translating out of is the one that you learned in school or wherever. So we have here the L1s, which are the ones that are native French, so these are the ones that you wouldn't normally hire, and the L2s, which are the native English speakers. The total time is the time per word. Another standard way to measure this is how many words per hour; the two are kind of the inverse of each other. On average people translate between 500 and 1000 words per hour, which is relatively fast by professional translation standards. 
>>: How much? >> Philipp Koehn: 500 to 1000 words per hour. >>: That's a lot. >>: That's a lot. [inaudible] translation memories or… >>: It's 3 to 4 times as much as normal. >> Philipp Koehn: Yeah, but it's also news, so it's not difficult. All the terminology issues brought up are not necessarily a problem because there's no fixed terminology that says Secretary of State or… >>: There's another criterion. In languages like German, words are not used as a measure of productivity. They use lines, because words are combined [inaudible] [laughter]. >> Philipp Koehn: [inaudible] characters then. >>: [inaudible] languages. >> Philipp Koehn: Yeah. So this is the number of source words they translated, and then we divide the total time by that. You see some variance: the fastest ones are at 2.8 seconds per word, which is really fast, more than 1000 words per hour, and the slowest one was at 7.7 seconds. So let me try to allocate the time to activities. We had these different kinds of pauses and keystrokes, and if you are really strict about it, a keystroke doesn't take any observable time: you press a key and that's it. It doesn't even take a second; it takes no time at all, it's just a point in time. So the way we define it, the second before and the second after a keystroke are part of the typing activity. We break down all of these typing activities, and only if there is no activity for at least 2 seconds, time that is not part of the ending interval of one typing burst or the beginning interval of the next, does the time in between count as part of a pause. So there is a typing period, then a pausing period, then a typing period, then a pausing period. Okay. And based on these intervals we can allocate how much time was spent on each activity. >>: And then you normalize the numbers [inaudible]. >> Philipp Koehn: Yeah. 
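The interval definition just described can be sketched roughly as follows. The one-second padding around keystrokes and the 2-second threshold follow the description above; the pause buckets follow the rough categories mentioned earlier in the talk, and everything else (input format, exact boundary handling) is illustrative, not the study's actual implementation.

```python
# Allocate time between "typing" and "pause" activity from keystroke timestamps.
# Per the talk's definition: the second before and the second after each
# keystroke count as typing; only gaps of at least 2 seconds contain a pause.

def allocate(timestamps):
    """Return (typing_time, pauses), where pauses is a list of pause lengths."""
    typing, pauses = 0.0, []
    for prev, curr in zip(timestamps, timestamps[1:]):
        gap = curr - prev
        if gap < 2.0:
            typing += gap          # keystrokes close together: all typing
        else:
            typing += 2.0          # 1s after prev keystroke + 1s before curr
            pauses.append(gap - 2.0)
    return typing, pauses

def bucket(pause):
    """Rough pause categories from the talk: short / medium / long."""
    return "short" if pause < 6 else "medium" if pause <= 60 else "long"

t, p = allocate([0.0, 0.5, 1.2, 9.2, 9.7, 80.0])
print(round(t, 1), [round(x, 1) for x in p], [bucket(x) for x in p])
```

Summing the pause lengths per bucket, per translator, gives exactly the kind of breakdown shown in the table being discussed.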
So you just measure all of these times, and at the end, to make the numbers comparable, you look at the total time spent: for the first translator, 3.3 seconds per word, and this is how it breaks down, where they spent all this time. So let's just go over that. Not much time was spent by these translators on the final pause. Once they were done with the translation they were happy and moved on; they didn't spend a lot of time rereading it. They didn't spend much time on the short pauses, these 2 to 6 seconds, and they also were not all that different in how much time they spent typing, so there was not a big difference between slow typists and fast typists. The big differences between the translators were in how much they paused: the medium and long pauses, especially the big pauses. Some translators didn't have any big pauses at all; they never paused for more than 60 seconds, they just always kept typing. And the second translator, probably the worst translator in terms of speed, spent 2 seconds per word on big pauses alone. So this is where the good and the bad translators differ: how much do you have to think about the whole thing? >>: Do you have a corresponding table [inaudible] of the output? >> Philipp Koehn: We'll get to quality, yeah, we'll get to that. There was a concern, and I'll break this down a bit more in the next 10 or 20 minutes. >>: The usual [inaudible] for translators is considered 150 to 300 words per hour, so the second translator is actually pretty much within those boundaries; everybody else seems to be translating much faster than normal translators [inaudible]. >> Philipp Koehn: Yeah, 300 words per hour is 12 seconds per word, so these are all [inaudible]. Again, the question is how good the quality is and how difficult the task is. I think that the task is not that hard. 
When I do this to a quality level to my satisfaction, I get similar speed, because it's news; there's not so much specialized vocabulary and so on. Okay, I think I'll skip this one here. You can also have a graph like this where you can track how much time was spent on pauses of a certain length. With this formula here you just accumulate: you add longer and longer pauses and see how much total time was spent, and you get something out of it. So this is a person who spends a lot of time; they are not really much slower if you just consider short pauses in bulk, but they spend a lot of time in these long, long pauses. It's a pretty colorful graph, but it's a bit of a challenge how to actually visualize this, and the distinction of up to 6 seconds being one thing and up to 60 seconds being something else is rather arbitrary, somewhat questionable. Okay. I'll dig a little bit more into all of these numbers, but first let me go to the main point of all of this work, which was how to build systems that assist human translators, and then I'll dig more into how they help, how much time translators spend, the quality of the translations and all that. Maybe a spoiler: the quality didn't differ too much. No, I'll get back to that. Quality is the issue and we'll get back to that. And the translators differ in quality; I can say that. So we tried three different types of assistance. One is sentence completion, which is kind of an auto-suggest facility: the translator types in the translation and the tool makes suggestions, the next word should be this, the next phrase should be this, and the user can just accept a suggestion and it produces that section. So it's one phrase at a time. So what does phrase-based mean? 
If you know phrase-based MT, that's what we mean: these are short n-grams that are used by the phrase-based model, so it's a reflection of how the phrase-based model produces translations, and it just spits out these phrases from the phrase-based model. Translation options, the second type, is also very closely tied to the way the phrase-based MT system works: it gives suggestions for single words and phrases and ranks them. I'll have a visualization of that. And the third one is kind of the default: if you don't do anything smart about integrating MT, you just give the human translators the MT translation at the beginning and say, fix it up. This is kind of creeping its way into the industry, where translators increasingly get confronted with MT output instead of translating from scratch or from translation memory, and they are not necessarily happy about it. This is actually a tool that is online; you can try it yourself. So this was developed in Ruby on Rails and Ajax 2.0 and all kinds of Web 2.0 and MySQL and PHP; no, this was written in Ruby, and its back end is a Moses machine translation engine. Go to the website and try it out. The browser compliance is not as good as I would've hoped at this point. It works best in Firefox on a Mac, but some of the formatting things need work. I'll demo it a bit later; there are probably a few things I should fix up. We are not going to develop this tool much further because in the new research project we collaborate with other groups who also have their own tools and we decided to just start from scratch and build new tools, so it is as it is for now. So this is how the sentence prediction looks on a very, very short sentence, just a headline. You have in green the input sentence, the headline of this news story. You have a text box that is in orange; it's just a regular HTML text area. You can do whatever you want in the text area. 
And you have a suggestion of what the next word should be, in red. It's not rocket science; it just comes up with Newman, and you can type and it's going to make the next suggestion. The user accepts the suggestion by pressing tab, or they can just type in their own translation. If they are typing in a different word, the tool kind of thinks about it again and makes a new suggestion. There was a project 10 years ago, TransType, that was done by Canadians at [inaudible] research and some other groups in Europe, that came up with this originally, and there were some people who kept this strand of research alive, although it never really made a huge breakthrough, and we are now trying to revive it a bit. Okay, so how does this work? We first run the input sentence through the machine translation engine and we create the search graph, so we have the entire search space that was explored by the machine translation decoder, and we try to match what the user typed in against the search graph. You could also just rerun the machine translation with this prefix, but that would be too slow; this is something where you don't want to wait at all. If the user types in some characters you really can't even wait a large fraction of a second, so it has to be really, really fast coming up with new suggestions; there can't really be much of a wait period. If it takes a second, that's too slow, and we couldn't have done that by rerunning our kind of machine translation engine back in the day. That's why we operate on the search graph: working on the search graph is much faster. So there are two criteria. We want to find the minimum edit distance match to what the user typed in; we might not have exactly what the user typed in in our search graph, so we want to find something that has minimal string edit distance. That's the number one criterion. 
If there are multiple paths with the same minimum string edit distance, we take the highest scoring path, the highest probability path. The search graph is precomputed and stored in a database, the matching is done on a server, and the browser makes these requests. Typically it takes less than a second, and usually it's much faster than that. At some point I'm going to demo this; let me just go through all of the things that you're going to see on the screen before that. Okay, that was number one. Number two was the translation options. You have the same input sentence here, and displayed are the top translations according to the model. It shows you word translations and phrase translations mixed: the top line is the word translations, and then the phrase translations. The user can just click on these so they don't have to retype them; you just click on all of this and you build your whole translation [inaudible]. How does that work? It also leans very heavily on the phrase-based MT system. There's a phrase translation table, and we score these phrase translations not only with translation probabilities but also with an estimate of the language model cost for each of them. It's a lot like the future cost estimation in phrase-based MT, where we try to figure out what the easy and the hard parts of the sentence are so that your search doesn't get lost and do only the easy parts first because they look most promising. So it's very similar to the outside cost estimation. This is how the tool looks. Let me try it; this is a really big screen, so this is actually going to work pretty nicely. This is how it looks for a French sentence. It's very short, so it actually fits perfectly on the screen. On the left you have the source. In green is the sentence I am currently translating. In the middle is just information, the raw MT output, and on the right is what you have already translated. 
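The two matching criteria (minimum string edit distance first, model score as the tie-breaker) can be sketched like this. For illustration this matches the user's prefix against a small flat list of candidate translations with made-up scores, not against the decoder's actual search graph, which would be traversed with the same criteria but far more efficiently.

```python
# Match what the user has typed so far against candidate translation prefixes,
# a stand-in for matching against the decoder's full search graph.
# Criterion 1: minimum string edit distance to the user's prefix.
# Criterion 2: among ties, the highest model score.

def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def best_match(user_prefix, candidates):
    """candidates: list of (translation, model_score). Return the best one."""
    def prefix_distance(translation):
        # Compare against the candidate truncated to the prefix length,
        # so the unseen remainder of the translation is not penalized.
        return edit_distance(user_prefix, translation[:len(user_prefix)])
    # Minimize (edit distance, negated model score).
    return min(candidates, key=lambda c: (prefix_distance(c[0]), -c[1]))

candidates = [("the manufacturer has delivered", -2.1),   # made-up scores
              ("the maker delivered", -2.5),
              ("the manufacturer delivered", -1.8)]
print(best_match("the manufac", candidates))  # → ('the manufacturer delivered', -1.8)
```

Here two candidates match the prefix exactly (distance 0), and the tie is broken in favor of the one with the higher model score, mirroring the two criteria in the talk.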
Also, I just deleted the translation so you kind of build up this part here. It's probably not the best way to structure this and we are doing it differently in the new tools we built, but the main thing is down here. So you have the source sentence. You have the string edit difference to the post editing and you have here the text box, so here it proposes that you start with Sarkozy and you can just accept Sarkozy at the meeting of fishermen angry. Okay, that's not a great translation, but you kind of see how these things came up. If you want to actually do something different, so maybe at the meeting this -- I don't know. Any other suggestions? Anybody know French? At the meeting, maybe with angry and then hopefully it's purple if it's fishermen. Oh yeah, it does. So you kind of see how it -- you don't like -- you are not happy with the translation. >>: [inaudible] at the meeting, just meeting [inaudible]. >> Philipp Koehn: Oh, just meeting, and then it should say angry fishermen and it doesn't [laughter]. Maybe let's see a demo of when I click on something. I can click here on angry and it pops in. It's somewhat smart about upper-casing the first word in the sentence and adding commas and periods and all that, but it's not super perfect. Angry fishermen and then it's done with it. >>: [inaudible] stop correcting? >> Philipp Koehn: Yep, so you can do whatever you want, but the prediction is set up so that you do it left to right. This is a text area. If you want you can just type fishermen and then you can do whatever you want and you can even cut and paste things. I mean you can do what you want. So you can use the tool; it doesn't hinder you from using it any way you want. It doesn't force you into any operation, but to get benefits from the prediction -- I don't know what happens down here. It is of course a bit lost. You actually see the shaded-out part?
This is what the machine translation system thinks you have currently translated, so it thinks you already translated Sarkozy with that first scribbling stuff here. >>: [inaudible] fisherman [inaudible] produces [inaudible] >> Philipp Koehn: Yeah, [inaudible] with some idea about where [inaudible] paste this [inaudible] where did this come from? And its best hypothesis is it got Sarkozy wrong and [inaudible] is the right word, so it's a substitution, so there is a string edit distance of one to this path in the search graph. There's no way -- you have to do something with that, so it's going to be an edit distance of one to the search graph because you have a new word in there, so it tries to come up with the best explanation, which in this case probably is a substitution of Sarkozy. Okay. So that was a lot of fun. Back to the talk. So the final thing is post editing of MT and there's really not that much to say about it. You just get in your text area the MT output already pasted in, and you can fiddle with that any way you want, and in this bluish area it gives you kind of a visualization of the string edits so you kind of see which words you deleted and which words you inserted. So this sentence was corrected with a few things -- the MT output had them as an interpreter; an actor is probably a better word in English, and there was also years in there, and it mistranslated the title of the movie because I guess in French it's just called the Kid and not the Sundance Kid. You have to get the proper English title. So we have the same setup here now, so we have 10 translators. Actually, the study I reported earlier was just part of this study; it wasn't a separate study. We had them translate these 40 sentences under different conditions and one of them was without any assistance. Same people and we have five different conditions. The unassisted condition is what I originally talked about, where they didn't have any help at all.
They just had the text area and that's it, nothing else on the screen, just the source obviously. The first thing we tested was prediction, which is what I mostly demoed, making suggestions; then the options, where you can just click on all these words; or both of these things. And the last condition was post editing. So they just had the output of the MT and they could fix it any way they wanted. So they had blocks of 40 sentences under each condition and each translated them, and we rotated things around so that each text block was translated under all of the different conditions by different translators. Obviously no translator translated the same block twice under different conditions because they would have already known too much about it. So we are concerned about quality. We are mostly concerned here about speed. We want to have faster translators. We don't necessarily want to have worse translators, so we thought we were just going to do a very simple quality assessment where we ask judges afterwards, is that a right translation or not. Because these were human translators, they should be 90% right. It didn't turn out that way. So we just asked them, is that -- what's the wording in here? -- a fully fluent and meaning-equivalent translation of the source? Sounds like a straightforward question. And we showed it also in context, so if there was some confusion about, you know, which pronoun it referred to and so on, you could figure it out from the context. Okay. To our surprise we got about 50% correct, and I'll show you the slide with one sentence and we can argue about whether the judges were too harsh, which was my impression, or all of our translators are really crap. Some of the same students and other students did the judging, so also nonprofessionals. Just to give you one example. >>: So did you have the quality judged by the same guys that produced this? >> Philipp Koehn: Yes. >>: What difference [inaudible]?
>> Philipp Koehn: Some of them -- there was a mix of people but there was some overlap between the two groups, yeah, so they didn't know who produced which translation. Maybe they recognized their own translation, and maybe not. >>: The groups could be reviewing their own translations? >> Philipp Koehn: They could have reviewed -- yeah. So they were actually given -- let's see how we actually did that. I think we showed them all the 10 translations like here, from the different translators under different conditions, and they had to say about each one whether it was correct or wrong. So there's a concern if they see their own translation and say that's the one that's right and everything else is rubbish. But in the end… >>: The blind is judging the blind. >> Philipp Koehn: Yeah, yeah. [laughter] I'm not going to go down that road. Anyway, so here's a sentence. So this, maybe it's a somewhat French sentence. It started with the MT system, which came up with: without dismantle he has been concise and accurate. So "without dismantle" is kind of the biggest struggle here, how to get that right. And these are the different [inaudible]; you see here who did it. Again, the L1s were the ones that were French native and the L2s are the ones that are English native, and these are the different systems. The first striking observation, which is always stunning, is that even though these people were very much primed by the MT system, by it showing them translation options, so they had a very similar mindset on how they should translate the sentence, they still all came up with very different translations. If you give 10 people a sentence to translate, they all come up with different translations, even for short sentences like these.
Even in this scenario where two of them are just post editing from the same source, and others were shown all these options and they all were kind of steered towards certain vocabulary to use and all of that, they still all… >>: What is a good translation and all… >> Philipp Koehn: Oh yeah, so my impression -- just to finish the description of this slide, the first number, in green, is how many thought this was the correct translation and the red is how many thought it was a wrong translation. So what's -- let's start with the third one. Without fail he has been concise and accurate. >>: [inaudible] >> Philipp Koehn: There's the first one that three people thought was not a good translation. I don't know French, but just from the general context of the article I thought that was actually not a bad translation. I mean if our MT system would produce that we would be perfectly happy. Without getting flustered he showed himself to be concise and precise. Everybody liked that one. Sometimes, yeah, it's human judges also, so you have [inaudible]. He showed himself concise and precise, so this first weird thing was just lopped off [laughter] and two people said yeah, that's all redundant; it doesn't mean anything. >>: The native French speakers and -- how good was the English of the native French speakers? >> Philipp Koehn: So they were university students at Edinburgh, so the English was good enough to attend university classes, so it's not… >>: Were they taking English lessons? >> Philipp Koehn: No, they were just regular students at the university. They were not taking English lessons. >>: [inaudible] enrolling at the University [inaudible] at the same time. >> Philipp Koehn: Yes, so here is the output so you can judge -- I mean some of these, I think the L1s are the ones that were produced by the French native speakers, so there are problems with the grammar. There's also always effort, how much did they really try.
With post editing you can always just say yeah, whatever. I'll make my money; I'll just say yes to everything. But here are the different translations. >>: [inaudible] each one predicts there, each translator produces their own translation. There was no convergence. There was no two people producing the same… >> Philipp Koehn: No. And that's absolutely, you know, that's absolutely standard. So this is absolutely typical translation behavior. I don't think you'll find any sentence where -- maybe on a very short sentence sometimes two people came up with the same translation, but then the other eight came up with different translations. >>: I think that's one of the areas where you might find the difference between professional translators and amateurs. If you have people who are used to working for, say, Microsoft, when we did technical documentation we strove for standardization. >> Philipp Koehn: Yes, I think for technical documents there are much more guidelines as to how you formulate things and what tense you use, and, you know, it's standardized terminology. >>: Mostly yes, but we're getting away from the very structured and dry language. >> Philipp Koehn: But I also hear stories from someone who works at the European Parliament or the European Commission with human translators, and we hear the story where they showed a translation to someone and he said this is all wrong and this should be changed, and they said yeah, but that was your translation from yesterday [laughter]. >>: I still find this priming from the MT output surprising. The n-gram overlap with the second clause there is pretty strong for a number of these, so clearly where the MT got it better they tended to keep that. Whereas where the MT got it wrong, that first clause there, that's where you see the biggest divergences. >> Philipp Koehn: Yeah.
So some of them -- so if they have been concise and accurate, especially if they did post editing, did they keep that? He has been -- so there is post editing -- showed himself to be concise and accurate. I mean even there they changed it maybe more than necessary. I thought these translations were -- one thing I wanted to stress here: I didn't think that these were all that bad, but we [inaudible] get all these numbers with accuracy of 50% and that's where they otherwise come from. >>: [inaudible] did the students previously before doing that [inaudible] >> Philipp Koehn: No, we were just saying try as best as you can. You get a fixed amount of money to do this, and they were very well paid actually, and yeah, try to produce good translations. They were not told, use the tool as much as you can. They were not told, you know, when you post edit, don't just delete everything. And we will get into the behavior in a bit more detail, so people behave drastically differently and some people just basically didn't use the assistance we offered. They just always typed in the translation and completely ignored all of the other options that were given to them. >>: [inaudible] >> Philipp Koehn: On each of these sentences, yeah, this is probably a pretty average sentence here. Average time was three or four seconds per word, so this is a 10 word sentence roughly, so it was done in less than a minute. >>: [inaudible] variations between people [inaudible] unassisted and… >> Philipp Koehn: I'll get to that. I'll get to that. Yes, so yes, that was the main point, you know, how much faster they were with these things. So this is just, yeah, these are human-produced translations and even those are judged harshly. >>: [inaudible] where they do better in quality than the ones that do… >> Philipp Koehn: Yeah, I'll get to all that, yeah. So that actually was the main point, so I'll get to that now: quality and speed, with the assistance and without the assistance and all that.
This was just to stress what the metric means, so don't yell at me that I only got 50% of sentences right. Well, this is what that 50% means. Okay. This is kind of the one way to summarize this: average speed over all translators and then broken down by the different conditions. Unassisted, 4.4 seconds; with post editing, 2.7 seconds. Given the options, 3.7, and prediction 3.2, and both of these 3.3, so they are faster under all of these conditions over… >>: So that is like really surprising, right? Because everyone hates post editing [laughter] so [inaudible] you get from probably half [inaudible] [laughter] post editing is a pain in the neck and slows you down and blah, blah, blah, and so before you put this slide up I would've predicted that all of your other options would have been better in fact than post editing. >> Philipp Koehn: Yes, and you wouldn't be the only one to think that, because we asked -- I'll get to that at the end but I can say it now. We asked them afterwards, what did you think was more helpful? What did you think made you more productive? And they didn't like post editing [laughter]. They said this is really crap, but if you actually look at the raw numbers, oh… [laughter] [multiple speakers] [inaudible] >> Philipp Koehn: 400. So this is not now -- yeah, how many words is this and how many judgments are there? Yeah, I'll get back to you on that, yeah. >>: There's another thing, you know, these are not professional translators. They are just regular students. They would first of all not be that good in terms of quality even with assistance. The [inaudible] would be much better unassisted [inaudible]. >>: I'm actually not even concerned with the quality. I totally agree with you on that. I'm just looking at the speed. >> Philipp Koehn: Quality, by the wayside [inaudible]. >>: Before you put the slide up I would have predicted that the speed would be much better with the three sort of assisted options as compared to post edit.
>> Philipp Koehn: And that would've made us happier too. >>: But you only did 40 sentences per person, right? >> Philipp Koehn: Yeah. >>: And so these people were not only trying to do this task, but learn a novel user interface with a lot of complication, so are you going to address the learning curve? >> Philipp Koehn: I'm going to show 10 more slides on exactly that question and what kind of people they are and all of that, yeah. [laughter] So first off, this is just another way to break things down. >>: Do it all at once? [laughter] >> Philipp Koehn: Do it all at once [laughter] and then I'll have a bit more breakdown of this. So this is kind of all-in-one, so these are the 10 different people, and green are the boxes where they were faster and better with assistance, and red is where they weren't -- well here, this one was slower and worse. And yeah, this is the raw number of the difference. The red ones were where they were slower and worse, and the white ones where they improved in one and not in the other. Okay. There's a bunch of green here, but let's just look at all of these people and kind of break it down. So two people we would characterize as slow translators; these were the slowest ones to begin with and they were also not very good. So this person here, 10% of sentences correct, so they might actually not know these languages all that well. So they improve drastically, so if someone doesn't know how to translate very well, either because they don't know the source language or the target language very well, they are greatly helped by this tool. These are people who were not as slow as the previous ones but still rather slow, and they definitely got faster but not necessarily better, but they were around 50% to begin with, so that's not much change. So they were faster rather than better. Two of them were fast translators to begin with and they got even faster and better.
We had four people labeled as refusing. We had the keystroke log, so we knew what they were doing, and they never pressed tab. They never accepted any of the predictions. They never clicked on any of the options. They might've looked at them, we don't know that, but they just didn't use the assistance. So it's not a total surprise that the systems didn't help them. So these are the people where, if you look at the logs, they were the same speed with prediction, with the options, and with no assistance, because they just didn't use them. The only thing they couldn't get away from was the post editing, because when they opened the sentence it was already in the text box. So at least they had to do something about it. Maybe delete it in one big go, but if you look at the breakdown of the numbers they didn't do that, at least not always. And so all of the arrows point down, so they all got faster. So even these people who were not totally convinced by all of this newfangled technology and also did not like the post editing all that much got faster with post editing. The quality got… >>: One of them, two of them got better. >> Philipp Koehn: And this is a champion translator here. This one had the highest percent right and was almost the fastest -- one of the fastest for sure -- and the arrows go in the right direction. >>: And so were these options [inaudible] random order? >> Philipp Koehn: They were given [inaudible] so they worked on one block at a time. >>: So post editing, [inaudible] translation by [inaudible], followed by post editing, followed by -- because they may have gotten acclimatized to the task and could have… >> Philipp Koehn: So they were not given any specific -- so the way that it was presented to them in the tool, they were not forced to do it in the same order as we presented it to them, so they had like a list of do this task or this task or this task.
The order was kind of mixed up for different people, but they did one condition all the way through, so they translated the entire news story under one condition and then they went to the next news story, and that may be a different condition or the same condition. So it wasn't completely randomized but -- okay. So we have some more analysis on this, answering some of your questions. We did this of course because we have this keystroke log, and here's one person who did the sentence prediction, so we have a new color, red. Red is pressing the tab key, accepting the prediction of the MT system, and this is a somewhat representative way, one way, to use this prediction. So you kind of do post editing via the tab key, but you do it in a way that you produce the sentence as you read it, and only when you don't like something anymore -- so up to here it's exactly the one-best translation of the MT system, but it was just kind of read through. At this point the person was like, okay, something went wrong; I don't like this. Some deletions, adding a letter, a lot of moving around here, like deleting and adding and so on. Then a pause, and then the tool makes new suggestions and the person accepts them all, maybe even until the end of the sentence, and then does a revision on that part as well. So it's kind of doing post editing, but it's a bit more interactive since the user controls what pops up in the text area. And certainly this continuation here might be better than what was in the original MT output, or definitely better suited. Okay. This is how much time is spent on these different activities. This is now just the one user averaged over everything, so what do we highlight here? This is how much time they originally spent on typing, so this of course goes down when you don't have to type everything anymore, especially in post editing where you have to type less, or even in prediction.
And how much time was spent on the other activities, so slightly less. So a time-consuming part of translating is actually typing in the translation, so you reduce that time a bit, but the biggest difference is the time spent on pauses went down, quite a lot. So yeah, it reduced a lot of these pauses, especially the big pauses, so this was one that spent like 2 seconds on very long pauses and that just didn't happen afterwards anymore. Okay. This is another one; that person really didn't use the assistance all that much, so they spent less time on tabbing and clicking, and I'm not sure if it shows that, but you can actually see where the characters in the final translation came from. Did they come from any of these prediction activities or the actual MT, or were they typed in? You can look at that too. And in this case, when both of these options were given, this one actually never clicked and never pressed tab, not once that we noticed. So only in post editing was the person forced to use the tool, and they spent dramatically less time typing, not totally surprising I guess. They spent a bit more time on these pauses, so you have to read chunks at a time, so this is pauses between 6 and 60 seconds, so you have to read maybe for 10 seconds, 15 seconds, something that shows up here. So overall slightly faster for this person. Yeah, this is what I just alluded to. This is actually a reflection of where the characters came from. So if you follow all of these typing events and tabbing events and clicking events, you can actually track each character in the final translation and figure out where this character came from. Was the character typed in? Was the character generated by clicking on one of the options? Did it come from the original MT, and so on? So in post editing this person changed 18% of the characters, and that includes deleting something and typing exactly the same thing again.
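Character provenance of this kind can be reconstructed by replaying the log. The event format below is a simplified assumption of mine, not the tool's actual log format:

```python
def replay_log(events):
    """Replay a simplified keystroke log, tagging every character of the
    final text with its origin. Events are ('mt'|'type'|'click'|'tab',
    position, text) for insertions and ('delete', position, length)."""
    origin_of = {'mt': 'mt', 'type': 'typed',
                 'click': 'option', 'tab': 'prediction'}
    chars = []  # list of (character, origin) pairs, edited in place
    for kind, pos, arg in events:
        if kind == 'delete':
            del chars[pos:pos + arg]
        else:
            chars[pos:pos] = [(c, origin_of[kind]) for c in arg]
    return ''.join(c for c, _ in chars), [o for _, o in chars]
```

For example, pasting the MT output, deleting a word, and tab-accepting a prediction leaves each surviving character labeled 'mt', 'typed', 'option', or 'prediction', from which percentages like the 18% above can be computed.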
And you can see where the other characters come from. The ones typed in were the highest here and also [inaudible] low for the others. Okay. This is the second one. We know he didn't use these options; 100% of the characters were typed in. Also, when these types of assistance were given, he used them only partially. Yeah, this is the pause graph; we can just ignore that. Here, this is the question you had, the learning curve. So we didn't have a training phase, but they used it continuously over 40 sentences, which is not a huge training period, so they may have spent maybe an hour or so -- I should figure out the math exactly, but they probably spent an hour or so on each of these conditions. >>: That is fascinating, because you would think that the learning curve would be more for the prediction and the options because there is actually something there to learn, like hey, I've got these options, these predictions, but the learning curve is actually better for post editing, where you would think that there is the least to learn. >>: Unless the [inaudible] over time they have to do… [multiple speakers] [inaudible] >>: No. You learn what kind of mistakes the system makes and then it's very easy to [inaudible] for example, if they insert articles [inaudible] very easy for you to [inaudible] [multiple speakers] >>: But in terms of the tool there is actually something there to learn in the tool. I mean you're kind of learning what tab does and learning some suggestions and some stuff and some clicks. >> Philipp Koehn: So that's clearly at the beginning. In the first five sentences there's a dramatic decline.
>>: [inaudible] [multiple speakers] >> Philipp Koehn: So what kind of drops out of here is that if you translate a story, maybe the first sentence just takes longer because you're just not in the mindset of the story and it takes longer to pick up, and then three sentences later vocabulary repeats itself. So these were multiple stories but they had different lengths, so there's no middle bump where after 20 sentences the next story starts, because the stories all have different lengths. >>: What about [inaudible] at the end there, or in the middle? Were people just getting bored and spending lots of time [inaudible] >>: [inaudible] new story. >> Philipp Koehn: I don't know. I cut it off here because the shortest kind of aggregated story length was 31, but it kind of keeps -- so they translated 40 sentences under each condition, but these might've been multiple stories and they were different stories, so it wasn't always exactly 40. It was sometimes 45 and sometimes 32, so apparently 32 was the absolute minimum block of stories, so these curves go a little bit longer, but then I don't have 10 numbers anymore to be [inaudible]. >>: Do they stay sort of flat or do they continue? >> Philipp Koehn: That's a good question and we should do more work. So we should probably do a proper study of how this actually works, with a proper training phase and all of these things. So this is just some glimpse on -- I think the one important thing I learned from this is that the unassisted condition didn't really change all that much. So unassisted, yes, maybe at the beginning of a story you are slower, so you drop from 6 seconds to 4 seconds, but then you kind of stay there. While with the assistance, at the beginning you were not that much different, but then it kind of keeps going down, so there is a training effect, so if you did this properly, maybe your numbers would look more optimistic than what I presented. And yes, usually these people should be better trained and all. Okay.
I already talked about this user feedback. We asked them a very quick question here: in which of the five conditions did you think you were the most accurate? And they liked all of our stuff. They didn't say unassisted anywhere there [laughter], or that your tool was just distracting me from the right answers. And rank them on a scale of 1 to 5 by what you thought was most helpful, and here we didn't fare so well. And that doesn't match the empirical results. So this is pretty clearly -- the first one you could still argue about what they mean by accurate, but here it's what did they think was the most helpful. At least when we define the goal of this study as producing good translations fast, the answer would've been post editing, but if we ask them what they think… >>: [inaudible] communities and all of that the goal is to produce the maximum number of translations. >> Philipp Koehn: Isn't that the same thing? >>: No. Because if your tool is annoying enough that people drop out then… >> Philipp Koehn: That's a good point. That's a really, really good point. So just to give you an actual fact in that space: if you do simultaneous translation from speech, you do that for 15 minutes and then you have a break for 15 minutes or so, because it's just too stressful, so you can't keep doing this. So maybe this post editing is just much too stressful, so that if you do it for half an hour you're really going to have to take a half an hour break, and this isn't something we measured, because they could stop at any time and we just always measured time on a single sentence. If they just turned off the browser and went back an hour later, we didn't know that. So that probably requires a bit more study on the cognitive load and maybe what's annoying and so on.
So… >>: [inaudible] also what you presented was something where you are forced to edit right away, which is far less user-friendly than if you actually get options so they can customize the [inaudible]. >> Philipp Koehn: And that totally reflects my experience with it. It's just much more fun to build the translation, even if you just do this tabbing; you actually do post editing, but you have full control over what pops up in your text area and you're not just confronted with this 30-word chunk that you have to go over. It's a very, very different mindset of doing this task. In one you feel like you are creative; you build a sentence, you can weigh nuances; and in the other one you are just basically fixing some other people's mess. And it's not even other people; it's a machine's mess. >>: You made a good point about [inaudible] so that may vary, which option works best, very drastically by sentence length. Because for really long sentences post editing might be much harder than shorter sentences, because it's worse or it starts… >> Philipp Koehn: You could figure that out. The prediction stuff works less and less well the longer the sentence is, due to the technical problem of having to match the user prefix to the search graph. If you have too many edits then the search for finding the minimal edit path actually gets too slow to be used. >>: [inaudible]. >> Philipp Koehn: Yeah. So string edit distance isn't really the right metric to do this, because it doesn't account properly for moves, but then if you do anything else you have a computational problem of matching -- if you use something like TER, what is the minimum TER edit cost of a 20 word prefix against a big search graph? >>: It's sort of orthogonal because you can incrementally compute these things and just -- I mean if you look at [inaudible] words that [inaudible] and store [inaudible] programming [inaudible] this stuff. >> Philipp Koehn: Yeah. I haven't talked about the algorithms yet.
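The incremental dynamic programming the questioner alludes to is easy to sketch for plain edit distance against a single hypothesis: keep one DP row for the typed prefix, and each newly typed word only extends that row instead of recomputing the whole matrix. The function name and the word-level granularity are assumptions of mine:

```python
def extend_row(prev_row, target_words, new_word):
    """Extend a Levenshtein DP row by one newly typed word: O(len(target))
    work per word instead of recomputing the whole edit distance matrix."""
    row = [prev_row[0] + 1]  # a typed word against the empty target: insertion
    for j, t in enumerate(target_words, 1):
        sub = 0 if t == new_word else 1
        row.append(min(prev_row[j] + 1,          # user inserted a word
                       row[j - 1] + 1,           # user skipped a target word
                       prev_row[j - 1] + sub))   # match or substitution
    return row

target = ["angry", "fishermen"]
row = list(range(len(target) + 1))  # distances for the empty prefix
for word in ["angry", "fishermen"]:
    row = extend_row(row, target, word)
```

After the loop, row[-1] is 0, a full match. Against a search graph rather than a single string, one such row would be kept per graph node, which is where the bookkeeping gets harder.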
It doesn't actually do incremental dynamic programming, but it still -- for the phrase-based case it works. One thing that I'm struggling with is doing the same thing for a tree-based model [inaudible] forest, where you have to match the user input and the string edits against the forest. It's not a trivial problem and it's still not as fast generally as I would want it to be. It takes -- this is implemented in C so it has to be fast. >>: [inaudible]. >> Philipp Koehn: Oh yeah, basically that's kind of what you have to do. But if you have any views on that… I can show you where I am. >>: [inaudible] the graph wouldn't be any faster really than [inaudible]. >> Philipp Koehn: Oh yeah, at some point you should really just re-decode the sentence with forest decoding. >>: [inaudible]. >> Philipp Koehn: Yeah, then the graph matching would be easier. So for a long sentence we should do that, but just from the machinery of setting that up it gets trickier. >>: Sorry if I misunderstood, but can you populate the prediction simply with the next word? Instead of showing the post edited, just always show -- you don't compute the minimum path, but you just show the next word that the machine would translate at that point, just take the post edited. >>: Well, that is the suggestion. >> Philipp Koehn: That's what the suggestion does. >>: [inaudible] suggestion [inaudible] >> Philipp Koehn: You look at the… >>: But you are talking about finding the path, which is, how do… >>: How to know what the next word is, that's the answer. >> Philipp Koehn: Let me do this here, meeting. So it has to match, Sarkozy meeting. >>: I'm just thinking of getting the machine translation once and then just kind of, you know, showing the next word, so that you are effectively post editing but you are… >>: So I think the issue is that if you look at the MT output here, it says Sarkozy at the meeting of fishermen angry. >>: Right.
>>: And so if I said Sarkozy at the meeting of angry, then it would say the next word would be angry. >>: Angry. Right. >>: So if it were a monotone problem and not a reordering problem, then… >>: You are right, sorry. >> Philipp Koehn: Yeah, well anyway, yeah. >>: [inaudible] make your other slide viewer on [inaudible] feedback. Do you distinguish [inaudible] whether the options would be different if the students were native or non-native? >> Philipp Koehn: It could be. It's all reported in the journal paper. We might've broken it down, and maybe not. But then it's a small sample if you get down to that level, because here it's 10 people and you average over only a few, so it's really only individuals and any statistical significance flies out the window. Okay. We have 12 minutes left. I promised I'd get through the first one third of the talk and I'm happy with that. So let me just summarize what we just discussed. People got faster under all of the conditions, and we especially reduced the big pauses; we reduced typing effort a little bit, and with post-editing for sure. They also got slightly better, although I don't want to make any big claims about quality because I'm not happy about the human judgment on this one. I blame the judges. Even the good translators got better with post-editing, and I think that's somewhat good news. Some of the good translators -- we had four -- refused to use the tool at all. That's the general problem: if you give people a tool and they are used to one way of working, why would they actually change the way they work? And they were the fastest, and to some degree also the best, with post-editing, but they didn't like it. So now there are two ways to go from here. One is to make our systems better so they catch up with post-editing, or to make post-editing more fun. We are trying both of these things. So there are various ways we can improve post-editing. >>: Question, what is the goal? 
Because if the goal is to get a better way to assess the quality -- because in the end, in this experiment what you got was not necessarily usable [inaudible] 50%. If that's an acceptable [inaudible] how good the software [inaudible] >> Philipp Koehn: I think that number is too low to count as success. >>: I think that was being too harsh. >> Philipp Koehn: Yes, way too harsh. That's my take; it was way too harsh. But it's always easy to find a mistake in a translation, I think. >>: But you could change the question too. >>: Or you could use a different scale. >>: [inaudible] >> Philipp Koehn: But still, yeah, you always [inaudible] >>: You can do a BLEU score against the other nine translations. [laughter] >> Philipp Koehn: So yeah, I actually believe that they are going to say 90% is fine, and here the person was just too lazy or didn't pay attention or just didn't know the language well. Otherwise why would they actually get anything wrong? They are human translators. They are perfect. Isn't that our goal really, to be worse than humans? [laughter] >>: Oh, we think you are really bad. >> Philipp Koehn: Yeah, yeah. >>: Like there was a misstep there where translator three was consistently horrible on every single sentence. >>: Yeah, out of four references he was consistent. >>: Did you measure the [inaudible] post-editing for different MT qualities? Like if you are [inaudible] new translator? >> Philipp Koehn: Just one system. Yeah, that's a variable and it does matter a lot. I know from one of the translation agencies that we work with: they do post-editing of MT for French and for Spanish, but they don't do it for German, because the quality of their MT engine for German is not as good. 
So the quality there is definitely better. There is a point in the quality curve of MT where it becomes good enough for post-editing, and we've reached that point for several language pairs, especially with restricted domains, so that is kind of good news for MT. There is a real use: people in the industry use post-editing of MT and they get better results and they are faster and so on. But we haven't reached that point for all language pairs, and we don't necessarily have it for all of the domains that people want to translate. >>: Do you think you would push for more MT quality and use regular post-editing, or is better interfaces [inaudible] >> Philipp Koehn: So we do both. Actually, like I said, we applied for too much funding and in this case we got too much because [inaudible] [laughter] and one project is kind of run mostly by this translation agency whose view is: all of this funny stuff you put on the screen is just going to annoy translators. They're not going to -- they don't like this. They want at most post-editing of MT, that's all. They are used to translation memories. We give them one additional option which is MT, and then maybe hide that it's MT [laughter], and that's all we are going to give them. We are not going to throw funny colored stuff on the screen. So the main focuses there are: how can you improve MT, with things like incremental training, those kinds of ideas? And maybe show some things with confidence measures, to not show the bad MT, or word-level confidence, highlighting words that are more likely wrong. >>: [inaudible] translators is going to be completely different feedback, so that's why the agencies are [inaudible]. It's post-editing versus unassisted translation. Translators tend to read the sentence and they already have an idea about how to translate it. 
It takes only two seconds to come up with something, compared [inaudible] actually not used to that exercise, doing their exercise over and over again. So if you compare that with MT, in most cases it's going to slow them down because they have to revise somebody else's translation. But for human translation the translator is a [inaudible]. >> Philipp Koehn: So there certainly is some high bar for MT to pass, otherwise it's completely useless. If 50% of the sentences are just rubbish, then you slow down the translator so much looking at all of that rubbish. So it has to be good enough that, for the sentences you show, the vast majority are helpful. Otherwise it's just a waste of time. I mean there are studies around, people trying obviously different [inaudible] pair and task, but the few ones that I know [inaudible] where people get 50% to 80% faster depending on language pair. These are MT engines that are geared towards their data: they have been trained on their data and used on just their data. So the numbers you actually get reported back from translators are maybe 20% to 50% faster. It's not like three or four times faster. >>: [inaudible] human element apart from all the technology [inaudible] there is potentially [inaudible] technology coming into what you have been doing manually forever, and some people will resist it. You have refuseniks [inaudible], so that also comes into [inaudible]: apart from how good the technology is, how you onboard the people who use it, how you make them amenable to using the technology. >>: About 10 years ago the same process happened to translators here when we started to use translation tools [inaudible] memory [inaudible] recycling tools, so many people had trouble actually getting on board with that. Now everybody more or less is on board with translating [inaudible] using those tools, and some people actually even provide suggestions. 
>>: Many of them. >>: Yes, so addressing that even in packaging [laughter] [inaudible] intentions. >> Philipp Koehn: Yeah. Translators are a very, very diverse group of people, so I don't know how much use this is. I know from the AMTA conference -- one of the keynote speeches was given because previously they were co-located with the translators' organization. The translators learned the machine translation [inaudible] >>: [inaudible] I think they were more open [inaudible] >> Philipp Koehn: Yeah, so the ones who actually go to AMTA are probably open-minded. [laughter] [multiple speakers] >> Philipp Koehn: The [inaudible] conference I've been to only once, and there the majority were translators and it was more hostile. [laughter] The most hostile audience I ever had. >>: What would the next part of the talk be? I know you won't have time to get through it. >> Philipp Koehn: Yeah, we are not going to get through it. I'm just going to quickly -- maybe that's worth going into. >>: [inaudible] monolingual translation. >>: Yeah, I was thinking [laughter] >> Philipp Koehn: So let's do the monolingual translation. The idea is basically straightforward. If the MT system produces "the girl entered into room," you can figure out what it means and you can fix it, so you don't really even need to know the source language. It's just what makes sense in the target. And here is… >>: Sometimes you, sorry, sometimes [laughter] especially statistical machine translation can produce something fluent but totally not related to the subject. >> Philipp Koehn: So accuracy is -- yeah, so the quality is different now. As a quality metric you don't expect to match human translators or [inaudible] professional bilingual translators or someone like that. Okay. Here's the task. That's a pretty picture. This is how we set that up. This is the one you can't read, and I'll show you the one you can read. So this is the Arabic sentence. 
We did it again with people in Edinburgh, students that don't know any Arabic. Squiggly symbols. It probably helped a bit to see whether these were long words or short words, but that's about it. 2008 [inaudible] figure out what that was supposed to mean. So here is the part: can you figure out the translation from that? Step one is to just do post-editing of MT. Step two is you do this, and this was, yeah, not as much better as we would've hoped. There was one story where it clearly helped people, but otherwise it was a bit of a mixed bag. >>: In this case it seems like you have a lot of context in your head where you know what's going on. >> Philipp Koehn: So this is just real-world knowledge. One big thing we found out is that it matters a lot. Oh, like that wasn't really obvious [laughter]. The more you know about something, the better you do at it. So there was one sports story about an American soccer player playing in the Copa América. If you know anything about soccer and you know this tournament and how it works and how they play, you did much, much better than someone who just, you know -- playing against Colombia, what does that mean? Is that a good thing or the expected one? >>: What is the scenario for a monolingual translator? For me that sounds like an oxymoron. >> Philipp Koehn: It is very much an oxymoron. I think there's actually a real one. I mean, there is this whole DARPA scenario of the analyst who wants to find out about some foreign text, so you can give them a machine translation or you can give them this, and maybe they figure out more about the text with this, so they actually… >>: [inaudible] publication scenario, right: if you want to translate something into a new language, then you have a bunch of speakers of that language who are not bilingual; if they could fix it up to readable quality then you can use that to reproduce… >>: But then there's also the crisis scenario too. I mean the example in Haiti where you have a triage. 
You're not going to triage in Creole, because no one who's providing aid speaks the language, so you need to do triage in another language so [inaudible] the language [inaudible] >>: Can you really get it if you don't speak the native… >> Philipp Koehn: Good question. So these are the numbers we got out of it. Yeah, I think these are the highlights. This is broken down for the different stories. This is the bilingual translator, and there was huge variety also among the different translators. I should show that too. So do I have that? This is how different people fared: some didn't do very well and some did really well. This first one did really well. This was on Arabic stories and Chinese stories. This is the bilingual translator. I think here the reference is this one really bad bilingual translator. So this was a test set: we had three bilingual translators and a reference, so we could pretend these were independently [inaudible] independently translated and you could score them. There was one bad bilingual translator, and three of the monolingual translators were as good as him. So there's a huge variety. Okay. So now, averaging over all these people -- you might notice [inaudible] this is now averaged over stories and different translators -- for this one, showing the options instead of just doing post-editing of the MT often helped, but for the other ones it didn't matter that much. There's one story there. >>: The average across… >> Philipp Koehn: Of people. These numbers here are the 10 monolingual translators, who were all the same kind of people and don't know the source language. They all translated the same stories, and these are averaged over three bilingual translators who assessed the sentences with the same metric, saying is this correct or not based on the reference [inaudible]. 
So numbers… >>: So the real question is, is there no way to bias the machine translation system in a way to assist monolingual post-editing? Because obviously right now they're doing much worse than [inaudible] >> Philipp Koehn: Yeah, so there are some obvious things that kind of jump out. Names: if they are not translated right, there's hardly any way for a human to catch that, to get that right. >>: But even still you are having the [inaudible] significance in it [inaudible] >> Philipp Koehn: That's a good argument. [inaudible] point [inaudible] [laughter]. >>: [inaudible] other direction. I'm just talking about [inaudible]. >> Philipp Koehn: Yeah, yeah, so it's not -- and it works. I can show this in the tool too if you want to play around with it [inaudible] no I can't go back, or how does this work? Anyway, it's fun to do this. We did something on this very early on in [inaudible]. I think it was about 15 years ago. Just give people the output of MT or phrase tables: can you figure out what the stories are just by knowing what makes sense and what doesn't? If you get five content words you can kind of puzzle it together. I mean, you might be completely wrong [laughter]. >>: [inaudible]. >> Philipp Koehn: You might be completely wrong. You might miss the mark. >>: No, no, I mean [inaudible]. You would focus here on the pieces among different things, so, for example, pay very close attention to [inaudible] getting right. [inaudible] on the main [inaudible] [multiple speakers] [inaudible] >>: You don't care about the what? >>: You don't care about the [inaudible] so much because the monolingual readers can do the… >>: As long as they can [inaudible] close enough and it's not a complete word salad, because otherwise [inaudible] >> Philipp Koehn: Because here there is no language model. So if you just put this together you can figure out what [inaudible]. 
The human model is much better than the n-gram language model [inaudible]. [laughter] >>: [inaudible] don't need a language model. You don't need a language model to produce this if you have [inaudible] >>: Just pay attention to the contextual [inaudible], because if you picked the wrong sense there's no way -- well, you need a lot of world knowledge to know that it's not the right [inaudible], so engage in word sense [inaudible] >>: Can you model the human's background knowledge, the common-sense knowledge? For this Arabic news article, is there technology to find a comparable U.S. English news article, have the translator read that, and now use the options? >>: That's kind of like triage in a way. >>: Because now you kind of [inaudible] >>: That's kind of like priming with related things. >>: Yeah, because now you know, you know, the kind of who does what to whom, and then you instead of… >> Philipp Koehn: Yeah, so the typical scenario is that the user knows the content. I mean, he cares about the content. Why else would they do this? >>: [inaudible] scenario, but in other scenarios -- by finding the comparable sources and having the monolingual translator read those, you can kind of control for how much one translator knows versus the other. >>: But in this case you're not really a translator. Where you are… >>: No, of course not. You're not even interested in post-editing. You want to understand: is this something like at the DARPA conference, listening to signal [inaudible] most of it is knowledge. Do I want to invest in a professional translator for this particular communication that could be interesting, or where I see something? You're not really embellishing the output. You are really… >> Philipp Koehn: I think it varies how much detail you are interested in. 
One question is, could I give this to a professional translator? If the answer is no, then the quality level doesn't have to be all that high. But if you want to find out when the bomb is going to explode, and all you have is this handwritten scribbled thing and the clock is ticking [laughter], you want to know more detail. >>: You don't really care about the final translation. All you want to know is, am I understanding the content, and this is beautiful because you can go through the content and get an idea what it tells you, but you're not going to spend your time [inaudible]. >> Philipp Koehn: Yeah, and you get some sense of -- I don't know if this is a good example -- the uncertainty of certain things. Like there was -- there was a Muslim Brotherhood story in Egypt, about the group, where the word abortion came up, so it clearly didn't mean the general understanding of what abortion is. It was kind of the abolition of the group, and you could figure it out: the government and the abortion of the group -- yeah, that means they want to ban the group. >>: [inaudible] says Hamas [inaudible] dictionary [inaudible]. >> Philipp Koehn: [inaudible] ranks these things. Yeah, it has a probabilistic way of ranking them, but it's a probabilistic dictionary on a [inaudible] level; the way we do this, it doesn't use the sentence context. You could do it in a way that uses sentence context, yeah. >>: We should break. Fill out your… >> Philipp Koehn: Yeah, we already lost part of the audience. [laughter] [applause]
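The probabilistic-dictionary ranking described at the end can be sketched minimally as follows. The romanized source words, the translation options, and the probabilities are all invented for illustration (this is not the actual tool's dictionary); as in the discussion, no sentence context is consulted, which is exactly why the misleading top option can appear.

```python
# Toy probabilistic dictionary: source word -> {translation option: probability}.
# Entries and probabilities are invented; a real one would be estimated
# from word-aligned parallel text.
dictionary = {
    "ilgha": {"abortion": 0.55, "abolition": 0.30, "cancellation": 0.15},
    "jamaa": {"group": 0.70, "community": 0.20, "assembly": 0.10},
}

def ranked_options(word, n=3):
    """Return the top-n translation options for a source word, most
    probable first.  No sentence context is used in the ranking."""
    opts = dictionary.get(word, {})
    return sorted(opts, key=opts.get, reverse=True)[:n]

print(ranked_options("ilgha"))  # -> ['abortion', 'abolition', 'cancellation']
```

This mirrors the anecdote: the top-ranked option ("abortion") can be wrong in context, but a monolingual reader who sees "abolition" among the alternatives can still recover the intended meaning, whereas plain post-editing shows only the single best guess.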