>> Michael Gamon: One of the reasons I was asked to introduce both of these people -- both of the speakers today, sorry -- is that I have actually worked very closely with both of them over the past few years. And there's also an interesting bit of history between Fei and Hany. Both of them are ex-IBMers. They both worked on machine translation for IBM, in the same group, although for much of the time, I guess, Hany was in Cairo -- is that correct? -- and Fei was in New York. There was a time when they were together in New York, and so when Hany was coming to our team, he said, do you know Fei Xia? I said, I know Fei very well. And then it was like old home week when you two got to see each other again after a number of years. So there's an interesting twist here. This was not planned; we didn't plan to have two ex-IBMers giving talks here. But it gets even more interesting, because, as you know, Hany works on the machine translation team here at MSR, and Fei's co-author, as it turns out, is also a Microsoft employee. So we have a session presented by two ex-IBMers and two current Microsoft employees, although Fei is not a Microsoft employee. I guess we'll have to work on changing that or something. Hany started, I guess, four or five years ago now, and as I mentioned, he was at IBM before. He worked on machine translation and information extraction; it was a combined group. I also happen to know Hany's advisor from Dublin City University quite well, Andy Way. He and I are good friends too, so there's that other little connection as well. Unfortunately, I don't know what year your Ph.D. was; he got a Ph.D. from DCU, so we'll just say that. It was a real pleasure to interview Hany a number of years ago. He worked remotely for a year, and it was great having him come to the team. He's exceptionally productive on the machine translation team and does a lot of really cool stuff, and it's been a pleasure working with him on some of it, like the text corrector and some other things. This is something I didn't have the pleasure to work on, but I'll turn it over to Hany. The talk today is graph-based semi-supervised learning of translation models. Hany, the floor is yours.

>> Hany Hassan Awadalla: Thank you. Thank you all for being here on this sunny day. This work is a paper that will appear in [indiscernible] this year, graph-based semi-supervised learning of translation models. The work was done with [indiscernible] last summer, in cooperation with Kristina and Chris. In this line of work, we're trying to leverage monolingual corpora. We all know from machine translation that we really need sentence-aligned corpora, we really need word-aligned corpora, to learn translation rules. In this work we are trying to challenge that assumption and measure how well we can learn translation rules for machine translation from monolingual corpora, from comparable corpora, and even from monolingual corpora that are not related at all. Why are we trying to do that? As you all know, a lot of the parallel corpora on the web now are not really parallel. Maybe they are comparable, but not parallel. The really [indiscernible] corpora are mostly machine translated already, which causes a lot of problems and a lot of noise there.
On the other hand, monolingual corpora are readily available in many domains and many areas, and are produced on a regular basis for [indiscernible]. So if I am trying to train, for example, an English system for the medical domain, I can easily find monolingual data in both domains, but not parallel data or even comparable data. And as we note from our own very large machine translation system, decoding mostly utilizes shorter phrases, not longer phrases, and the reason is that we don't have enough data to learn the longer phrases -- two, three, four or five grams. They are very sparse and we can't learn them. Even if the phrase tables in our machine translation system allow phrases of four or five words, we mostly use one or two words at most, because for those longer phrases we can't gather enough statistics to have reliable long parallel phrases. On the other hand, from monolingual data it is very easy to learn long n-grams, in a way similar to a monolingual language model. So the problem we're trying to tackle here is to handle the learning of translation rules in a way similar to how we build a monolingual language model, using [indiscernible] from the monolingual corpora.

In this work, we'll try to answer a few questions. One of them is very challenging: do we really need sentence-aligned or word-aligned corpora to learn translation rules, or not? This is a very interesting point, and the answer is no, if you are curious about the answer now, which is a little bit surprising. Second, how good would translation rules be if we learned them from monolingual data? Most of the previous work in this area tried to learn translations for out-of-vocabulary words, just to compensate for OOVs, but learning real phrase-to-phrase translations, like bigrams to trigrams or four-grams to other four-grams, is not covered in this area at all. So we try to evaluate how good those translation rules are. Third, can we leverage monolingual corpora that are not comparable at all, not related to each other, to learn translation rules? We'll try to investigate that issue as well. And the fourth one is a little bit tricky: it has been advocated in the machine translation community for a long time that we really need very large language models and that's it -- if we have large enough language models, we will get good translations without any need for longer phrases. We challenge that assumption as well, and we will see whether we really need the language model or we need longer phrases in our phrase table. The average phrase length used in our [indiscernible] production system, for example, is 1.4 words, which means we are mostly utilizing very short phrases during translation. So if we learn trigrams and four-grams that we really can use during decoding, would that help, how would the language model interact with it, or is the language model enough by itself? We cover that issue. So we'll try to answer those few questions here.

This is a very high-level overview of the work. Simply, we start with a machine translation system, a phrase-based translation system as usual, and the assumption is that we will have a very large graph. We will query this graph for a given source n-gram, like two or three words, asking for translations for phrases never seen by the [indiscernible] before.
And those translations will be fed into the runtime of the translation system to continue the usual translation process. In this talk we are focusing on how we construct this graph so we can query it with any source n-gram and get back target n-grams. The work done in the runtime of the machine translation system itself is minimal, so we are only talking about [indiscernible] processing here that produces new phrase tables which can be used in any translation system. The translation system itself is, as usual, any vanilla phrase-based decoder.

Taking a bit of a dive into how this graph is constructed: we start with a translation model, which we need from [indiscernible]; we can start with even a few hundred thousand sentences, just to have an idea of how the two languages match each other, and we start with many sentences of monolingual corpora. This monolingual corpus is fed into the system, and we decide whether we know each phrase or not. Phrase here means an n-gram; the definition of a phrase is just any word sequence, an n-gram -- no linguistic information, just an n-gram sequence. A phrase is labeled if we know a translation for it from our translation table, or unlabeled, which means we don't know any translation for it, and then we need to add it to our graph, start the graph propagation, and augment our translation models with those translations. So this is an [indiscernible] process in which we try to enrich our phrase table with new entries based on some given source text. Again, I will show how we construct this graph and query it for translations.

This is what our graph looks like. Again, there are two different graphs, a source-side graph and a target-side graph, [indiscernible] of each other, and each one is constructed from monolingual data, not really related to each other at all. The main objective is to end up with a translation distribution for each node in this graph. Each node is a phrase, a bigram or trigram for example, and the distribution is over target phrases with their probabilities. As you can see here, the dark nodes are known, meaning labeled -- we know translations for them -- and the white ones are unknown, and we don't know any translation for them. The main objective of the process is to propagate information through the graph so those nodes end up with distributions based on their neighbors, and these would be added to our translation table. Again, these are two different graphs, one constructed on the source side, one on the target side. But as you can imagine, the definition of phrases here is very loose, so this graph can be very, very large if we consider all possible n-grams of a large monolingual corpus. So we have some restrictions on how we construct this graph, with some tricks to make sure it is still tractable even with high computational resources. We start constructing the graph from the phrase pairs that we have, so we have some target nodes, phrases on the target side, and some source nodes on the source side, and those pairs are the only link between the source graph and the target graph. The source graph itself is constructed from the source-side monolingual corpus, and the target graph likewise. If we restricted ourselves to only those kinds of nodes in the graph, it would mean that whenever we need to translate a phrase that was never seen in training before,
we would have only candidates from our phrase table, which is very limiting, because it means we could translate a source phrase to a similar target phrase [indiscernible], but we couldn't translate to new phrases that were never seen on the target side. Again, if we are trying to translate a phrase that we never saw in our data before, we need to translate it to some target phrase in the graph, and a graph constructed only from the parallel corpus is very limited; it doesn't have enough variety to provide us with candidate translations. So we need to expand the graph further to have more candidates for those target translations. The candidates can come from any initial source we can provide, like bootstrapping from the same baseline system to get rough translations for those phrases and adding them to the graph as possibilities, or generating morphological variants of those phrases on the source side to get similar phrases on the target side, and in this way we enrich the graph to have many, many nodes that can be translations of that source phrase. Again, without these restrictions we would end up with a very large graph on both sides that is not computationally feasible, so we apply these restrictions just to make sure we have a limited space on the graph that we can propagate over. Any questions? Okay.

So that is how we generate the candidates for our translations. We have different sources for generating the candidates. First, assume we're trying to translate a phrase that was never seen, but it is similar to another source phrase that we do know -- I will show later how this similarity is constructed -- so every source phrase is similar to other source phrases, and a possible translation for it would be the translations of those similar phrases. That's one possibility. The other possibility is target phrases that are similar to those target phrases, so we keep expanding the graph to add similar candidates on the target side as well. A third possibility is morphological similarity: we generate, at the stem level, similar candidates for the target, and we keep expanding the graph to have a lot of [indiscernible]. We keep the top n-best candidates for the source nodes, and from here the main objective, again, is to propagate the probabilities and distributions to all the unknown nodes.

With what I have described so far, we only have the topology of the graph. We know what the nodes are and how they are connected, but we haven't defined that connection yet, we haven't defined the edge weights, and this is a very open issue. The second thing is that we need to do the propagation once the edges are defined. So first, graph construction: how are the edges constructed? This is a very open question, because there are a lot of possibilities for how two phrases can be similar to each other. For example, you can have distributional similarity, you can have vector-space models, and you can have a lot of variety that can drive a similarity between two phrases -- a very open research problem and a lot of work to be done there, actually. Even in this work we are just touching the surface: the edges use very simple [indiscernible] information based on the context of the phrase. So for each phrase, we collect statistics from the monolingual corpus.
We look at the words to the right and the words to the left, a window on each side, and we construct a context vector for each of those nodes, and the pairwise similarity is the [indiscernible] similarity between those vectors. We keep 500 neighbors for each node of the graph, so it is a very dense graph. But the edge similarity here is defined just by pointwise mutual information, which is very limited knowledge for now. It could also include [indiscernible] information or semantic similarity; it could even have, say, a syntactic signature on the gaps or the borders to derive some similarity. So there is a lot of work to be done on deriving the similarity between two phrases. Again, the simple solution here, for the work we are presenting, is just pointwise mutual information.

Second, since we now have the edges, each with a weight, which is a combined similarity, we can propagate -- propagate the knowledge over the graph so the nodes get distributions. It means we need to propagate distributions onto each source node from all its connections, to end up with a distribution like this for every node. Then, querying that graph later, we can get phrase translations for any given phrase. The propagation problem is the easier part, actually; once you have a good dense graph, it is easy to do propagation. We have two options for propagation here. The simple one is label propagation, as usual: the probability of a target phrase being a translation of the source phrase at iteration T plus 1 comes from iteration T, and it is just a weighted sum over the neighbors of that node. This plain label propagation approach has a limitation, because it only depends on one side of the graph. As you recall, our graph is a little bit complicated: we have two sides, a source side and a target side. Label propagation here only accounts for the neighbors on the source side, not the target side. For that reason, we prefer structured label propagation. Structured label propagation is the same as label propagation, but as you can see, it has another term that propagates [indiscernible] for the target similarity. So we are propagating on the same graph, but we use similarity on the source side and similarity on the target side in the same propagation [indiscernible]. Whenever we propagate a translation, we are now making sure that the sources are similar to each other and the targets are similar to each other.

Structured propagation is very efficient; it takes only two iterations on the graph. The graph is composed of around 600 million nodes, and it takes only 15 minutes to propagate. The computational cost is in constructing the graph, because again we are using pointwise mutual information to construct the graph with 500 neighbors per node, so there is a great deal to enumerate in that computation. Nowadays, with large clusters, it is easy to [indiscernible] 500 machines to do that [indiscernible], but it is still computationally expensive. It is not cheap, as with any graph-based technique. So now, returning to the graph: we end up with all those nodes having distributions, which are translation probabilities, and we collect all of those to construct a new phrase table. And with that, we can use it during translation.
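To make the construction and propagation steps above concrete, the following is a minimal Python sketch, not the actual system: context vectors are collected from monolingual co-occurrence counts and weighted with pointwise mutual information, cosine similarity picks the nearest neighbors for each phrase, and a simplified structured label propagation update combines source-side and target-side neighborhoods. All function and variable names are illustrative assumptions, and the brute-force neighbor search stands in for the distributed computation mentioned in the talk.

    import math
    from collections import defaultdict

    def pmi_context_vector(phrase, cooc_counts, phrase_counts, context_counts, total):
        """Weight each context word seen around `phrase` by pointwise mutual information."""
        vec = {}
        for ctx, c in cooc_counts[phrase].items():
            pmi = math.log((c * total) / (phrase_counts[phrase] * context_counts[ctx]))
            if pmi > 0:
                vec[ctx] = pmi
        return vec

    def cosine(u, v):
        num = sum(w * v.get(k, 0.0) for k, w in u.items())
        den = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
        return num / den if den else 0.0

    def top_k_neighbors(vectors, k=500):
        """Brute-force k-nearest-neighbor graph; the real system distributes this step."""
        neighbors = {}
        for p, vp in vectors.items():
            sims = sorted(((q, cosine(vp, vq)) for q, vq in vectors.items() if q != p),
                          key=lambda x: x[1], reverse=True)
            neighbors[p] = sims[:k]
        return neighbors

    def structured_label_propagation(labels, src_nbrs, tgt_nbrs, iterations=2):
        """labels[f] is {target phrase e: P(e|f)} for source phrases already in the phrase table.
        Each update re-estimates P(e|f) for an unlabeled f from its source neighbors f2 and from
        the target-side neighbors of their translations, so a candidate is reinforced only when
        both sides of the graph agree."""
        dist = {f: dict(d) for f, d in labels.items()}
        for _ in range(iterations):
            new_dist = {}
            for f, f_neighbors in src_nbrs.items():
                if f in labels:
                    new_dist[f] = labels[f]          # keep phrase-table entries fixed
                    continue
                scores = defaultdict(float)
                for f2, w_src in f_neighbors:
                    for e2, p in dist.get(f2, {}).items():
                        scores[e2] += w_src * p      # mass from the neighbor's translation
                        for e, w_tgt in tgt_nbrs.get(e2, []):
                            scores[e] += w_src * w_tgt * p   # and from its target-side neighbors
                z = sum(scores.values())
                if z > 0:
                    new_dist[f] = {e: s / z for e, s in scores.items()}
            dist.update(new_dist)
        return dist

The two-iteration default mirrors the number of propagation passes mentioned in the talk; everything else, including the normalization, is a simplification.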
For our evaluation, we used two systems, Arabic-to-English and Urdu-to-English. Arabic-to-English is not a [indiscernible] language pair, but we use it because it has a lot of corpora in real comparable scenarios, so we can measure how well we do when we have comparable corpora versus non-comparable corpora. Urdu-to-English is a very low-resource language pair, and there are few parallel corpora out there for it, so we examine more scenarios there. That is the Arabic-English parallel training data, and that is the Urdu-English, which is [indiscernible] evaluation data. The comparable corpora are from the [indiscernible] corpora in Arabic and English, and those are really comparable data. It means the [indiscernible] corpora [indiscernible] are really similar to each other, and this lets us evaluate what we can get from similar data from the same sources. As you can see in the example, one side is longer, but they are talking about the same events, the same content. For Urdu-English, this is noisy parallel data. That means it is [indiscernible]; it can have machine-translated content, it can have very noisy content, but it still gives us an idea of how we can utilize such data [indiscernible] as well. And [indiscernible] here we are using Urdu and English monolingual corpora, not related to each other at all, just the [indiscernible] monolingual corpora used to build the language models, to evaluate the [indiscernible] on that data as well.

Now, some analysis of why we are not tackling the OOV issue, as most of the previous work using [indiscernible] corpora did. It is really not an OOV issue. As we can see from these statistics -- this is from the [indiscernible] test set, the number of sentences here, and here is the number of bigrams in those sets -- an unknown bigram means I have never seen it in my phrase table before. This means almost 56 percent of the bigrams in this set were never seen in the phrase table, or not seen in my [indiscernible] corpus at all. So for more than 56 percent, we resort to word-by-word translation.

>>: Which language pair is this?

>> Hany Hassan Awadalla: This is Arabic-English. It means that for that percentage we resort to word-by-word translation and our reordering model, depending on our language model to put the pieces back in order. So we are tackling the bigram issue here, trying to compensate for that. If we break this down by whether the words in those bigrams are known or unknown -- unknown means out of vocabulary -- then known-known means both words are in vocabulary, and in that case we resort to word-by-word translation; known-unknown means one of them is known and the other one is really out of vocabulary; and both-unknown is very [indiscernible], only [indiscernible]. So our real focus is on those bigrams whose words are known but were never observed together in our [indiscernible] corpus; if we can acquire translations for them, it can be more [indiscernible]. That was the assumption we wanted to test: should we attack this problem, and should we have bigrams and trigrams to compensate for it?
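As a rough illustration of how a coverage statistic like the one just described can be computed, here is a small Python sketch that counts how many test-set bigram occurrences are missing from a phrase table and splits the unknown ones by whether their individual words are in the training vocabulary. The data formats and names are assumptions for the sake of the example, not the actual evaluation code.

    from collections import Counter

    def bigram_coverage(test_sentences, phrase_table_sources, vocabulary):
        """test_sentences: tokenized source sentences; phrase_table_sources: set of
        source phrases in the phrase table; vocabulary: source words seen in training."""
        seen = unseen = 0
        breakdown = Counter()                     # classify unseen bigrams by word coverage
        for sentence in test_sentences:
            for w1, w2 in zip(sentence, sentence[1:]):
                if w1 + " " + w2 in phrase_table_sources:
                    seen += 1
                else:
                    unseen += 1
                    known = (w1 in vocabulary) + (w2 in vocabulary)
                    breakdown[("both-OOV", "one-known", "both-known")[known]] += 1
        total = seen + unseen
        return {"unseen_bigram_rate": unseen / total if total else 0.0,
                "breakdown": dict(breakdown)}

The "both-known" bucket corresponds to the case the talk focuses on: bigrams whose words are individually covered but which never occurred together in the parallel data.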
For the Arabic-English results, this is the baseline on this data. The dev set is MT06 and the test set is MT08, and the language model is built on about 50 million sentences, I think -- a small language model. Here is the baseline, and that is when we use graph propagation with unigrams only. That case means [indiscernible], and you can see it has almost no effect on the test set and a very small effect on the dev set. We really want to handle the bigram case, and that row is label propagation, where we take into account the source similarity but not the target similarity. It helps a little bit, almost one [indiscernible] point or more here, and a [indiscernible] point there. And when we add structured label propagation, where we take both the source and the target into consideration as propagation criteria, we get much more improvement on [indiscernible].

There is a subtle issue here, though: since we are adding bigrams as one piece, could this just be compensating for the effect of the language model? If we have a large enough language model, it could cover those bigrams, and here we have a small language model. So what if we have a very large language model -- would that marginalize the effect, or would the improvement still add on top? So we switched to our very large production language model, which is built on about nine billion tokens, and here is the baseline with the large language model; it jumps about two points from here. Then we repeat the same setup, which is this system here, and you can see the improvement is even better than with the smaller system. That means that even with a very large language model built on nine billion tokens -- our actual production language model, a very, very huge system -- we still see the same improvement. So the gain really comes from having the phrase-to-phrase translations, not from a language model [indiscernible]. If we had not seen the same gain, we would not have been sure about the effect of adding those bigrams, whether it was just sticking pieces together and compensating for the language model or not. But as we can see here, it is really not a language model effect on the translation.

Here we evaluated the comparable case, because the data used to build the graph in the Arabic approach was the [indiscernible] comparable corpus, which is [indiscernible] from similar data, so we also need to evaluate the other extreme, when the data is not so [indiscernible]. And we did that with Urdu, actually. This is the [indiscernible] baseline using the NIST data; there is no graph-based approach here, and that is the dev and test [indiscernible]. Then here is another baseline, adding the noisy parallel corpus crawled from the web, which gives a lot of improvement. Those are 160 [indiscernible] and those are 450 [indiscernible] sentences, so it is quite a [indiscernible] amount of data, and as you can see, even noisy data crawled from the web can give a good jump. Then we tried to evaluate the effect of using that noisy parallel data as a comparable corpus. We remove the sentence alignment, take [indiscernible] chunks of the comparable corpus, feed them into our graph construction system, construct the graph with the source side from the noisy data, and measure how well we can learn [indiscernible]. So that is the baseline, and that is with the translation rules added from handling this data as a comparable corpus.
If we had used it as a parallel corpus -- because it is [indiscernible] already -- we would get this much. Handling it as a comparable corpus, we get that much, which is very close to what we get from the parallel data. And this is a very important observation, because it means we can learn new [indiscernible] from this data without even word alignment. We just extract n-grams from the data, construct the graph using both sides, and get translation rules to add to the system. As you can see, we lose a little bit, but the prize is that this is not parallel data, just similar data, which I find one of the most important findings of this work, actually: we can be relieved from needing truly parallel data and extract translations from non-parallel data. That is the [indiscernible] corner case, when we just have monolingual corpora -- the ones we were using for the Urdu language model and for the English language model. There is no similarity between them; they are very different, nothing shared at all. And you still get an improvement, this one compared to that one. We still get one [indiscernible] to almost [indiscernible] between that one and that one, which means we can even learn from monolingual corpora that are not related at all. I find the Urdu-English case much more instructive than the Arabic one, because it is a really low-resource language that we don't have much data for. That is the only publicly available data for the language pair, and that is what we have, though we have a lot of [indiscernible]; that is all [indiscernible] you can get from the web for that language pair. So such a small language pair can benefit from these approaches a lot.

I may be running out of time, so I have a couple of slides left. These are a few examples from Arabic-English. Here, this is the bigram we are trying to translate, and it is not in the phrase table. The reference translation is a bigram, but there is a [indiscernible] difference here. Our baseline just drops it. It should be the U.S. [indiscernible] of the U.S. president here, and we dropped it, because that is very usual in phrase-based translation: when you have an [indiscernible], the system prefers to be brief and drop unknowns, and the language [indiscernible]. And this is our system, where we get "presidential envoy". As you can see, the [indiscernible] may not show any difference at all, because it is "presidential", not "president", and so on, but this is much better. That is a good example, and this is a bad example. In the bad example, we have "this guy" -- there is some name -- "said" blah, blah, blah. The baseline is very brief: "he said", because we don't know him, which is fine. And then our system proposed another name, a very similar name. This is the drawback of the pointwise mutual information similarity: both names appear in the [indiscernible] in very similar ways, and we don't have any better features to model that here. So Abdullah and Abdalmahmood and Mike would all appear in the same contexts, right. It is the problem of the very simple features [indiscernible] here. For that, we are moving away from pointwise mutual information similarity to more sophisticated features to get better [indiscernible]. For Urdu-English, that is the bigram here, [indiscernible]. It should be "I am hopeful". The baseline gives "this hope", and we propose "I am hopeful", which is much better. Again, these are a variety of other examples, and where each translation came from.
Here is the Arabic for "sending reinforcements". The baseline gives "strong reinforcements" and we get "sending reinforcements". It means it is one of the neighbors: it is already on the graph, a neighbor of one of the [indiscernible], and we can pick it as one of the translations. This one was an OOV in the baseline, but we can get it; it is one of the neighbors again. And this one was "address"; the baseline gets a similar translation, a different tense, but both systems are okay. Here is a very good one as well; both are good, the baseline is not bad, but ours is better. "G" means generated: it came from the morphological variant generation or from the baseline rough translation, but it was not one of the neighbors; we generated it based on some similarity. These are Urdu examples. "To defend him" is clearly morphologically generated, and we get "to defend himself", which is much better. "While speaking", "in the". So we get a lot of these improvements that are really n-gram improvements, not OOV handling, even though OOVs [indiscernible] too. We get a lot of new phrases for which we really propose new translations.

So we hope this promotes a new direction for machine translation, where we are less [indiscernible] on getting more parallel data and focus more on revisiting graph-based techniques, which are very efficient but, due to their computational [indiscernible], did not get into the game until later. Now, with cheap computational resources everywhere, I think it is time to revisit such approaches. As you can see, once you have constructed a good graph with [indiscernible], more work on how we construct the graph is the important thing. As for future directions, we are actually switching to heterogeneous features for the phrase pairs. We can have syntactic similarity, for example [indiscernible]. We can have semantic similarities. We can have a continuous vector representation for the phrase, and then we are relieved from that pointwise mutual information computation. Each phrase is just a vector, and this gives us a new dimension for how we construct our graph; the features in that vector can be whatever you want to design. The bottleneck is that you now need to learn the similarity in a better way, and this is still a future direction that we will try to tackle in this [indiscernible]. I think I'm done. If you have any questions [indiscernible]. Thank you.

>>: So I'm lucky that I actually know how to speak [indiscernible]. I noticed that you did very well on Urdu morphology, where -- so [indiscernible] means hopeful. Just [indiscernible] means hope, actually. So it's --

>> Hany Hassan Awadalla: What is [indiscernible]?

>>: [Indiscernible] is the morphology that changes it from hope to hopeful. The morphing.

>> Hany Hassan Awadalla: But it's two words, right?

>>: Yeah, [indiscernible] itself can exist on its own as a root morpheme, but it always has to attach to another word. I've been impressed that you were able to go deeper into the morphology, and again on the next slide, the table, [indiscernible], the last line, which is [indiscernible] conversation, and "key", which is about morphing on top of that. The "in the" is just the "key" part. So I can see where you got the improvement: you managed to find the morpheme inside, translate, and then get a context-sensitive translation.
But I wonder about [indiscernible] Arabic, where proper nouns and common nouns occur together, because people's names have common nouns in them. We know this [indiscernible] means slave, but we also use it in names -- Abdullah, all of those. I wonder if your struggles are occurring because names have morphemes that have common-noun meanings as well, and maybe therein lies some of the trouble.

>> Hany Hassan Awadalla: This is a very good question. The morphological candidates are generated using stem-based translation. We don't have stemmers for either language, but Kristina has an approach to learn the stems based on word alignment. From those alignments we learn the similar stems, and then we generate stem-based translations, which are very, very rough translations. They can be good, they can be bad, but they give you an idea; given the context, we may make it or not. That can work very well with linguistic phenomena like prefixes and suffixes. For people's [indiscernible], this is part of a name, and it is likely to collide with the actual word, because the actual word is less frequent in the data than the person's name. But for the Urdu case, where you have actual prefixes and suffixes, I think using word alignment it will be learned much more easily. Maybe that is the reason for the difference.

>>: You have a lot of good progress, even with this. I'm interested in what will come next and excited about what will come next.

>> Hany Hassan Awadalla: We're hopeful that we can learn those features in that representation and we will get [indiscernible] of having [indiscernible], because they can still be similar to each other. So we hope that in the Urdu case the features will learn that this piece is really related to that part, not to that other part, and so on. Now we are doing it in a more [indiscernible] way, but we hope we can learn the features in a more systematic way.

>>: I'll use it. If it ever shows up in Translator, I'll use it.

>> Hany Hassan Awadalla: You should use it now. This work is still early research. It takes a lot of computational resources to compute that table and that extension. So even for Bing Translator, we would have to generate it [indiscernible] -- for example, take a lot of data, generate new tables, and make them available [indiscernible]. It wouldn't be on the fly for now; until we have cheaper cluster [indiscernible] resources it can't be on the fly.

>>: Maybe next year.

>> Hany Hassan Awadalla: Maybe.

>>: In both of these cases, you go into English, which is morphologically poor. How do you predict your results would be going into a more [indiscernible] language?

>> Hany Hassan Awadalla: Good question. We didn't try anything in the other direction. We should try that; it is interesting, but we didn't try it.

>>: There would be fewer cases where this would trigger, right? Because the source is simpler and the vocabulary is smaller, so you'll have fewer unseen bigrams.

>> Hany Hassan Awadalla: You would not have the issue of morphological variants; it would not be that important because you are going the other way. But even for English to Arabic, even with very large systems, we have the problem of not being able to generate phrases that are highly morphologically inflected. Like when you say in Arabic [indiscernible], for example, that's [indiscernible]. We can't generate that inflected Arabic form, and we resort to simpler forms. So if you can catch that as one-to-many or many-to-one [indiscernible], but [indiscernible] it's an interesting direction.
>>: It seems also that it's not just that you're generating more [indiscernible] for your phrase table; maybe the previous page has a couple of examples where your reordering changes. So it's not just that you've got drop-in replacements for phrases -- there's a much bigger set of interactions happening. Are you explicitly doing anything about that, or are you --

>> Hany Hassan Awadalla: [Indiscernible] hierarchical reordering, a hierarchical reordering model, but --

>>: [Indiscernible] because, I mean, right now the graph propagation gives you what you extract your phrase tables from. So how are you generating your reordering model?

>> Hany Hassan Awadalla: The reordering model was trained on the small [indiscernible] data.

>>: Okay. So the reordering model is basically just given a subtly different set of inputs, and it's doing what it would have done anyway if it had had --

>> Hany Hassan Awadalla: Yeah, but since we have new phrases, we retune our parameters based on the new phrases. In this case, some of those parameters will be different from the others. So you change the parameters, and that can change [indiscernible], but it's not much different in terms of reordering.

>>: It's not too much different, but it is interesting to see that what you basically planned as small drop-in replacements for, like, bigrams are propagating effects that are quite substantial.

>> Hany Hassan Awadalla: But they cannot be separated. The [indiscernible], for example, and the phrases -- all of those can be affected when you introduce similar phrases.

>>: Yeah, but such a small trigger is generating this.

>> Hany Hassan Awadalla: It's interesting. Thank you.

>> Michael Gamon: We're going to move on to the next talk. This is Fei Xia. I guess most of you know Fei. She was one of the founding faculty of the computational linguistics program at UW, starting in 2005, and we've actually known each other for years. If you happen to know Fei, one thing you know about her is that she's tireless, she's dedicated, and maybe relentless. These are all good qualities, actually. If you've ever worked on a paper with Fei, or if you're a student of hers, you know exactly what I'm talking about. Anyway, I talked about the IBM connection already, so I'll stop with that. She's currently an associate professor in the linguistics department at UW, so I'll hand it off to Fei at this point. I was going to say more, but --

>> Fei Xia: Thank you very much for the introduction. I want to say a few words about my co-author. Yan was still at the City University of Hong Kong when he came to UW as a visiting student a few years ago, and we worked on this together. After he left, we continued the collaboration, and after he got his Ph.D., he went to Beijing and joined Microsoft there. So that's the connection. Today I'm going to talk about domain adaptation, and I think we actually have several experts on domain adaptation in this room. The point is that when you train your system on labeled data and then test it on a different domain, the results can be really bad. The goal of domain adaptation is to bridge that gap, and there has been a lot of work on this, which I'm going to cover quickly. In this talk, I'm going to discuss a few recent studies we did on domain adaptation. I'll first talk about related work; then we tried three different approaches, and I'm going to go over each one.
For each one, I'm going to show you some results. For related work, one way to organize it, because there is so much work, is to look at the assumptions. For domain adaptation, you always assume you have a large amount of labeled data in a source domain, and the target domain is where the test data comes from. You can either have no labeled data in the target domain, which is called the semi-supervised setting, or you can have a small amount of labeled data, which is called the supervised setting. And then you might have additional unlabeled data, either in the source domain or in the target domain.

As for the main approaches, if you look at the supervised setting, here I just list some of them; it's not supposed to be complete. For example, in model combination you build a model for each domain and then give each model a different weight, depending on the similarity between the source domain and the target domain. Or you can give weights to instances rather than models. Or you can use so-called feature augmentation, where you make copies of each feature: basically, for each feature, you make three copies, one for the source domain, one for the target domain, and one for the general domain. And the last one here is structural correspondence learning; the idea is that we have different domains and we want to learn the correspondence between features in those domains. There are more than these. For the semi-supervised setting, we start with self-training and co-training; that's pretty standard. Then there has been a lot of work on training data selection -- actually, two of the authors are here in the room. The idea is that you select a subset of the training data that looks closer to the target domain and then train the model only on this reduced training set. And then we did something on the last one, where we actually use unlabeled data: the idea is that you run something on the unlabeled data and add additional features, and I will give you more information about this soon.

Now let's look at the first study we did, starting with two existing methods: training data selection and feature augmentation. There are pros and cons. For training data selection, the pro is that because you only use a subset of the data, you can speed up training, and this is really important for some applications, for example machine translation, where you have millions and millions of sentences. As for performance, in fact, when you use a subset of the training data, you can actually get better performance than using the whole dataset. The con is that for certain applications you might not have a huge amount of training data; then, when you select a subset, the unselected part is simply ignored, so for some applications you can actually get lower performance, and I'll show you an example. The second approach is feature augmentation. The pro is that it's very easy to implement -- you just make three copies of everything -- and normally you get some improvement. The cons are that it requires labeled data in the target domain, so it is a supervised setting, and for certain applications you don't have labeled data in the target domain; also, because you duplicate features, you basically triple the number of features, and that can cause a problem. So how do we address these limitations? It's actually a very simple idea.
The idea here is to combine both. You use training data selection to look at your source-domain data and divide it into two subsets: one you select, one you don't. The selected part looks closer to the target domain; that's really the point of the selection in the first place. Then you apply standard feature augmentation to those two pseudo-domains. The advantage is that, unlike plain training data selection, you still use the whole dataset, and unlike plain feature augmentation, you can pretend you have labeled data from the target domain. It's a very simple idea, but we actually show some improvement.
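Here is a minimal sketch, under simple assumptions, of the combination just described: a similarity score (such as the entropy- or coverage-based scores used in the experiments below) splits the source-domain examples into a selected pseudo-target part and an unselected pseudo-source part, and then every feature is copied into a shared version plus a pseudo-domain-specific version, in the spirit of the feature augmentation method mentioned earlier. The function names, the (sentence, labels) format, and the 20 percent cutoff are illustrative, not the exact setup from the study.

    def split_by_selection(examples, similarity_to_target, select_ratio=0.2):
        """examples: (sentence, labels) pairs from the source domain, ranked by a
        similarity score to the target domain and split into two pseudo-domains."""
        ranked = sorted(examples, key=lambda ex: similarity_to_target(ex[0]), reverse=True)
        cut = int(len(ranked) * select_ratio)
        return ranked[:cut], ranked[cut:]          # pseudo-target, pseudo-source

    def augment_features(features, domain):
        """Feature augmentation: each feature gets a shared 'general' copy and a
        domain-specific copy, so the learner can keep shared and specific weights."""
        out = {}
        for name, value in features.items():
            out["general::" + name] = value
            out[domain + "::" + name] = value
        return out

    def build_training_set(examples, extract_features, similarity_to_target):
        """Selection defines the pseudo-domains; augmentation ties them back together
        in a single feature space for one CRF-style learner."""
        pseudo_target, pseudo_source = split_by_selection(examples, similarity_to_target)
        data = []
        for sentence, labels in pseudo_target:
            data.append((augment_features(extract_features(sentence), "pseudo-target"), labels))
        for sentence, labels in pseudo_source:
            data.append((augment_features(extract_features(sentence), "pseudo-source"), labels))
        return data

Nothing is thrown away here: the unselected sentences still contribute through the shared feature copies, which is exactly the limitation of plain selection that the combination is meant to address.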
In this case, we look at two tasks, Chinese word segmentation and POS tagging, and we use a standard baseline: CRF taggers. When you use training data selection, you always have to decide what similarity function to use, and here we tried two; you can try different ones. One is based on entropy, and one is based on coverage. What I mean by coverage is that for word segmentation one major issue is OOV, out-of-vocabulary words, so when we select sentences from the source domain, we want to cover as many [indiscernible]-grams in the test data as possible; that's what it means. I'm not going to go into the details; I'll just show you the results. We tried this on the Chinese Treebank 7.0. It has more than one million words, and the interesting part about this [indiscernible] is that it has five different genres: broadcast conversation, broadcast news, and so on. So we choose one genre as our test data, and the training data comes from the other four. To show you the results: this part uses only training data selection. The baseline is when you use the whole training dataset, so that's this line, and the X axis is the percentage of training data you use. You can imagine that when you use ten percent, the performance is actually worse. The bottom line here is when you use random selection to select ten percent; the AEG line is where we use entropy-based selection, and the blue line is where we use coverage-based selection. The point is that both selection methods do better than random selection, which is not surprising. But when you use less than, say, 60 percent of the data, the performance is actually not better than the baseline, because for this dataset we don't have a huge amount of training data, so throwing away data actually hurts performance. If you go up a little bit more, training data selection can get you something better than using the whole dataset. So that's training data selection only. Now suppose you use both: training data selection and then feature augmentation. This is for POS tagging, so that's the baseline, which uses 100 percent of the data, and the X axis once again is the percentage of data you select. But now when you select ten percent, it is ten percent versus 90 percent: you pretend the other 90 percent comes from the source domain. So you still use the whole 100 percent, but you treat the two parts as different domains.

In that sense it's like instance weighting or something like that, except that you do the training data selection first, divide the data into two sets, and then use feature augmentation. You can see you get a nice improvement. The next one is word segmentation. This is interesting in the sense that, once again, this is the baseline, meaning you use 100 percent of the data from the source domain, and this one uses feature augmentation. This line is when you duplicate every feature; when you do that, the performance is actually not much better than the baseline, because you just have too many features. But this line is when you duplicate only certain kinds of features, for example features that are not [indiscernible], so you only increase the number of features a little bit, not too much. That way you don't get hurt by the explosion in the number of features, and you can see you get some improvement. And this is what we wanted to see initially: when you select, say, 10 or 20 percent, you get a big improvement, and when you select more, say 60 percent, it's not much better than the baseline, because the selected part is no longer that different from the rest of the training data. So that's the first experiment.

Now I'm going to move on to the second one, where we try something different. The setting is again semi-supervised, meaning you don't have labeled data in the target domain, the same as before, but the idea is: what if you have an unsupervised method? Sometimes an unsupervised method can complement the supervised one. The way we use it is basically to run unsupervised learning and get some results -- the results will not be reliable -- but then treat those decisions as features and use your source-domain data, which is labeled, to learn how useful those features are. In this study we also look at word segmentation, and for the unsupervised learning method we use DLG, description length gain. I'll give a very quick introduction. The method itself is actually not that important, in the sense that you can replace it with any unsupervised method you want. But just to give you some idea of what's going on: for word segmentation, as I mentioned, the main problem is really OOV -- you have all those unknown words in your test data -- and one way to measure a candidate string is this. Suppose X is a corpus, and you want to represent it with some description. DL means description length, and it is really N, the size of the corpus, multiplied by the entropy of the corpus. What it means is that you take the vocabulary of the corpus, calculate the entropy, and that gives your description length. The DLG is then a function of a string S in your corpus: you take the original description length of the corpus and ask what happens if I replace this string with an ID.
You say, okay, I treat this string as a word in my dictionary, and now instead of the whole string, I only need the ID to represent it. So you get a new corpus by replacing the string with its ID. You still need to add the string to your dictionary, so there is an additional cost when you add something to your vocabulary, but [indiscernible], because now the description length will be shorter. If you don't get this part, it's okay, but the intuition is that the longer the string is and the more frequent it is in your corpus, the higher its DLG score will be. So you can imagine that if you want to do unsupervised word segmentation, and U is the string I'm trying to segment, I can try [indiscernible] ways to segment it; for each segmentation, I look at the strings in that segmentation, calculate their scores, possibly with some weighting, and eventually choose the segmentation with the highest score. So you basically prefer long strings, you prefer frequent strings, and you treat those as words. The problem is that sometimes you have a phrase like, for example, Hong Kong -- you can argue whether that's one word or two words -- but sometimes you have a correlation like "could you please". Those are not a word per se, but they are very frequent and very long, so if you use this method you treat them as one word instead of multiple words. So unsupervised learning by itself does not work that well, but it is very good at finding the kinds of new words you have never seen before.

So what we do next is use that finding for supervised learning. Our baseline system is just a word segmenter. We treat this as a sequence labeling problem: for each character, you decide whether you want to add a break or not. Here, B1 means the first character in a word, B2 means the second character in the word, and so on, so you can treat this like a POS tagging problem, and we use a standard feature set. There is nothing strange about this. Now, when we add the DLG feature, what we do is calculate the DLG scores. You take the training data and the test data; for the training data you ignore the word segmentation information and treat it as unlabeled. You take the union, and for this whole dataset -- we want to keep the strings short, so there is a length constraint -- for every n-gram you calculate the DLG score. Then, for each character in the training data or the test data, you form a new feature vector: it is the same feature vector as before, but now you add some additional features. Those features basically represent what the decision would be if you used unsupervised word segmentation for this character, and then you learn how useful those decisions are.
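Here is a rough Python sketch of the DLG computation just described, assuming the corpus is given as a flat list of tokens (characters, for segmentation). It follows the definitions in the talk -- DL(X) = |X| * H(X), and the gain of a string is the description length saved by replacing its occurrences with a new symbol while storing the string once in the dictionary -- but the bookkeeping in the published DLG formulation differs in details, so treat this as an approximation.

    import math
    from collections import Counter

    def description_length(tokens):
        """DL(X) = |X| * H(X), with H the entropy of the token distribution."""
        counts, n = Counter(tokens), len(tokens)
        return -sum(c * math.log2(c / n) for c in counts.values())

    def replace_ngram(tokens, ngram, symbol):
        """Replace every non-overlapping occurrence of `ngram` with a single new symbol."""
        out, i, k = [], 0, len(ngram)
        while i < len(tokens):
            if tuple(tokens[i:i + k]) == ngram:
                out.append(symbol)
                i += k
            else:
                out.append(tokens[i])
                i += 1
        return out

    def dlg(tokens, ngram):
        """Description length gain of treating `ngram` as one unit: long, frequent
        strings give large positive gains, rare ones give negative gains."""
        symbol = "<ID:" + "".join(ngram) + ">"
        reduced = replace_ngram(tokens, ngram, symbol)
        new_corpus = reduced + list(ngram)         # the string itself is stored once
        return description_length(tokens) - description_length(new_corpus)

On a toy corpus where the same trigram repeats many times, dlg returns a clearly positive value, while a trigram that occurs only once typically comes out negative, which matches the intuition that the score prefers long, frequent strings.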
To walk through an example: suppose the sentence is C1, C2, C3, C4, C5 -- five characters -- and we want to form the feature vector for C3. What you do is look at all the n-grams that contain C3, and you collect their DLG scores, not from this sentence but from the whole corpus. You take the DLG score, take the log, take the floor, and get an integer. The last column here is the tag C3 would get in that n-gram: if the word were just C3, the tag would be S, meaning a single-character word; if the string were C2 C3, the tag would be E, meaning the last character of the word, and so on. You collect this from your training and test data. Now suppose C3 belongs to a word of length two. Then there are two possibilities, either C2 C3 or C3 C4; I look at those scores and choose the highest one. Say the highest score is 2 and the tag is E. Then I'm saying: if C3 belongs to a word of length two, the decision based on unsupervised learning is that its tag would be E, with strength two, and I create a new feature from that. You can try different ways to create the features, but the point is that you add these additional features on top of your existing features for supervised learning.

Now, to show you the results: in this case the tested genre is BC, broadcast conversation, and the training genre is either BC or something else. If it's BC, this is training in the same domain; the baseline is 93.9, and when you add the DLG features, although it's the same domain, you still get an improvement. Of the other four genres, web is closest to BC when we do this domain adaptation, so the baseline is this number, and if you add DLG you get some improvement. NW is newswire, which is most different from broadcast conversation, so you can see the baseline is much lower, but when you add DLG you actually get a bigger improvement. That means when the domains are more different, you get a bigger boost from adding the unsupervised learning results. You can also do joint segmentation and POS tagging -- I'll go through this part quickly -- and when you do that, you can also add DLG on top of it. This slide is just to let us compare with existing work; those are some previous results, and this work was published in 2012, so that was the latest result at the time. This is the result for segmentation, and this is the result for joint segmentation and POS tagging. When you add DLG you get some improvement, and you always get some improvement, so eventually the result is about the best result at the time. So that's the second approach.
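Before moving on, here is a rough continuation of the earlier sketch showing how DLG-based features of the kind just walked through might be attached to a character: for each candidate word length, look at every n-gram of that length covering the character, keep the one with the highest corpus-level DLG score, and emit the character's position tag in that n-gram together with a floored-log strength. The tag names follow the B/E/S scheme from the talk, but the exact feature templates in the study may differ, and dlg_score is assumed to be a lookup over precomputed scores.

    import math

    def position_tag(index, length):
        """Tag of a character inside a word: S = single, B1/B2/... = position, E = last."""
        if length == 1:
            return "S"
        if index == length - 1:
            return "E"
        return "B" + str(index + 1)

    def dlg_features(chars, pos, dlg_score, max_len=4):
        """For the character at `pos` and each word length up to max_len, pick the
        covering n-gram with the highest DLG score and emit its implied position tag
        plus a bucketed strength, to be appended to the standard feature vector."""
        features = {}
        for length in range(1, max_len + 1):
            best_score, best_tag = None, None
            for start in range(pos - length + 1, pos + 1):
                if start < 0 or start + length > len(chars):
                    continue
                score = dlg_score(tuple(chars[start:start + length]))
                if best_score is None or score > best_score:
                    best_score, best_tag = score, position_tag(pos - start, length)
            if best_score is not None:
                bucket = int(math.floor(math.log2(best_score))) if best_score > 1 else 0
                features["dlg_len" + str(length)] = best_tag + ":" + str(bucket)
        return features

For the C3 example above, length two considers C2 C3 and C3 C4; if C2 C3 scores higher, the emitted feature is the tag E with whatever strength bucket its score falls into.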
The third study is really the most recent one, and I think it is interesting in the sense that here we are not talking about two domains; we are talking about two languages. The idea is: what if you have two languages that are closely related? We know that annotating corpora is very expensive, so we have resources only for certain languages, and there are fewer resources for ancient languages, for obvious reasons. First, there's no money there, so nobody wants to work on them. Second, we don't have a lot of data, and so on -- and you have one speaker, or no speakers. So the question is: can we improve an NLP system for an ancient language when we have resources for the modern language? We want to take a look at what we can get. There have been some previous studies, and here I list some of them. Most of those studies are based on spelling normalization; for example, in Middle English you spell a word this way, but in [indiscernible] you spell it that way. Here we want to do something entirely different. The languages we use are archaic Chinese and modern Chinese, and once again we focus on word segmentation and POS tagging. The idea is to find the properties shared by those two languages and then exploit that information.

Just to tell you how different old Chinese is: based on this book -- of course, there are different ways to divide Chinese -- there are four eras of Chinese, and here are the time periods. Basically, archaic Chinese was spoken before 200 AD, and if you compare it with modern Chinese, the difference is huge. As a result, modern Chinese speakers are not able to read archaic Chinese without training. For example, I remember in middle school we actually had to learn how to understand those texts, and that was my favorite subject at the time, because everything else I learned in Chinese class was totally useless. That's actually a separate story; the way they taught Chinese was awful. But anyway, that's the reason I remember how different those [indiscernible] are.

What we did is look at one book. This book is a collection of essays or treatises written by this man and his retainers in that time period. The interesting part is that it has 21 chapters covering a wide range of topics, so you can imagine they come from different domains, and it gives a very good picture of what people were interested in at the time. So we have this book, and my colleagues created a corpus from it: they did the word segmentation and the POS tagging. They started with the Chinese Penn Treebank guidelines, because every time you create a corpus you need guidelines. They followed the same principles -- when you talk about words, there are actually different definitions of a word, depending on what you are trying to do -- but the decisions can be different. For example, in this string, the first character means country and the second character means family. In modern Chinese it is one word that just means country, but in old Chinese it is very often used as two words, so it is really country and family. It's an interesting debate why country plus family became just country, but that's a separate story; that's Chinese philosophy, right? Country is always more important than family, which I kind of disagree with. This is just a summary of the corpus: the number of characters, number of words, number of sentences, and average sentence length. Now the question is: suppose you have resources for modern Chinese, can you use them to improve performance?
What we did is this: for the training data, we assume we have some labeled data for this old Chinese corpus, but we also have a modern Chinese corpus; here we used the Chinese Penn Treebank, version 7. The test data come from the old corpus, right. We use five-fold cross-validation: four folds for training and one fold for testing. You can imagine what kind of baselines we have, right. You can use the in-domain data only, so in that case it's the old Chinese only; you can use the modern Chinese data only; or you take the union. Those are the obvious baselines. For our approach -- to give away the story right now -- if you just use the union, you actually don't get any improvement, as you can imagine. So we want to do something a bit more clever than that. We want to find properties shared by those two languages, represent those properties as features, and then build a system. The question is, what properties are shared by those two languages?

So we did some study, right. This slide is just some statistics, but the most important part is here: the average word length. You can see that for modern Chinese the average word length is 1.6, and for old Chinese it's 1.1. That's actually consistent with our understanding of Chinese as a language: it went from a language of monosyllabic words to a language with more multisyllabic words, right. So basically, you can see this is the kind of difference. You can imagine that for word segmentation, if you want to get a word segmentation for old Chinese, you can just add a break after every character and you will be 90 percent correct. So that's one thing. And this one just gives you a breakdown: for old Chinese, 86 percent of word tokens have only one character, right -- that's the percentage of word tokens with only one character -- and so on and so forth. For modern Chinese, roughly half have only one character and about half have two, and then you have a tail of longer words.

And now if you look at POS tagging, this is the percentage of word tokens per [indiscernible] tag in this corpus. So NN, that's noun; PU is punctuation; VV is a verb; AD is an adverb; and so on, so forth. You can see that for the first four categories, they are roughly [indiscernible], right; that's not surprising. But I want to show you something else. There are actually many POS tags that do not appear in old Chinese. Some of those are due to content, right -- you can imagine the [indiscernible] are talking about different things; there were no URLs at the time, and there are no foreign words, for obvious reasons. But then there's something else. For example, BA, right. BA is the POS tag for this word, ba, in the ba-construction, and it turns out this construction did not really appear until about 200 years after this book was written, right, so the construction was not there at the time. And similarly for those POS tags for DE, right. There is this special Chinese character -- actually there are three that are all called DE, right? -- and they actually have different POS tags because they behave very differently. That's also a kind of new phenomenon, in the sense that they did not have that usage until much, much later. Similarly for the aspect markers, right, they are very different.
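As a hedged aside, the corpus statistics being compared here (average word length, word-length distribution, POS-tag distribution) are easy to reproduce; the sketch below assumes each corpus is given as a list of sentences of (word, POS) pairs, and the names are invented for illustration.

```python
from collections import Counter

def corpus_profile(sentences):
    """Compute the statistics compared in the talk from a segmented,
    POS-tagged corpus: sentences is a list of lists of (word, pos) pairs."""
    word_lens = Counter()
    pos_tags = Counter()
    n_words = n_chars = 0
    for sent in sentences:
        for word, pos in sent:
            word_lens[len(word)] += 1
            pos_tags[pos] += 1
            n_words += 1
            n_chars += len(word)
    return {
        "avg_word_len": n_chars / n_words,                        # ~1.6 modern vs ~1.1 old
        "len_dist": {k: v / n_words for k, v in word_lens.items()},
        "pos_dist": {k: v / n_words for k, v in pos_tags.items()},
    }
```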
So therefore, you can see that if you just look at the POS tag distributions, they actually look pretty different. Therefore, if you build your POS tagging model from modern Chinese and use it for old Chinese, the result could be really bad. So is there anything that they actually share, right? One thing they share is the character set. By the character set, I don't mean the set of characters in the sense that we could say how many characters there are in Chinese -- nobody knows; I don't think there's any single person in this world who really knows, right. A conservative estimate would be something like 60,000 characters. But for normal daily conversation, or even if you look at the Chinese Penn Treebank, the number of distinct characters is only about 5,000. So if you know 8,000 characters, you'll be fine. But then, very often I will see people's names that I don't know how to pronounce, because I just don't recognize that character. I don't think there is a single person in the world who knows all the characters or how many characters there are.

But the meanings of characters normally don't change that much over time. So here is one character that means top, or up. It can appear in this word, which means top, right -- that's a localizer. This one means climb, so this word means climb up. And this one means Shanghai, right, so that's just a name. So you can imagine this character can appear in different words, and those words will have different POS tags, but the meaning of the character doesn't really change that much over time. And the POS tag is sometimes [indiscernible] related to the meaning, because a character very often is a word -- of course, that's not always true. So this is what we use for [indiscernible]: the cTag is just the POS tag of a character.

So what we do is we say, okay, for each character there will be one or multiple tags, and you want to learn those tags. We actually don't have cTag labels in our corpus, so we just make a very simple assumption. We say, suppose this word is LC, right, a localizer; then each character in it also has LC as its cTag. And then you just go through your corpus and count the frequencies. So you can see how many times this character is part of a localizer, right -- that's for the modern Chinese, and that's for the old Chinese. The frequencies are different because the corpus sizes are different, but if you look only at the top one, that's like the core meaning, and that actually remains roughly the same. So we are saying maybe that's something they actually share, right.

Something else shared by the two languages is word formation patterns. For example, suppose a word is a noun, suppose it has two characters, and suppose the cTag for each one is [indiscernible]; then we write this as this pattern. Here I give you an example: this character means politics and this one means place, and politics plus place means government, right. So there are different kinds of word formation patterns, and here are the raw counts in those corpora and the percentages. You can see the percentages are not identical, but you can see similar patterns in the two corpora, so maybe the patterns will also be useful. That means the cTag can actually give some information about the POS tag of the whole word, right.
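To make the cTag heuristic concrete, here is a small Python sketch of the procedure just described; the corpus format (sentences as lists of (word, POS) pairs) and the function names are assumptions for illustration, not the speakers' code.

```python
from collections import Counter, defaultdict

def learn_ctags(sentences):
    """Every character in a word simply inherits the word's POS tag as its
    cTag; count how often each character carries each tag, and keep the
    most frequent cTag per character for later use as a feature."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for word, pos in sent:
            for ch in word:
                counts[ch][pos] += 1      # e.g. a char in a localizer word gets LC
    top_ctag = {ch: c.most_common(1)[0][0] for ch, c in counts.items()}
    return counts, top_ctag

def formation_pattern(word, pos, top_ctag):
    """Word-formation pattern: the word's POS paired with the cTag sequence
    of its characters, e.g. a 2-character noun might yield 'NN=NN_NN'."""
    return "%s=%s" % (pos, "_".join(top_ctag.get(ch, "UNK") for ch in word))
```

The top entry per character is the shared signal: it can be extracted from the modern Chinese treebank and still be informative for the old Chinese text.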
That's the idea. Now, just to summarize: the two corpora actually differ a lot with respect to word length distribution and POS tag distribution, but they share a lot of information about cTags and also word formation patterns. The hypothesis we have is that if you just add CTB directly to your training data, very likely the performance will not improve, but if you add cTags as features, maybe you will get some improvement.

Just to show you the results -- once again, for word segmentation and POS tagging -- here is the size of the training data and test data, and this is only for the old Chinese; I did not include a number for CTB. CTB, of course, is huge; it's like one million words, right. We use CTB only in the training stage, as you can imagine. This slide just shows which [indiscernible] we use when we do word segmentation. We call those tags position tags, to distinguish them from the cTags: a cTag is the POS tag of a character, and a position tag just says whether a character is at the beginning of a word or the end of a word, and so on, so forth. We use a standard CRF tagger for this, and we use our standard feature set, but now we add the cTag. The frequencies of those cTags come from training data only; otherwise, it would be cheating. So these are what we call the basic features -- those are just character unigrams and bigrams. This is the cTag feature; the zero here means you look at the current character and take the most frequent cTag for that character. And the DLG features -- I mentioned DLG before -- are just additional features we used here.
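To make the feature templates concrete, here is a hedged Python sketch of how the segmentation features just described, and the POS-tagging features discussed a little further on, might be written; the function and feature names are assumptions, the returned dictionaries are the kind of thing one would feed to a standard CRF toolkit, and `top_ctag` is the most-frequent-cTag table built from training data only.

```python
def segmentation_features(chars, i, top_ctag):
    """Character-level templates for the word segmentation CRF (the labels
    would be position tags such as B/M/E/S)."""
    prev_c = chars[i - 1] if i > 0 else "<s>"
    next_c = chars[i + 1] if i + 1 < len(chars) else "</s>"
    return {
        "c0": chars[i],                # character unigrams
        "c-1": prev_c,
        "c+1": next_c,
        "c-1c0": prev_c + chars[i],    # character bigrams
        "c0c+1": chars[i] + next_c,
        "ctag0": top_ctag.get(chars[i], "UNK"),   # cTag of the current character
    }

def pos_features(words, j, top_ctag):
    """Word-level templates for the POS-tagging CRF: word unigrams/bigrams
    plus 'affix' features, where the prefix is just the first character of
    the word, the suffix the last, and each affix also contributes its cTag."""
    w = words[j]
    prev_w = words[j - 1] if j > 0 else "<s>"
    return {
        "w0": w,
        "w-1": prev_w,
        "w-1w0": prev_w + "_" + w,
        "prefix": w[0],
        "suffix": w[-1],
        "ctag_prefix": top_ctag.get(w[0], "UNK"),
        "ctag_suffix": top_ctag.get(w[-1], "UNK"),
    }
```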
Just to show you the numbers, let's look at the first group. Suppose you only use the basic features -- BF means the basic features, meaning you only use character n-grams, unigrams and bigrams. This row means you use the basic features from the old Chinese corpus only; that's the baseline. And this is if you use the data from modern Chinese only, CTB. You can see this result is awful. It's worse than the baseline where you just add a break after every character. And this is the result when you take the union; it's not that awful, but it's still worse than the baseline. So now we say the basic features come only from the old Chinese corpus, and what about the cTag? When you use the cTag from the old Chinese corpus, the result is actually a little bit worse -- not much worse -- and the reason is that this corpus is pretty small, and because we used this very simple way to get the cTag, it's not very reliable, and so on, so forth. But if you get the cTag from the Chinese Penn Treebank, the modern Chinese, you actually get some improvement. And then if you add the DLG features, you get some more improvement. So you might say, oh, maybe that improvement is really just because you have more training data. What if you have much less training data from the old Chinese? For this experiment, we take the training data for old Chinese and only use a small percentage -- instead of 100 percent, we take, say, ten percent -- and compare the performance. So this line, the red line, is when you only use the old Chinese corpus; this one is when you add the cTag feature from CTB; and this is the line where you add the DLG feature, right.

So you can see when you only use ten percent of the training data from the old Chinese corpus, you actually get a much bigger improvement, because now you really have very little data. So modern Chinese, even though it's very different from old Chinese, gives you a bigger boost. And similarly for POS tagging, we get basically the same kind of result. For the feature set, it's the same thing, except that we use word unigrams and bigrams, right, and then you can add word affixes. For Chinese, it's hard to decide whether something is an affix or not, so what we do is that the prefix is just the first character of a word and the suffix is just the last character of the word, and the cTag of an affix is just the cTag of that character. And it's the same kind of result, right. For the basic features, if you only use the old Chinese corpus, that's the baseline; if you use the modern Chinese, you get a worse result; when you use the union, it hurts, right. And now when you add the cTag, you get some improvement, and if you use the cTag from the modern Chinese, [indiscernible] improvement. And once again, if you add the affix features, you get some improvement. Here's the same chart as before: when you have less training data from the old Chinese, you get a much bigger improvement, right -- it's like four percent here, just from adding the data from the modern Chinese.

So what I mean here, just to summarize: those two languages -- I mean, some people would say, oh, they are really the same language. We don't want to argue about that; I would say they are different languages, because I cannot understand old Chinese relying only on what I learned in high school or middle school. But they are actually very different, right, with respect to word length or POS tag distribution. As a result, if you add the modern Chinese data directly, it does not really help. But if you find what they really share and then [indiscernible] features, you actually get some improvement, especially when you have a very small amount of data from old Chinese.

So just to summarize, I introduced three different methods, and they look at different angles, right. The first two use a semi-supervised setting, which means you don't have labeled data from the target domain. You can either combine those, right -- this is [indiscernible] -- or you can use unsupervised learning and then use its output as a feature to train your supervised method. And the last one looks at two closely related languages, and I think the key point there is really that you have to find out what they share, because they can be very different. For future work, I'm sure there are other features that we can use, and we want to identify those; a more interesting question is whether those features can be identified automatically, without prior knowledge of the domains or languages -- in the sense that if you give me the domain and give me the data, can I actually figure out, without using my expert knowledge, what features I should look at. And then we want to apply this to other tasks. Thank you.

>>: It seems ancient Chinese and modern Chinese are in this unique relationship in that one is a descendant of the other. A more interesting case, I think -- a more applicable one -- would be sister languages, where they actually share a common ancestor. Now, finding an ancestor where you actually have good data like this would be difficult.
So you might, in some families, be able to find that. Being able to establish a relationship with that ancestor and then [indiscernible] through that ancestor to some other modern language that has resources would be interesting, because that is directly applicable to a [indiscernible].

>> Fei Xia: Yeah.

>>: For the problem of the ancestor, I guess you could get into a reconstruction kind of thing too, where you basically -- and this has been done a lot, obviously, in linguistics -- reconstruct some common ancestor. If you then associate with that reconstructed ancestor and somehow use the information from that ancestor to help inform descendant languages, that would actually be very interesting.

>> Fei Xia: Yeah, I'll just say there is this one, actually, [indiscernible], but there is actually a Middle English treebank done by Penn, and not only Middle English -- they also have maybe Middle German or High German or something like that. So they actually have ancient -- not ancient, but middle-period versions of several languages, so I think that would be interesting. Suppose you have Middle English and Middle French, I don't know.

>>: I think the problem there -- okay, so you could use similar techniques here. Again, the problem is, what are the characteristics, what are the features that you need to reconstruct the tagger or something for one of these ancestor languages?

>> Fei Xia: Right.

>>: But in a way, what are the descendants of the modern -- excuse me, of Middle High German or whatever it is. Yeah, Middle High German, what are the descendants -- I guess there are multiple. It's not just High German, but [indiscernible] or something. But you obviously can go further. What's nice about this is you're going very far back in time. I mean, this is 2,000 years ago, which is actually -- it's phenomenal.

>> Fei Xia: Thank you.

>>: There's a language largely in Kenya called Sheng, which is a slang version of Swahili and English mixed together. They were going to call it Dheng, with a DH in there, but it's Sheng. And the corpus for Sheng comes in the form of what's cool, you know: social media, advertising, pop songs. But it's always changing, because it's slang and it's current. So if we were to take a snapshot of Sheng in time and use the established corpora for Swahili and English to try to tag Sheng -- forgetting the fact that it's always changing -- would this approach be useful, since Sheng is still considered a resource-poor [indiscernible] and any particular snapshot of Sheng is a resource [indiscernible]?

>> Fei Xia: Yeah, I think the key challenge, or something I really want to look into, is whether you can really identify those shared properties automatically, without knowing the language. I think the reason we looked into this is that, as a Chinese speaker, I have the intuition that the [indiscernible] meanings do not really change too much. But on the other hand, I'd like to see if there's any way to find out what features are really shared. I imagine one could definitely run some experiments to see this, and it also depends on what kind of change it is: just the spelling, or just the alphabet, or is it the grammar? What is changing? For example, what we did here -- actually, for parsing I would imagine it would be harder, because word order changes so much, but there could still be something that is continuous. But I'm saying it's not [indiscernible] to find out what exactly is shared by the languages.
I think that's kind of --

>>: Multi-heritage languages, coming from two different language families, like [indiscernible] children or, you know, things like that. You wouldn't expect [indiscernible]. Now you have it. But what is the [indiscernible] parent of the language? [indiscernible] one parent is Arabic and one parent is [indiscernible] -- what do we do then? So could we approach it even then and say, oh, look, we have [indiscernible] corpora, or we have the [indiscernible] corpora, we have [indiscernible] corpora, now we can get Urdu corpora?

>> Fei Xia: I think one thing I'm always fascinated by but never got time to look at is code switching, in the sense that when you code-switch -- here we're only talking about two languages, but you can have multiple languages -- the question is what is being switched, right. Is it a word, is it the syntax, what is being switched? So now, when you have a language that comes from multiple parents, which part do you get from parent one, which part do you get from parent two, and what's the interaction? I think those would be very interesting questions to look into. But that's something where I always feel, okay, I can look into that when I have free time. It's something I'm always fascinated by, yeah.

>>: I don't think anything in your methodology says that these have to be parent/child relationships. Do you have any [indiscernible] that do this on very closely related languages, something like [indiscernible], which are not quite mutually intelligible but really close?

>> Fei Xia: There has been a lot of work done for related language pairs. For example, this Google treebank [indiscernible] -- I guess maybe you've heard of that already. Basically, version one has six languages, and for version two, I think they have like nine languages. And what they did, you can imagine: you can train on one language and then just use that model, but in the delexicalized version -- meaning you only have the rules but no lexical items -- and parse the other one. And conceptually you'd say that if those languages are very similar, then the performance should be better, and that's consistent with some findings they have. But definitely, the two languages do not have to be in this kind of ancestor/descendant relation. They can just be related, and then the question is: if you know they're related, there are certain things that are shared, and you can use those. But there are certain things you know have changed already -- for example, you know one is SVO and one is SOV -- so what do you do? Do you handle it in pre-processing, while you're building the model, or in post-processing? How do you incorporate the knowledge you learned? I think that would be very interesting, in the sense that you can do the first run knowing nothing about the languages, but after the first run you actually know something about the language, and then the question is how you incorporate the knowledge you just learned. It's like [indiscernible]: from the old data, you get some information about the grammar, and you want the second run to do something better, right, because you know the language already. But definitely, the constraint is always time, not interest. I'm interested in a lot of stuff, but in the short term I might not be able to work on that.
>>: Actually, it's a really interesting case, because with the ODIN data, using data from all the different languages -- 1,200 languages -- the alignment data basically helped inform languages that didn't have much data. You could actually use the alignment information from the other languages to help inform the tools you were developing for the resource-poor language. So even if the languages aren't related, there is information you can pull in, if you have enough data and enough signal.

>> Fei Xia: It's almost like nearest neighbor, right -- except not just the nearest; you use all the neighbors, you see the differences, and somehow take that into account. So in theory, we don't really require any kind of relation between the languages. It's just that the closer they are, the more likely you are to get some improvement.

>>: More shared features.

>> Fei Xia: More shared features, right.

>>: To backtrack to a more technical question: in the first portion, when you were doing training data selection, how were you actually deciding which portion of the training data to use?

>> Fei Xia: Oh, which portion? Like you are saying, is that ten percent or 20 percent, how do you decide?

>>: Well, yeah. If you're taking ten percent or whatever percent, how do you decide which?

>> Fei Xia: Oh, that's where you have to define the similarity function. What you do is you have training data, you have test data, and then you compare how similar they are. For example, you can use entropy, right: you build a language model on one domain and then test it on the other one. So we have a very good paper on that. You can use any similarity function you want, but the idea is that you look at each sentence in your source domain, compare it with the target domain data, see how close they are, and then define what you mean by close.
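As a hedged illustration of the similarity-based selection Fei describes -- not code from the talk or from any particular paper -- a minimal cross-entropy selector might look like the sketch below; the unigram language model and the function names are simplifying assumptions, and real systems would typically use higher-order language models or the difference of in-domain and general-domain cross-entropies.

```python
import math
from collections import Counter

def select_top_percent(source_sents, target_sents, percent=10):
    """Build a simple unigram, add-one-smoothed language model on the
    target-domain sentences, score every source-domain sentence by per-word
    cross-entropy, and keep the closest `percent` of them.
    Sentences are lists of word tokens."""
    counts = Counter(w for s in target_sents for w in s)
    total = sum(counts.values())
    vocab = len(counts) + 1

    def cross_entropy(sent):
        # lower value = sentence looks more like the target domain
        return -sum(math.log((counts[w] + 1) / (total + vocab)) for w in sent) / max(len(sent), 1)

    ranked = sorted(source_sents, key=cross_entropy)
    k = max(1, len(ranked) * percent // 100)
    return ranked[:k]
```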