>> Jianfeng Gao: Good morning. We are very glad to have Yang Liu here to give a presentation about their work. Let me give a very brief introduction to Yang Liu. Yang Liu is an assistant researcher at the Institute of Computing Technology, Chinese Academy of Sciences. He received his Ph.D. degree in computer science from ICT in 2007. His major research interests include statistical machine translation and Chinese information processing. He has been working on syntax-based modeling, word alignment and system combination. His paper on tree-to-string translation won the Meritorious Asian NLP Paper Award at COLING/ACL 2006. He has served as a reviewer for journals and conferences including TALIP, TSLP, EMNLP, AMTA and SSST. Let's welcome Yang Liu. [applause] >> Yang Liu: Thanks a lot for the generous introduction. Hello everyone. It's my honor to be here and give a talk introducing our major work on statistical machine translation. The title of this talk is "An Overview of Tree-to-String Translation Models." This is the outline of my talk. First I will give a brief introduction to our group. Then I will present four tree-to-string translation models: tree-based, tree-sequence-based, forest-based and context-aware. The talk ends with a conclusion. Our institution is the Institute of Computing Technology, Chinese Academy of Sciences. It is located in Beijing, China. Our NLP group is led by Professor Qun Liu, and there are five faculty members and over 20 Ph.D. and master students in the group. Our research areas include machine translation, lexical analysis and parsing, information retrieval and information extraction. We started our SMT research in 2004, and we have been working on the following directions: syntax-based models, maximum entropy-based reordering, rule selection, word alignment, system combination and domain adaptation. We have published a number of ACL and EMNLP papers on syntax-based models. We have proposed a series of tree-to-string models in the recent three years. 
The tree-based model in ACL 2006, the tree-sequence-based model in ACL 2007, the forest-based model in ACL 2008 and EMNLP 2008, the context-aware model in EMNLP 2008, and a dependency-based tree-to-string model in WMT 2007. We also have a paper in ACL 2009 about forest-based tree-to-string translation. And we have published many papers in other directions. We proposed a maximum entropy-based reordering model for BTG, that is, bracketing transduction grammar, and we also used a maximum entropy model to take contextual information into account to help select appropriate rules in decoding, both for Hiero and for the tree-to-string model. We proposed one of the first discriminative word alignment methods in ACL 2005, and this year we proposed to use weighted alignment matrices to help improve statistical machine translation. Regarding system combination, we have two papers published this year. One paper is about joint decoding with multiple translation models: we try to directly combine different systems in the decoding phase; in other words, we derive a joint decoder for multiple systems. The other paper is about replacing the confusion network with a lattice in system combination. And we have a paper about domain adaptation in EMNLP 2007. We have been very active in recent machine translation evaluations. In this year's NIST evaluation we participated in the Chinese-to-English track and achieved the best result in system combination. We also achieved very good results in last year's IWSLT evaluation. We have successfully turned our research into products. We collaborated with some corporations and developed a patent translation system. We also developed an SMT system that translates travel expressions in real time on mobile devices. Now I begin to introduce our four tree-to-string models. So what is tree-to-string? In our tree-to-string model, there is a tree on the source side, and the target side is a string. 
Often we use word alignments to indicate the correspondence between the tree and the string. Our hope is that we can exploit the syntactic information on the source side to direct translation. Our tree-to-string model is closely related to work by USC/ISI and Microsoft Research. In 2001, Yamada and Knight proposed a noisy channel model for string-to-tree translation, and in 2004 Galley and others proposed the GHKM algorithm to extract string-to-tree rules automatically from annotated training data. In 2005 Quirk and others proposed a treelet system; they use a dependency tree on the source side. And in 2006 Galley extended his GHKM algorithm to obtain composed rules and handle unaligned words in a better way. So, inspired by Quirk's and Galley's work, we proposed our tree-to-string model in ACL 2006. Similar to Galley's work, we use a phrase-structure tree, but in the reverse direction. And like Quirk's work, we put emphasis on the syntax of the source side rather than the target side. Later that year Liang Huang also proposed a very similar tree-to-string model, and we think the two works are actually equivalent. Based on our work in 2006, we proposed three extended tree-to-string models: the tree-sequence-based model in 2007, the forest-based model in 2008 and the context-sensitive model in 2008. Now I will first introduce the original tree-based model and then give a brief introduction to the three extended models. There are two basic problems in the tree-to-string model: first, how to extract tree-to-string rules automatically from the annotated training data; and second, how to decode with these extracted rules. So we will first discuss the rule extraction algorithm. The input of our algorithm is a training example: a source-side parse tree and a word-aligned sentence pair. So how do we extract tree-to-string rules from this training example? Basically, our algorithm is quite similar to that of Hiero, the hierarchical phrase-based system. 
Recall that when extracting hierarchical phrases, Hiero first identifies an initial phrase pair and then subtracts sub-phrase pairs to obtain rules with variables. The difference here is that we require that there must be a tree over the source phrase, so it is a tree-string pair rather than a string-string pair. For example, consider this Chinese word [chinese]: we examine whether there is a tree that dominates this phrase. We can find a tree rooted at node NR, so this is a syntactic phrase. And we can find a corresponding target phrase [chinese] with the alignment information. So this is a tree-string pair, and this pair is consistent with the alignment, so we can extract a tree-to-string rule here by directly using the tree-string pair as a rule. The left-hand side of this rule is a source tree, and the right-hand side is a string, so it is a tree-to-string rule. Similarly, for the second source word, we can find a tree rooted at node P, and the corresponding target phrase is "with," so this is also a tree-to-string rule. And for [chinese], similarly, we can also extract a rule. However, for the source word [chinese], we cannot extract a tree-to-string rule, because the source word [chinese] outside the tree-string pair is aligned to a target word inside it. It is not consistent with the word alignment, so we cannot extract a rule here. Similarly, we cannot extract a rule for [chinese] because [chinese] is aligned inside. And for [chinese] we can also extract a rule. Then we examine the two-word phrases. First we consider the phrase [chinese], and we find that there is no tree dominating this source phrase, so it is not a syntactic phrase and we cannot extract a rule here. For [chinese], we can find a tree over the source phrase, and the target phrase is "with Sharon," so we can extract a rule here. And then, like Hiero, we can subtract some sub tree-string pairs to obtain rules with variables. 
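The alignment-consistency test used throughout this rule extraction example can be sketched in a few lines. This is my own illustration, not code from the talk; spans are inclusive word indices:

```python
# Sketch of the alignment-consistency test (my own illustration).
# `alignment` is a set of (src, tgt) word-index links; source span
# [i1, i2] and target span [j1, j2] are inclusive.  A phrase pair is
# consistent when at least one link falls inside it and no link
# connects a word inside one span to a word outside the other.

def is_consistent(alignment, i1, i2, j1, j2):
    if not any(i1 <= s <= i2 and j1 <= t <= j2 for (s, t) in alignment):
        return False                     # require at least one link inside
    for (s, t) in alignment:
        # a link that is inside on exactly one side violates consistency
        if (i1 <= s <= i2) != (j1 <= t <= j2):
            return False
    return True
```

With this predicate, a tree-to-string rule may be extracted from a syntactic phrase only when `is_consistent` holds for its source and target spans, which is exactly why the phrases with links crossing the pair's boundary are rejected above.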
For example, we can remove [chinese] and NR here and "Sharon" there, replace them with a variable X, and obtain a rule with variables. In this way we can extract many rules from this training example: some have only terminals, some have both terminals and non-terminals, and some rules have only non-terminals. Okay. So the second problem is how to decode with these tree-to-string rules. The input of our decoder is a source tree. Our job is to decompose the tree into many small tree fragments, use extracted rules to match these tree fragments, and thus form the translation. Our decoder runs in bottom-up order. First we consider the node NR, search the rule table, and try to find a rule that matches the tree rooted at NR. Suppose we find a rule in the table, [chinese] to "Bush," whose source tree exactly matches the tree rooted at NR. Then we can take the target side of the rule, "Bush," as the translation for the node NR. Next we consider its parent NPB. Suppose we find a rule that has one variable; this rule partially matches the tree rooted at NPB. According to the rule, we can replace X1 with the translation of NR, so the translation for NPB is also "Bush." Similarly, we can find the translation for P; its translation is "with." For the other NR it is also an exact match, so its translation is "Sharon," and for its NPB it is also "Sharon." For the node PP, the rule partially matches the tree, so we can replace X1 with the translation of NPB, "Sharon," and its translation is "with Sharon." We can translate the other nodes in a similar way; the slide shows the rules used, giving "hold," "has," "talk," "talk," "had a talk," and "had a talk with Sharon." Finally, for the root node IP, we find a rule that partially matches the tree, and according to the rule we replace X1 with the translation of NPB and X2 with the translation of the VP. 
So its translation is "Bush had a talk with Sharon." This is just a toy example to illustrate the decoding process; in our real decoder we do not store a single string at each node. Actually, we have [inaudible] for efficiency. We compared our tree-based model with Pharaoh on the NIST 2005 Chinese-to-English test set; the absolute improvement is about 0.9 BLEU points, and the difference is statistically significant. Okay. Although very promising, this model faces several problems. First, the tree-based tree-to-string model imposes a syntactic constraint on the source side, requiring that there must be a source tree over the source phrase. This makes many bilingual phrases inaccessible, which decreases translation quality dramatically. Second, parsing is very important for syntax-based models: if the parse tree is wrong, the translation will be wrong too. The third problem is that the tree-based model does not take contextual information into account. Currently this is not a big problem, because many systems are context-free, but it would be very interesting if we could take contextual information into the tree-to-string model. Accordingly, we proposed three extended tree-to-string models, the tree-sequence-based model, the forest-based model and the context-aware model, to alleviate these three problems. The tree-sequence-based model is designed to alleviate the rule coverage problem. As mentioned above, the tree-based tree-to-string model requires that there must be a tree over the source phrase. For this phrase [chinese], we cannot find a tree that dominates the source phrase: [chinese] dominates [chinese], AS dominates [chinese], and [chinese] dominates [chinese]; there is no single tree that subsumes the whole source phrase. So in the tree-based tree-to-string model we cannot extract a tree-to-string rule here. However, the bilingual phrase [chinese] and [chinese] is consistent with the alignment, so it is a valid bilingual phrase. 
It can be used by Moses or by Pharaoh, but it cannot be used by our system. [inaudible] and others 2006 report that about 28 percent of phrase pairs are not syntactic on English-Chinese data. Losing such nonsyntactic phrase pairs decreases translation quality dramatically, so it is very important for a tree-to-string model to capture them. So our solution >>: Excuse me, just to be clear, though. You can capture a phrase with a hole in it that says [chinese] with a noun phrase that gets translated? >> Yang Liu: Yes, that's true. >>: But you can't find just the [chinese]. >> Yang Liu: Yes. >>: Requiring it to be followed by a noun phrase. >> Yang Liu: Yes, but we require more context. It must contain [chinese]; yeah, we can include this information with a bigger rule. >>: And you can also then generalize out the [chinese] to just be any noun, right? >> Yang Liu: Yeah, yes. >>: Okay. >> Yang Liu: Yes. >>: But then you require it to have a single object, for instance, and it can't have -- it's a brittle rule? >> Yang Liu: Yes. >>: Okay. Thanks. >> Yang Liu: So our solution is that we allow several trees rather than a single tree here. We call this a tree sequence; it's just a sequence of trees. If we allow two trees over the source phrase, this is a valid tree-sequence rule, even though it is not a valid tree-to-string rule. So this is a new kind of rule; we call it a tree-sequence-to-string rule. Now in our new model we extract both tree-to-string rules and tree-sequence-to-string rules. We use tree-to-string rules to capture syntactic phrase pairs and tree-sequence-to-string rules to capture nonsyntactic phrase pairs. So in principle there is no loss in rule coverage: we can use all the bilingual phrases that can be used by Moses. In decoding, our input is still a source tree, and we use both tree-based and tree-sequence-based rules to match the input tree. 
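The bottom-up matching that the decoding examples walk through can be sketched as follows. This is a minimal illustration under my own toy encoding, not the talk's real decoder (which keeps k-best hypotheses and a language model): a tree node is a (label, children) or (label, word) tuple, rule patterns may contain numbered variables, and the first matching rule wins.

```python
# Sketch of bottom-up tree-to-string decoding (my own toy encoding).
# A rule is (pattern, target): the pattern may contain ("X", k)
# variables that match any subtree, and the target is a list of
# ("W", word) tokens and ("X", k) variables.

def match(pattern, tree, binding):
    """Try to match `pattern` against `tree`, binding variables to subtrees."""
    if pattern[0] == "X":                      # variable matches any subtree
        binding[pattern[1]] = tree
        return True
    if pattern[0] != tree[0]:                  # node labels must agree
        return False
    if isinstance(pattern[1], str):            # lexical leaf
        return pattern[1] == tree[1]
    if isinstance(tree[1], str) or len(pattern[1]) != len(tree[1]):
        return False
    return all(match(p, t, binding) for p, t in zip(pattern[1], tree[1]))

def translate(tree, rules):
    """Return the target string for `tree`; the first matching rule wins."""
    for pattern, target in rules:
        binding = {}
        if match(pattern, tree, binding):
            return " ".join(
                translate(binding[tok[1]], rules) if tok[0] == "X" else tok[1]
                for tok in target)
    raise ValueError("no rule matches node %r" % (tree[0],))

# Toy parse of the example sentence (romanized stand-ins for the
# Chinese words) and a small rule table.
tree = ("IP", [
    ("NPB", [("NR", "bushi")]),
    ("VP", [
        ("PP",  [("P", "yu"), ("NPB", [("NR", "shalong")])]),
        ("VPB", [("VV", "juxing"), ("NN", "huitan")]),
    ]),
])

rules = [
    (("NR", "bushi"),   [("W", "Bush")]),
    (("NR", "shalong"), [("W", "Sharon")]),
    (("NPB", [("X", 1)]), [("X", 1)]),
    (("PP",  [("P", "yu"), ("X", 1)]), [("W", "with"), ("X", 1)]),
    (("VPB", [("VV", "juxing"), ("NN", "huitan")]),
     [("W", "had"), ("W", "a"), ("W", "talk")]),
    # this rule reorders: the PP moves after the VPB
    (("VP", [("X", 1), ("X", 2)]), [("X", 2), ("X", 1)]),
    (("IP", [("X", 1), ("X", 2)]), [("X", 1), ("X", 2)]),
]
```

Here `translate(tree, rules)` produces "Bush had a talk with Sharon": partial matches bind variables to subtrees, and the target side of each rule decides both lexical choice and reordering, mirroring the node-by-node walk in the toy example.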
Due to the time limit I will not discuss the rule extraction algorithm and the decoding algorithm in detail here. We compared our tree-sequence-based model with Pharaoh and with our tree-based model on the NIST 2005 Chinese-to-English test set; the absolute improvement over Pharaoh is about 2.2 BLEU points, and the improvement over the tree-based model is about one BLEU point. Okay. >>: I know you didn't want to talk about decoding, but can you tell us if there's a computational hit, or is it basically -- >> Yang Liu: Yes, yes. I mean, the decoding speed is much slower, maybe five times slower. >>: Okay. >> Yang Liu: Yeah. Okay. I can give some detail here. In tree-based decoding we just visit every node here, but our tree-sequence-based decoder works with spans: for every span there is a tree or a tree sequence, so we visit more spans than the tree-based decoder. So it is more computationally expensive. >>: With a single tree, there's only a linear number of nodes but a quadratic number of spans, order of magnitude? >> Yang Liu: Yeah. So actually in our tree-sequence decoder we use a chart to store the hypotheses, and in the tree-based decoder we just use stacks associated with each node here. >>: So is this a valid tree sequence? >> Yang Liu: No. >>: Must be the node. >> Yang Liu: Yeah. This is not consistent with the alignment, because this is aligned here and some phrase is aligned here. So it's not consistent with the alignment. >>: Isn't the tree sequence very similar to the phrase of -- >> Yang Liu: Yeah, actually -- >>: So is it equivalent, if we extract only the tree sequence? They should be the same as -- >> Yang Liu: Actually, it depends on the definition of tree sequence. In our original definition in this work, we require that there must be a tree over the phrase. But [inaudible] applied our tree sequences to tree-to-string translation with a much looser definition: any phrase pair counts as a tree sequence, so there need be no tree over the source phrase. 
So in that paper I don't think the tree-sequence rule is equivalent to a phrase pair, because there must be a tree over -- >>: So given B aligns to pair [phonetic], just assume. >> Yang Liu: Assume that there's a link. >>: That there's a link, even in that case [chinese]. >> Yang Liu: Cannot be translated. >>: Says no tree. >> Yang Liu: No, no. Because you don't hear -- >>: There's no link here, but there's a link here. >> Yang Liu: Yeah, yeah. We can extract a tree-sequence rule here. Yeah, we can. It's just word alignment. So we can only translate -- if this is aligned here and there are no links aligned here, we can extract. >>: Even if they don't hang off the same subtree? >> Yang Liu: Yes. And the third model is the packed forest-based model. We know that parsing is very important to syntax-based models. The parsing accuracy for English is around 90 percent, and for Chinese it's around 85 percent, and the accuracy goes down dramatically when handling real-world text because of the domain change. So we propose to use packed forests to replace one-best trees in our model. This idea was inspired by Quirk and others 2005, who mentioned that one can replace the tree with a packed forest. Actually, there are many parse trees for a sentence. Suppose for this sentence we have two different parse trees; we can pack the two trees into one structure by sharing common nodes and edges. This is called a packed forest. A packed forest can store exponentially many trees in just polynomial space, so we can replace the one-best trees with packed forests in our tree-to-string model, both in rule extraction and in decoding. For extracting tree-to-string rules, we now have a forest-string pair with a word alignment. So how do we extract tree-to-string rules from this training example? Can we still use the rule extraction algorithm of the tree-based model? The answer is no. 
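The packing just described, sharing common nodes and edges so that exponentially many trees fit in polynomial space, is naturally represented as a hypergraph. A minimal sketch follows; the node names and spans are my own illustration, not the talk's exact figure:

```python
import math
from functools import lru_cache

# A toy packed forest written as a hypergraph.  Each node maps to its
# list of incoming hyperedges; a hyperedge is a tuple of tail nodes.
# Leaves have one incoming hyperedge with no tails.
forest = {
    "NR[0,1]":  [()],
    "CC[1,2]":  [()],
    "NR[2,3]":  [()],
    "VPB[3,5]": [()],
    "NPB[0,1]": [("NR[0,1]",)],
    "NPB[2,3]": [("NR[2,3]",)],
    "NP[0,3]":  [("NPB[0,1]", "CC[1,2]", "NPB[2,3]")],
    "PP[1,3]":  [("CC[1,2]", "NPB[2,3]")],
    "VP[1,5]":  [("PP[1,3]", "VPB[3,5]")],
    # two competing analyses share their subtrees at the root
    "IP[0,5]":  [("NPB[0,1]", "VP[1,5]"),
                 ("NP[0,3]", "VPB[3,5]")],
}

@lru_cache(maxsize=None)
def count_trees(node):
    # number of distinct trees rooted at `node`: sum over incoming
    # hyperedges of the product over tail nodes (an empty product is 1)
    return sum(math.prod(count_trees(t) for t in edge)
               for edge in forest[node])
```

Because shared subtrees are counted once and combined multiplicatively, the count (and, in the real system, rule extraction and decoding) runs in time linear in the number of hyperedges even when the number of encoded trees is exponential.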
Recall that for tree-based rule extraction we first identify a source phrase, then examine whether there is a tree subsuming the source phrase, and then try to find the target phrase. However, in a packed forest there are exponentially many trees over a source phrase. This is just a toy example with two trees dominating the sentence; in our real training data we often have millions of trees over a string, so it is impractical to enumerate all the trees explicitly to find the tree-string pairs. Instead we resort to the GHKM algorithm proposed by Michel Galley. The most important idea in his algorithm is to find the so-called frontier nodes, which indicate where to cut the forest to obtain the tree fragments that form tree-to-string rules. The red nodes are frontier nodes. So what is a frontier node? It's actually very simple: if the phrase pair subsumed by a node is consistent with the alignment, it is a frontier node. For the node VPB, it dominates the phrase pair [chinese] and "had a talk," and this is a valid bilingual phrase that is consistent with the alignment, so VPB is a frontier node. On the contrary, for the node NP, the corresponding phrase pair is [chinese] and "Bush had a talk with Sharon." It is not consistent with the alignment, because [chinese] outside the tree-string pair is aligned inside, so NP is not a frontier node. Now, given the forest annotated with frontier nodes, how do we extract tree-to-string rules? It's very simple. First we visit every frontier node. For example, [chinese] has two incoming hyperedges. Consider the first hyperedge: it has two tail nodes, NPB and [chinese], and both are frontier nodes, so we can stop here, check the alignment information, and extract a rule. This is called a minimal rule. Then, for the second hyperedge, we first check its two tail nodes, NP and VPB. 
VPB is a frontier node but NP is not, so we have to keep examining its incoming hyperedges, and we arrive at NPB, CC and NPB. All three nodes are frontier nodes, so we can stop here and extract a tree-to-string rule. Similarly, for the node VPB, its two tail nodes are frontier nodes; and for the node NR there is also a valid tree-to-string rule. In this way we can extract all minimal rules, and then we can combine the minimal rules in different ways to obtain composed rules. For decoding, the input is a packed forest rather than a tree. So how do we match the forest to find the translation? Actually, the decoding algorithm is almost the same; the difference is that a node might have multiple incoming hyperedges. Still we consider the node NR: we find a rule that translates it as "Bush," the rule exactly matches the tree rooted at NR, and we take the target side, "Bush," as the translation of NR. If a node has just one incoming hyperedge, like NR, NPB and PP, the decoding algorithm is the same as in the tree-based decoder, so we can easily find the translations for these nodes: "Sharon," "Sharon" and "Bush and Sharon" for these three nodes, "with Sharon" for the PP, and "hold," "has," "talk," "talk," "had a talk." The difference is the root node: IP has two different incoming hyperedges. Our strategy is to handle the hyperedges individually. First we consider this hyperedge and search the rule table, trying to match it. Suppose we have a rule that matches this hyperedge; then we can replace X1 with the translation of NPB and X2 with the translation of VP, and the translation is "Bush had a talk with Sharon." Then we consider the second hyperedge. Suppose a rule matches the forest like this; then we replace X1, X2, X3, X4 with the translations of the already-translated nodes, and the second translation is "Bush and Sharon had a talk." So actually this is also a toy example. 
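The frontier-node test that drives this extraction can be sketched as follows. This is a GHKM-style simplification of my own: a node is identified here only by the source span it covers, and the test is exactly alignment consistency of the induced phrase pair.

```python
# Sketch of the frontier-node test (GHKM-style; names and the spans
# are my own simplification).  A node covering source span [i1, i2)
# is a frontier node when the target words linked to that span are
# not also linked to any source word outside it, i.e. the induced
# phrase pair is consistent with the word alignment.

def is_frontier(alignment, i1, i2):
    inside_tgt = {t for (s, t) in alignment if i1 <= s < i2}
    if not inside_tgt:
        return False                 # an unaligned span is not a frontier
    lo, hi = min(inside_tgt), max(inside_tgt)  # closure of the target span
    return not any(lo <= t <= hi and not (i1 <= s < i2)
                   for (s, t) in alignment)
```

Because the test only needs each node's span and the alignment, every node in the packed forest can be checked without enumerating any of the exponentially many trees, which is what makes forest-based rule extraction practical.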
In our real decoder we use the rule table to convert the packed forest into a translation forest. In a translation forest, each hyperedge is associated with a tree-to-string rule rather than a CFG rule. Then we decode on the translation forest using a language model to output the one-best and k-best derivations. I will not give the details here. This slide gives our major results. The columns indicate where the rules are extracted from: one-best trees, 30-best trees, or packed forests. And the rows indicate what we decode on: a one-best tree or a packed forest. We can see that if we use one-best trees in both rule extraction and decoding, the BLEU score is about 25, and if we use packed forests in both rule extraction and decoding, the BLEU score is about 28. So the improvement is about 2.5 BLEU points, which is very significant. Also, if we replace one-best trees with packed forests in either training or decoding alone, we obtain a significant improvement. And our forest-based decoder also outperformed Hiero, the state-of-the-art hierarchical phrase-based system. The third extended model we call the context-aware model. In machine translation, a source word might have multiple target words as translations, and a source phrase might have multiple target phrases. In our tree-to-string model, a source tree might have multiple target strings: "had X1," "held X1," "X1 took place," and so on. So in decoding, which rule should be used? Conventionally, we use four probabilities: relative frequencies in two directions and lexical weights in two directions, and maybe we resort to the language model to choose the right rule. We argue that which candidate should be used in decoding should depend on the context: if the surrounding context changes, the right rule should change too. 
So for each left-hand side, I mean each tree, we build a maximum entropy classifier that takes contextual information into account to select the best right-hand side. The basic idea is quite simple: at training time we memorize the surrounding context and encode the contextual information into a maximum entropy model; then, in decoding, for each rule we examine the surrounding context and calculate a score for the rule using the maximum entropy classifier. We call this the context-aware tree-to-string model, and we design many features to capture the contextual information. Suppose this is a training example and we extract a rule here, replacing this phrase with a variable. When extracting this rule, we memorize some contextual information. The first feature used in our maximum entropy model is the external boundary words: we care about the left neighbor and the right neighbor of the source phrase. Another feature is the boundary words of the subtracted source phrase in the variable; in this case it's [chinese]. We also think the parts of speech of the neighbors, which we call external boundary parts of speech, might be useful for selecting rules; in this case they are NR and VV. Similarly, there are the parts of speech of the internal boundary words. We also care about how many words are in the subtracted source phrase, the parent node of the tree, and the sub-nodes of the tree. We collect all this information at training time, encode it in the maximum entropy model, and collect many training examples to train on the training data. In decoding we use this maximum entropy classifier to decide which rule should be used. >>: Can I ask a clarification? Do you have one classifier for each source configuration? >> Yang Liu: Yes, so we have many, many classifiers. 
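The contextual features just listed, the boundary words, their parts of speech, the length of the subtracted phrase, and the parent node, can be sketched as a feature extractor. The function and feature names here are my own, not the actual system's:

```python
# Sketch of the contextual features described above (my own naming).
# For a rule whose source side covers words[i..j] (inclusive) in a
# POS-tagged sentence, collect the external boundary words, their
# parts of speech, the length of the covered phrase, and the parent
# node label.

def context_features(words, tags, i, j, parent_label):
    n = len(words)
    return {
        "left_word":  words[i - 1] if i > 0 else "<s>",
        "right_word": words[j + 1] if j + 1 < n else "</s>",
        "left_pos":   tags[i - 1]  if i > 0 else "<s>",
        "right_pos":  tags[j + 1]  if j + 1 < n else "</s>",
        "span_len":   j - i + 1,
        "parent":     parent_label,
    }
```

A maximum entropy classifier over a rule's candidate target sides would then be trained on such dictionaries, one classifier per left-hand-side tree, as the talk describes.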
>>: A lot of the source configurations may be very sparse in the data; you may have seen only a very small number of instances of one source configuration. So you won't have many examples to learn your classifier from. >> Yang Liu: All right. So on this slide I don't have an example. >>: No, no. I'm saying that if you've only seen, say, four examples in the training data -- you've only seen a particular source configuration four times -- then you have very little data to train a classifier. >> Yang Liu: Yes, that's true. That's a problem, because some training examples occur very infrequently. >>: Right. >> Yang Liu: Yeah. Actually, our paper did not investigate the effect of the training corpus size. I guess if we used a larger training corpus, maybe we could have a more accurate estimation. >>: With larger training data you need to train more classifiers, but for each classifier you will have more training data. >> Yang Liu: Yes. For each left-hand side -- for each tree -- we have to train one as well. >>: Features could just transfer across these subtrees. >> Yang Liu: Oh, yeah. >>: And just get more training data for each classifier. >> Yang Liu: Yeah, I think this is good information. >>: Before we move on: a lot of the features you've shown don't depend on what prediction you're making. >> Yang Liu: You mean the target side? >>: Yeah, most of your features are features of the source side. >> Yang Liu: Source side, yes. >>: How do those get tied to each possible target side? Do you treat each target side as a separate class or do you pick words individually? >> Yang Liu: Actually on the target side we have the n-gram language model, so the model can capture some non-local dependencies. Another problem is that if we designed features on the target side, we could not use dynamic programming in decoding. >>: Right. Right. So I'm just saying, for a given rule, you're going to have a source side. Maybe I missed this in some of your features. 
But for a given rule you'll have a source side and then a bunch of target sides -- ignoring the rest of the sentence, just a bunch of possible translations for that source. >> Yang Liu: Yeah. >>: Do you treat those each as a distinct class or do you look at them -- >> Yang Liu: Yeah, distinct classes. >>: Like option one, option two? >> Yang Liu: Yes. Every candidate target string is a target class in the classifier. >>: Can I get some idea of how many classifiers there are? >> Yang Liu: How many classifiers? Actually, if a tree has just one string in the rule table, we do not train a classifier for it. I don't know the exact number. >>: [inaudible]. >> Yang Liu: Maybe thousands, yeah, maybe thousands on this corpus. We compared our context-aware model with the context-free one, that is, the tree-based model. On the NIST 2003 test set the improvement is about 0.9, and on the NIST 2005 test set the improvement is about 1.2. Our tree-to-string models have attracted increasing attention in the recent two years. In ACL 2008, Min Zhang from Singapore applied tree sequences to tree-to-string translation, and they obtained a very significant improvement. Actually, their tree-sequence-based tree-to-string system outperformed Moses; I think this might be the first tree-to-string system to outperform Moses. And this year in ACL 2009, Hui Zhang, also from Singapore, combined tree sequences and packed forests together for tree-to-string translation. I haven't read the paper yet, but I think it is an interesting paper. Also in ACL 2009 we applied packed forests to tree-to-string translation, and we also obtained a very significant improvement. To conclude, our tree-to-string translation model is one of the syntax-based models. We put emphasis on the syntax on the source side, and on the target side it's just a string. We have presented a series of tree-to-string models: tree-based, tree-sequence-based, forest-based and context-aware. 
And our work has had an increasing impact in the community; many researchers are interested in it and follow up this direction. Okay. Thanks. [applause] Any questions? >>: Give us a very quick description of the system you used in this year's competition. >> Yang Liu: This evaluation? >>: In this evaluation. The tree-to-string, what is the -- >> Yang Liu: In this year's NIST evaluation we used about four single systems. One is this one, the forest-based tree-to-string system. Another is a reimplementation of Hiero, hierarchical phrase-based. Another is BTG; yes, it's a phrase-based system. >>: So using the maximum entropy model to reorder. >> Yang Liu: To reorder, yeah. And another is Moses. And we developed a new system combination technique, and our improvement over the single best system was three points in the NIST evaluation. >>: What is the single best system? >> Yang Liu: I don't know. Maybe -- >>: Which system? >> Yang Liu: Maybe the tree-to-string, the forest-based tree-to-string. We used about six million sentence pairs to train our tree-to-string model, and we used our in-house parser to parse the Chinese text. >> Jianfeng Gao: Let's thank the speaker again. [applause]