>> Matthew Hohensee: Okay, so this paper we called "Getting More from Morphology in Multilingual Dependency Parsing." This is based on my Master's thesis research at UW in CLMS with my adviser, Emily Bender, which we're also presenting at [Inaudible] next month. So I'll start with our research question: can we use linguistic knowledge of morphological agreement to improve dependency parsing performance on morphologically rich languages? I'll talk a little bit more later about the motivation for that question. But the basic idea is that we have all these languages that have a lot of morphology, suffixes, prefixes, other kinds of morphological markers, and we want to see if we can make use of that intelligently. And that is actually what came first, and then I decided that dependency parsing would be a good place to try to apply that. So this is a little overview of what we did; we developed -- and I'll go into much more detail on all this -- a simple model of morphological agreement, added it to a dependency parser called MSTParser and tested that on a sample of 20 treebanks. We were able to get accuracy improvements of up to 5.3% absolute. We found out that some of that was due to our model capturing things that weren't actually related to the agreement that we were trying to model, so when we controlled for that the improvements were about 4.6%, which is still pretty good. Outline of the talk: I'll give some background on agreement, on the CoNLL shared tasks which kind of set the stage for this kind of work, other related work, and about the parser; methodology, some of the changes that we made to the parser and how we implemented that, and the data we collected and how that was prepared. And I'll talk about our experiments and results and conclude. So to give some background on morphological agreement, there's this idea of morphological complexity, which varies widely across languages and refers to the number of different forms that a word can take depending on case, number, tense, all these different factors. At one end of this spectrum we have morphologically poor languages. Sort of a canonical example is Chinese, which doesn't really have any morphology. Words don't really take different forms depending on number, on person, any of that. English has a small amount of morphology. We conjugate verbs in a few cases; pronouns can be different depending on case. Nouns are different if they're plural. The other end of the spectrum is morphologically rich languages or synthetic languages. A canonical example there is Czech, which has I think four genders and seven cases, and generally nouns can take forms for all combinations of those; although some of them are sort of collapsed, so that actually not all combinations are represented. That's a little example with the same sentence in Czech and English, and I can't pronounce the Czech. But you can see that the adjective, which is the first word, is inflected as feminine, plural and nominative case. The verb: feminine, third-person, plural, nominative case. Sorry, that's the noun. And then the verb at the end: third-person, plural and present tense. The same sentence in English: we only have a little bit of inflection for the noun, that it's plural, and it's inherently third-person, although that's not really marked in any way. And the verb is plural because if the verb were singular it would be grows not grow.
The CoNLL shared tasks in 2006 and 2007 focused on multilingual dependency parsing. The organizers collected about 12 or 13 treebanks which were parsed and annotated, and those were distributed. And the participants trained systems on those and tested them on test sets that the organizers provided. The way the parsing was set up, they would start with tokenized and POS-tagged sentences, which also have morphological information annotated on them, and predict the head and an arc label for each token. So this is some sample data of what that looked like. For that sentence we have, for each word: the ID, which is just the index of the word in the sentence. The form, which is just the word. A lemma, which was not -- not all of the treebanks have lemmas in them, but some of them did, so we ended up using them where they were there. Then there are two POS fields, coarse and fine, and again not all of the treebanks had two POS fields. For instance we used the English Penn Treebank; there's only one POS. And I'll talk a little bit more about this later, but we ended up generating coarse POS tags for all those so that we would have two tags for each word. There is a field with morphological information, which is the most important for this work. And as you can see, it just includes information about any inflections or inherent sort of morphological information about the word. The name John is obviously third-person and singular, and pizza is singular. That's about all we have in this example. And then for each token we have the head and the arc label or the relation type. So just to really quickly go through this: in dependency parsing we're just looking at the head for each token, and everything is basically headed by the finite verb. So here, for instance, John is a subject. That's headed by ate, which is the finite verb, and that is in a subject relation. So what was found in the CoNLL tasks was that morphologically rich languages were the most difficult to parse. A lot of the parsers rely really heavily on word order. And as sort of a high-level generalization, morphologically rich languages tend to have more flexible word order because you mark the constituent roles with morphology rather than with word order. So that was kind of a generalization that they came up with, and then there's this quote from the organizers saying that one of the most important goals for future research is to develop methods to parse morphologically rich languages with flexible word order. So that's kind of the starting point for this work, what we wanted to try to tackle. This is a little summary of different approaches to using the morphological information, and a lot of these citations are participants in those CoNLL shared tasks, not all of them but most of them. They don't fall too neatly into language-independent and language-specific, but I kind of tried to group them that way. So if we're looking at these treebanks and looking at an arc, a potential arc, between head and dependent, and each of those tokens has morphological attributes, we can take each of those attributes separately and include that as a feature for the token. We can keep the entire list of attributes. For instance, if there's person, number and gender all marked, keep that as one single feature, which makes it a little bit more like a really finely-grained POS tag. Or we can take all the combinations of the head and dependent attributes.
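As a concrete illustration of the token format just described, here is a minimal Python sketch of reading one token row with roughly those columns (ID, form, lemma, coarse POS, fine POS, morphological features, head, arc label). The exact column layout varies across the CoNLL shared-task treebanks, so the field names and the attribute=value encoding here are assumptions for illustration, not a quote of the actual data files.

```python
from dataclasses import dataclass

@dataclass
class Token:
    """One token row in roughly the layout described above."""
    idx: int        # position of the word in the sentence (1-based)
    form: str       # the word itself
    lemma: str      # lemma, or "_" if the treebank has none
    cpos: str       # coarse POS tag
    pos: str        # fine POS tag
    feats: dict     # morphological attributes, e.g. {"per": "3", "num": "sg"}
    head: int       # index of the head token (0 = the root)
    deprel: str     # arc label / relation type, e.g. "SBJ"

def parse_token_line(line: str) -> Token:
    # Columns are tab-separated; empty columns hold "_".
    cols = line.rstrip("\n").split("\t")
    feats = {} if cols[5] == "_" else dict(
        kv.split("=", 1) for kv in cols[5].split("|"))
    return Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                 feats, int(cols[6]), cols[7])

# Example (hypothetical row): parse_token_line("1\tJohn\tjohn\tNOUN\tNNP\tper=3|num=sg\t2\tSBJ")
```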
So if the head and the dependent each have, say, three attributes, then we'll combine those in all different ways and sort of use each of those combinations as a feature. More language-specific approaches: we can use morphological information on a token to pick out function words or finite verbs, basically to supplement the POS tags if we need more specific information than we can get from the POS tags, adding detail to other features like POS tags. And then there was one -- maybe one other -- approach that actually decided to model agreement. So if two words have the same -- are marked the same way for an attribute, for instance they're both plural, they generate a special feature for that. And that doesn't have to be language-specific, which is kind of the idea of our research, but it was done by these people in a language-specific way. So I think they only chose a couple of languages and modeled a few specific types of agreement. The other work that has been done: there's been a lot of work involving MaltParser, which is not the parser we used. But that's been tested on a lot of different treebanks and a lot of languages, and they've always found that incorporating the morphological information can give them boosts in accuracy. So that sort of showed us that there was promise for using this morphological information. A little information on MSTParser. There's a citation. They use a machine learning approach, look at each arc between a potential head and dependent, and there's a whole group of features that are generated. I guess I'll talk about the features in the next slide. They enumerate features at the arc level, estimate feature weights and save all those. And then to decode they just find the highest-scoring parse based on those feature weights. And linguistic knowledge here is incorporated mostly via that feature design. So the way we were trying to add some more linguistic knowledge was by tinkering with the feature templates. These are the features that are used by MSTParser out of the box. There are lexicalized parent and child features, a lot of POS tag features involving the word order, the parent, child, surrounding words, intervening words. And then this is a list of the morphological features that it generates, which -- It's pretty hard to -- I guess that's -- that's a lot to understand. Basically what that all is, is the index of the head and dependent and various combinations of the word forms, the lemmas, and the direction of the attachment and the distance to the head or to the dependent. So here's a little summary of our agreement model. We decided to look at two tokens and add an agreement or disagreement feature if the head and the dependent are both marked for the same attribute. In other words, if we're looking at say a noun and a verb and they're both marked third-person, then we would just generate a feature that says, you know, "This noun and this verb agree for person, and they are related." In the case that we had an attribute that was not matched on the other token, we added an asymmetric feature. So that just said, "Well, we have a noun and a verb. The noun is marked for some attribute, and the verb is not marked for that at all." So this approach is language-independent. We didn't really use any information about the languages. The treebanks were already annotated with morphological information that was generally in a format that we could just use straight out of the treebank. It represents a kind of backoff. We're losing a little bit of information, which is the actual value of the attribute.
If we're -- say the noun and the verb are marked with third-person, it doesn't matter that it's third-person. What matters is that they agree. So we're just trying to save the useful information here. And it's high-level, so we're letting the classifier use agreement as a feature rather than having to discover agreement relationships as it goes through all the features. These are our feature templates, and I'm a little low on time so I'll skip those. Sample sentence and features generated: that's a sentence of Czech, and that's just a list of the features which we generate, which are agreement features whenever two tokens agree and the asymmetric features when they don't agree. This is the list of the treebanks that we collected. There's a range there. Hindi-Urdu has an average of 3.6 morphological attributes for each token. And we included the Penn Chinese Treebank sort of as a reminder that there's this whole set of languages that don't have any morphological information really at all. And then there's sort of a range in between there. To prepare the data -- and I'll move pretty quickly through this -- we normalized the coarse POS tags. There is a paper by Petrov et al. who suggested a universal tag set. So in the case where there was only one set of POS tags, we generated these coarse tags. And if there were already coarse tags, we sort of normalized them to be from the same tag set. We normalized the morphological information so that it would be useful to us, basically just into the form of attribute and value. For instance, case equals nominative or gender equals feminine. And we generated morphological information for the Penn English Treebank, which doesn't have any. And we randomized the sentences, used five-fold cross-validation and averaged accuracy, run time and memory usage across the five folds. We can skip the projective thing for now. And we ran the whole parser on each treebank four times: once with the original features sort of out of the box, once with just our features replacing the original features, once with both sets and once with neither. So the MSTParser word order and POS tag features, those were always retained. It's just the morphological features that we were swapping in and out, our version versus the original version. This, again -- I have a graphical summary of this on the next page. But what's important to point out, this is the twenty treebanks and the four feature configurations. And what I want to point out especially is that the run times and feature counts are roughly the same, so this is no morphological features. And this is our set over here, so again the number of features is roughly the same for our version, which is "agr", the agreement model, and for no features. The original version has much higher feature counts and run time because so many features are generated. And we were able to cut down the size of the feature set a lot. This presents our accuracy results. The way to read this is, no morphological is the original -- not the original. It's the version of the parser with no morphological features at all. And that's the blue bar. And then the improvements due to adding each of the other feature sets are on top of that. So in every case the yellow one is the highest even if it's only by a sliver, and that's the accuracy using only our feature set. And that performed better than using both feature sets, I think mostly because there were so many features generated by the original set that it was tending to swamp the classifier.
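Before moving on to the rest of the results, here is a minimal sketch of the kind of agreement and asymmetric features just described. The feature-string format and function name are my own illustration, not the actual MSTParser feature templates from the paper.

```python
def agreement_features(head_feats: dict, dep_feats: dict,
                       head_pos: str, dep_pos: str) -> list:
    """Generate agreement-style features for one candidate arc.

    head_feats / dep_feats map attribute names (e.g. "num", "per",
    "case") to values. The feature strings below are illustrative,
    not the exact templates used in the paper.
    """
    feats = []
    for attr in set(head_feats) | set(dep_feats):
        if attr in head_feats and attr in dep_feats:
            # Both tokens are marked for this attribute: record whether the
            # values match, but deliberately not what the values are.
            status = "AGREE" if head_feats[attr] == dep_feats[attr] else "DISAGREE"
            feats.append(f"{status}:{attr}:{head_pos}-{dep_pos}")
        else:
            # Only one side is marked: asymmetric feature saying which side.
            side = "HEAD" if attr in head_feats else "DEP"
            feats.append(f"ASYM:{attr}:{side}:{head_pos}-{dep_pos}")
    return feats

# e.g. agreement_features({"per": "3", "num": "sg"}, {"num": "sg"}, "VERB", "NOUN")
# -> ["AGREE:num:VERB-NOUN", "ASYM:per:HEAD:VERB-NOUN"]  (order may vary)
```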
Moving pretty quickly through this, this is just sort of what our results generally look like. I should move ahead. The Hebrew dataset we used comes with both gold and automatically-predicted morphological information, so we ran it on both and basically found that our feature set was a little bit more robust to using the predicted tags and annotations than the original feature set was. The original feature set lost close to 3% and ours was closer to 2% when using the predicted tags. I think maybe I will skip over this so we have time for some questions. This is, to summarize it really quickly: we found that our feature set was capturing some information that was not related to morphology, that was improving accuracy because the original feature set didn't have any feature with just POS tags and the arc label. So we sort of compensated for that and improved the baseline a little bit. So basically controlling for that, these are our new results, which are just due to the agreement model and not to the ancillary effect that we were capturing with our feature set. So it decreased our improvement numbers a little bit. Comparing to the previous slide, there's a little bit more of a correlation here between the amount of improvement, the top of the yellow bar compared to the blue bar, which means that we're getting -- So we're getting less improvement due to our features. But the amount of improvement we're getting is more closely correlated with the morphological complexity of the language, which is the -- I didn't mention that. The X-axis. So on the left Hindi is the most morphologically rich language, and Chinese at the other end is the least. So we looked at the correlation statistically between the morphological complexity of the language and the improvement we were getting and calculated the correlation coefficient for that, Pearson's r. And once we compensated, once we controlled for that PPL effect using just our feature set, we got about .75, which is a much stronger correlation, which is what we'd expect. Before controlling for that effect we were getting a much weaker correlation because that effect, which I didn't go into in great detail, was, I think, obscuring the correlation between the complexity of the language and the improvement that we were getting. So future work, these are some things we want to work on in the future. And I want to try to get to some questions. The answer to the question we asked in the beginning: yes, we were able to improve parsing performance on all the languages with morphology and reduce feature counts and run times. And I guess we have maybe a minute or two left. So.... [ Applause ] >> : Maybe a question or two? >> : So you were using gold standard morphological tags. And you showed a comparison on Hebrew with automatically extracted ones. But Hebrew was one of the least morphologically rich languages that you tested. So, I mean, how viable is automatic extraction of these morphological features for more morphologically rich languages? >> Matthew Hohensee: That's a good question. This was the only data we had that that was available for. So it would've been interesting to try it on the more complex languages. We did find that the no-morph version here actually had the worst performance. So when there were no morphological features, that's when we got the biggest hit from using the automatic tags. So that implies it was more the POS tagging, not the morphological annotations, that was decreasing the accuracy. But that's a good point.
>> : So I assume you did this on first-order features and decoding? >> Matthew Hohensee: Yes. So MSTParser has second-order features about POS tags and word order. We didn't do any -- None of our morphological features were second-order; they were all first-order. >> : Would you expect the same improvement? Does it actually cover something that maybe the second-order features might cover if you just had, you know, agreement features? >> Matthew Hohensee: Right. That's a good question. Yeah. I don't know. >> : All right. Thanks. [ Applause ] >> Matthew Hohensee: I don't really know what we have to do here. >> Will Lewis: Okay, so for our second talk we have Anthony telling us about Hello, Who is Calling? Can Words Reveal the Social Nature of Conversations? >> Anthony Stark: Thank you. A bit of a strange title, and at its core it was an initial study testing whether there was utility in using [Inaudible] for downstream processing, in particular whether we could get clinical applications running out of this. And it was a collaboration between the engineering and clinical sides. So overall the biggest possible picture of what we're trying to -- Oh, I'll go over the overview of the talk first. I'll give you a brief motivation and overview of the data that we collected, the corpus, the ASR that we used, and also describe the engineering problem and of course some of the experimental results on top of that. So at the biggest possible scope, what we're interested in doing is inferring the preferential, social and cognitive characteristics of an individual, and the best possible way to do that is to directly probe someone's brain. Now that's not too far-fetched -- you've got MRIs -- but in a lot of applications that's not really practical. So we usually have to look at that through some sort of filter. And obviously speech is a pretty direct model of what people think, how they think and their cognitive capacity. And there are really two possible ways that you can look at that: you can directly go to the person and ask them their opinions, how healthy they are, etcetera, or you can observe them in their natural environment. And depending on what you really want out of your [inaudible], whether you want subjectivity and naturalness, whether you have design constraints, whether that person has time constraints, you have to pick one or the other. And obviously for this study we picked the latter, where we're directly observing natural speech in the natural environment. So what would the role of automation be in this? Well, the biggest one of course is that it's a lot cheaper. If you've got a thousand hours of recorded audio, it becomes very expensive and time-consuming if you want to do the low-level human transcription and processing on this. And the other important thing, especially for our case, was the privacy constraints. If you're recording these people in their natural environment, it could be very privacy-sensitive. And so we don't really want people looking through these conversations. So the current situation of the whole language processing field is that it's quite mature in separate areas. At the lowest level we've got ASR systems trained on thousands of hours of audio, and at the other end of the system we've got fairly mature text analysis. And even on top of that we've got a lot of psychology studies that show correlates between various behaviors and how language is used.
So unfortunately the literature is kind of sparse on testing that [inaudible], whether there is utility in using ASR transcripts right down at the end to see if you can find clinical outcomes. And really word error rate is not the end goal here. So the big question of this study was to see how sensitive our application was to word error rate, whether we could deal with 10% word error rate or even 50% word error rate. So the specific application that we're interested in is looking at elderly individuals to track cognitive health, health outcomes. So the overall goal was to track these people and track their language as they age or if they develop dementia. So on the clinical side of it, optimistically, we would want to provide a fully automated pipeline for diagnosing these individuals or at least flagging them when they're possibly susceptible to developing dementia. But more pragmatically we're just trying to find correlates and possibly provide a first-pass screening tool to see if we can map dementia onto the natural language record. On the actual engineering side of things, which is what the rest of my talk will mostly be on, we're looking to see if we can efficiently, reliably and anonymously infer the type and frequency of social conversations. And as an additional point, are there any markers that are unique to the automated route? So the corpus that we collected is focused on the telephone usage of these individuals, and they're all healthy, 75 years or older. The important thing here is that these are not subjects that we collected in isolation. They are part of an OHSU cohort that is already enrolled in Alzheimer's research. So there's a parallel data series of MRIs, clinical assessments of their physical health, mental health and activity reports, so very dense longitudinal records. And the overall goal is to sort of do a backtrack of health outcomes: track these people one year into the future, five years into the future, ten years into the future, see what sort of health outcomes they have and then go back to the language record and see if we can find correlates or at least markers that give a good indication. So the core of the study was to propose an additional telephone series on top of this, and on top of the acoustic information, the actual audio channel, we record the surface information: the call times, durations, the numbers that they called, etcetera, etcetera. So the reason why we actually started probing the telephone conversations is, firstly, it gives a direct insight into the capacity for independent living, which is one of the main clinical outcomes that we're trying to look at. Secondly, it's quite a convenient channel to look at. They purposely use a telephone, and it has a fairly high SNR, relatively speaking. And this age demographic still uses telephones, which is probably the only demographic that still does. And they use it quite a lot, as I'll soon show you. So our initial run was ten individuals over twelve months, and that was about twelve thousand individual calls. So it worked out to be three or four calls per person per day, so it's quite extensive usage. And the interesting thing is we recorded all telephone conversations, all incoming, all outgoing conversations for twelve months, so a very dense record. And on top of that we did enrollment and exit interviews just so that we could get a few baselines for our ASR system.
Currently we're collecting in additional homes, so we've enrolled 45 more people, and it's up to fifty thousand conversations so far and about two and a half thousand hours. So it is quite a large corpus at the moment. So I mentioned before that we're trying to see what sort of utility we can get out of ASR. And the ASR system we developed, it's nothing too special. There are no frills. We just wanted a standard, sort of out-of-the-box ASR system. So it's a standard Switchboard-Fisher maximum likelihood system. It's got some adaptations, speaker adaptation, a trigram model. And a very, very conservative estimate would be 30% word error rate, so that would be a floor. More realistically, it's probably pushing about 50% word error rate. So you're thinking about one in every three words is just completely incorrect, or one in every two words. So it's still a pretty hefty error that you'll get out of this sort of domain. So to the actual engineering problem that we had. The telephony surface records provide you a lot of useful information: what number they called, how often they called that person, and you can develop sort of social networks out of this information. Unfortunately a lot of the high-level labels are still missing or incomplete. And what I mean by that is: are they calling family members, are they calling friends, what sort of communications are they having? Is it just straight-up formal business communication, or is it more friendly social chit chat? And that's what we're really trying to get at with this data set from the clinical side of things. So as I said, ASR error is pretty significant on this sort of open domain. And what we're trying to do is ask: what is the sensitivity of the downstream analysis to this error? What features are robust? What inferences can we still make? So our primary goal here was -- we sort of did contrive some classification experiments: whether we could tell business calls from residential calls, whether we can tell family calls from non-family calls, and whether we can tell if they're calling a familiar contact or someone that they haven't called before. And also an additional one, trying to tell family calls apart from just friends and other residential lines. So what we needed to do was process raw transcripts into a useful and anonymized format. So the only way that we managed to collect this information was that we promised not to directly listen to their telephone conversations, and we promised not to directly read transcripts of these conversations. So the first sort of classifier that you can think of is just a straight-up n-gram over the content words. After doing a little bit of normalization on the transcripts, if we remove stop words and also get rid of the rare tokens, we can form a pretty good baseline system. So just some pre-processing: we built 50/50 partitions and we used some cross-validation training. So the n-grams surprisingly give very good performance, classification-wise. And they're all around about the same sort of region here, but you can see from the business and residential lines we can classify with about mid-80% accuracy, which is quite good for possibly 50% word error rate. The other tasks are a little bit more difficult, so telling family members apart from non-family members, but still relatively high accuracy, well above the chance 50% baseline. In terms of the actual n-grams, the unigram actually did the best, which is not too surprising. That was about a ten-thousand-word dictionary.
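For concreteness, here is a rough scikit-learn sketch of that kind of unigram baseline: stop-word removal, rare-token pruning, and a cross-validated SVM. The corpus loading, the label names, and the choice of 5-fold cross-validation (rather than the 50/50 partitions used in the talk) are placeholder assumptions, not the actual OHSU pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def unigram_baseline(transcripts, labels):
    """Cross-validated unigram bag-of-words classifier, roughly in the
    spirit of the business-vs-residential experiment described above."""
    pipeline = make_pipeline(
        # Unigram counts; drop English stop words and tokens that appear
        # in fewer than 3 documents (rare-token pruning).
        CountVectorizer(ngram_range=(1, 1), stop_words="english", min_df=3),
        # A non-linear (RBF) SVM, which the talk reports doing slightly
        # better than the linear one.
        SVC(kernel="rbf"),
    )
    return cross_val_score(pipeline, transcripts, labels, cv=5).mean()

# transcripts: list of ASR transcript strings;
# labels: e.g. "business" / "residential" for each conversation.
```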
Bigrams do a little worse than unigrams, and trigrams really start dropping. That's not too surprising; if you have a 30% word error rate, that's going to produce pretty bad trigrams. Some other somewhat surprising results: the linear SVM did a little worse than the non-linear [inaudible]. And if we add in surface features, the time that they made the call, how long the call was, etcetera, it added very little classification utility. So it looks like you can get most of the contextual utility out of just the ASR transcripts, whether they're error-full or not. So I sort of papered over how I truncated a lot of those unigrams. Essentially if you get a very poor estimate, if a token appeared in only one or two conversations, so a rare word, we dropped it. If you don't do that, you get very bad accuracy; you instantly take a 15 to 20% hit. So you can ask the question, what sort of additional robustness can you gain through some sort of linear dimension reduction? And there are a few flavors that you can look at there. You can do a priori mapping. You can do two types of data-driven mapping: supervised and unsupervised. So for the a priori mapping we went to a social psychology study, and it's called Linguistic Inquiry and Word Count. And it sort of tries to map words down into sixty-ish categories of happy words, activity words, negative-emotion words, etcetera. And it didn't do particularly well; it wasn't really based on any sort of robust mathematical reduction. If we move towards data-driven methods, we looked at Latent Semantic Analysis, and that did much better even if you reduce it down to about ten semantic features. And if we use a supervised approach, so mutual information, just prune out the features that have low mutual information, you get even better results. So with a dictionary of 250 words, you can still tell these classes apart quite well. And what that shows is there's great utility in using ASR because you can collect a lot of information, and the data-driven techniques do work quite well. So this raises another question: the ASR transcripts are clearly not accurate, but they do seem to be quite consistent. So even if they don't tell you exactly the right content, they are consistent enough that you can train a decent classifier. So another question you can ask is, will structure-based features work? So part-of-speech tags: would those work on a 50% error rate transcript? And it turns out they actually do quite well. If we use a part-of-speech bigram, you do take a big hit over the content-based features. But still there's a surprising amount of utility there. And that actually does better than the psychology dictionary in a lot of cases. And that's a pretty crude representation of a whole conversation. If you try mixing it in with the content-based features, it generally doesn't help. So content features pretty much supersede everything else here. So now that we've got a sort of good baseline, you can start asking a few questions about how these conversations are structured. So an interesting one that we looked at is, where does the actual utility come from? Does it come from the start of the conversation? The end of the conversation? Or is it randomly distributed all over? So on this plot the first group of every bar plot is samples taken from the start of the conversation. The middle one is from the end, and the third bar is just a random segment. And we ramp up the size of the segments. So you can see with a 30-word opening, you get pretty much all of the utility straight away.
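Stepping back for a moment to the two data-driven reductions mentioned above, here is a sketch of what they might look like: LSA via truncated SVD down to about ten components, and supervised pruning by mutual information down to about 250 words. The component and dictionary sizes follow the numbers quoted in the talk; everything else is an illustrative assumption rather than the system actually used.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Unsupervised route: Latent Semantic Analysis, reducing the unigram
# counts down to roughly ten "semantic" dimensions before classifying.
lsa_clf = make_pipeline(
    CountVectorizer(stop_words="english", min_df=3),
    TruncatedSVD(n_components=10),
    SVC(kernel="rbf"),
)

# Supervised route: keep only the ~250 words with the highest mutual
# information with the class label, then classify.
mi_clf = make_pipeline(
    CountVectorizer(stop_words="english", min_df=3),
    SelectKBest(mutual_info_classif, k=250),
    SVC(kernel="rbf"),
)

# Either pipeline can be evaluated with cross_val_score exactly as in
# the unigram baseline sketch above.
```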
That plot, by the way, is for the business versus residential classifier. The end of the conversation or random segments of the conversation are not too good until you get up to very large sample sizes. So I just showed there that you can get good classification with a forced 30-word window, and that's sort of contrary to the general result that accuracy is correlated with conversation length. You'd normally assume that if you had a long conversation, you'd have very robust features and you should get better results. But it turns out the openings of short and long conversations are fundamentally different. So if we look at the accuracy stratified by conversation length, this is the straight-up unigram. So, again, you can see the trend: very short conversations [inaudible] poor accuracy at 75.8, long conversations quite good accuracy at 93.8. The interesting thing is if we truncate that down to a 30-word window at the start, you still get the same picture. So if someone is going to have a long conversation, for example, they just more clearly enunciate their reasons for calling. So to sort of wrap up, we looked at whether there was merit in using ASR transcripts, and it does turn out that you can derive quite a lot of utility even if you have very high word error rates. So fidelity doesn't seem to be the prime mover here; it's just the consistency of your recognizer. And you can make inferences with surprisingly small samples, this 30-word dictionary -- a 1000-word -- a 30-word opening with a 1000-word dictionary. And as future work: we have yet to look at lower-level acoustic features like [inaudible], talking rate, etcetera. All right. [ Applause ] [ Silence ] >> : Hi, did you take a look at the errors to see in the recognition whether it's like consistently missing words, like I'll never get the word school out of the recognition? Or is it like half the time I get school, half the time I miss something else, and they're kind of...? >> Anthony Stark: Not directly. The way that we measured word error rate is we enrolled them in an interview and that was very structured. And those were the only transcripts that we were allowed to look at. So it was very sparse, and that word error rate that I quoted was very much an extrapolation. So I can't actually answer that question too well. >> : I have a question about how you determined your stop words. So some stop words are actually useful for social -- distinguishing social interactions, so you might not want to throw away everything. >> Anthony Stark: Partially just through flagging common words and then me going through and sort of pruning out ones that I a priori knew weren't too useful in the context, so "and" and "the." >> : Very interesting work. I have a question about your bigram and trigram model. So basically you were saying your ASR is consistent even if it makes some errors. So I'm wondering if you have tried any smoothing techniques for bigram and trigram? Because it seems like your unigram works the best in this... >> Anthony Stark: Yeah. >> : ...case. >> Anthony Stark: So I did do some [inaudible] smoothing and it did slightly better but not significantly so. So I sort of just chopped it down to a lower-dimension feature set and just left it at that. It seemed robust to quite a lot of different pruning and smoothing techniques. [ Applause ] >> Will Lewis: So now... >> : This is Shafiq. >> Will Lewis: So now Shafiq will tell us about a novel discriminative framework for sentence-level discourse analysis. >> Shafiq Joty: Thanks. Hi.
So in this talk I'll be presenting a novel discriminative framework for sentence-level discourse analysis. But as I'll say, our approach can easily be extended to [inaudible] texts. I'm Shafiq Joty, and this is a joint talk with my advisers Dr. Giuseppe Carenini and Dr. Raymond Ng. So now let's see the problem first. We are following the Rhetorical Structure Theory of discourse, which [inaudible] a tree-like discourse structure. So for example, given this sentence, "The bank was hamstrung in its efforts to face the challenges of a changing market by its links to the government, analysts say," the corresponding discourse tree is this. The leaves of the discourse tree correspond to contiguous atomic textual spans which are called elementary discourse units or EDUs. Then adjacent EDUs or larger spans are connected by a rhetorical relation. For example, here the first two clauses are connected by an elaboration relation, then the larger span is connected by an attribution relation. Then each span in a relation can be either a nucleus or a satellite depending on how central its message is relative to the other. So what are the computational tasks in RST? Given a sentence like this, the first computational task is to break this text into a sequence of EDUs, which is called discourse segmentation. Then the next task is to link these EDUs into a labeled hierarchical tree. This is called discourse parsing. The important thing to note here is that the fact that EDUs 1 and 2 are connected by an elaboration relation has an effect on the higher-level relation, on the attribution. These kinds of dependencies are called hierarchical dependencies. Again, if we had four EDUs, the relation between EDUs 1 and 2 would have an effect on the relation between the third and fourth. These kinds of dependencies are called sequential dependencies. So we need to capture these kinds of dependencies in our model. So what's the motivation for this rhetorical parsing? Our main goal is to build computational models for different discourse [inaudible] tasks in asynchronous conversations, that is, conversations where participants collaborate with each other at different times, like e-mails or blogs. So we are mainly interested in topic modeling, then dialog act modeling and rhetorical parsing. So we built supervised and unsupervised topic segmentation models to find the topical segments in an asynchronous conversation. Now we are working on topic labeling, so do come to our poster to know more about this work. And we built an unsupervised model to cluster the sentences based on their [inaudible], like question or answer. Right now we are working on rhetorical parsing, so this talk is a part of this project. But we'll only show the results on two corpora. And rhetorical parsing has been shown to be useful in many applications like text summarization, text generation, sentence compression and question answering. And we'd like to perform a similar task on asynchronous conversations like e-mails and blogs. So here's the outline of today's talk. I'll first briefly describe the previous work. Then I'll present our discourse parser followed by the discourse segmenter. Then we'll see the corpora or datasets on which we performed our experiments. Then we'll see the evaluation metrics and the experimental results. Finally I'll conclude with some future work.
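To make the RST structure concrete, here is a minimal sketch of how the example discourse tree might be represented as nested spans with a relation and nuclearity labels. The class names, the exact EDU boundaries, and the nuclearity values are my own guesses from the description, not taken from the talk or from the RST-DT annotation.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class EDU:
    """An elementary discourse unit: a contiguous atomic text span."""
    text: str

@dataclass
class DiscourseNode:
    """An internal node: two adjacent spans joined by a rhetorical relation.
    Nuclearity says which child carries the more central message."""
    relation: str                        # e.g. "Elaboration", "Attribution"
    nuclearity: str                      # e.g. "NS" = nucleus-satellite (assumed)
    left: Union["EDU", "DiscourseNode"]
    right: Union["EDU", "DiscourseNode"]

# The example sentence from the talk, as a nested tree (boundaries assumed):
tree = DiscourseNode(
    relation="Attribution", nuclearity="NS",
    left=DiscourseNode(
        relation="Elaboration", nuclearity="NS",
        left=EDU("The bank was hamstrung in its efforts"),
        right=EDU("to face the challenges of a changing market "
                  "by its links to the government,")),
    right=EDU("analysts say."))
```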
So Soricut and Marcu first presented the publicly-available SPADE system that comes with [inaudible] models for discourse segmentation and discourse parsing, sentence-level discourse parsing. They take a generative approach, and their model is entirely based on lexico-syntactic features constructed from the lexicalized syntactic tree of the sentence. However, when the model [inaudible] in a discourse tree, they assume that the structure and the label are independent, and they do not model the sequential and hierarchical dependencies between the constituents. Recently Hernault et al. presented the HILDA system that comes with both a segmenter and a document-level parser. The model is based on SVMs. In the parser they use two SVMs in a cascaded fashion, where the job of the first SVM is to decide which two adjacent spans to connect; then once this is decided, the job of the next, upper-level SVM is to connect these spans with an appropriate discourse relation. So as you can see, their approach is a [inaudible] approach, and they don't model the sequential dependencies. Now all these works, like these two works, are on newspaper articles. On a different genre of instruction manuals, Subba and Di-Eugenio presented a shift-reduce parser. Their parser relies on an inductive logic programming based classifier, and they use rich semantic knowledge in the form of compositional semantics. However, their approach is not optimal, and they do not model the sequential or hierarchical dependencies. On discourse segmentation, Fisher and Roark used a binary log-linear model which achieves state-of-the-art segmentation performance. And they show that the features extracted from the parse trees, the syntactic trees, are indeed important for discourse segmentation. Now let's move on to the discourse parsing problem. So for now just assume that the sentence has already been segmented into a sequence of EDUs. For example, here we have three EDUs in the sentence. The discourse parsing problem is to decide the right structure of the discourse tree. So we have to decide whether EDUs 2 and 3 should be connected into a larger span and then that larger span should be connected with EDU 1, or EDUs 1 and 2 should be connected and then the larger span should be connected with EDU 3. So you have to decide the right structure and the right labels for the internal nodes, which specify the relation and the nuclearity status of the spans. So our discourse parser has two components. The first one is the parsing model that assigns a probability to all possible discourse trees for a sentence. Then the job of the parsing algorithm is to find the optimal tree. Okay? So what are the requirements for our parsing model? We want to have a discriminative model because it allows us to incorporate a large number of features, and it has been [inaudible] that discriminative models are in general more accurate than generative ones. We want to jointly model the structure and the label of the constituents. We want to capture the sequential and hierarchical dependencies between the constituents, and furthermore, our parsing model should support an optimal parsing algorithm. So here's our parsing model. Just assume that we are given a sequence of observed spans at level i of the discourse tree. Remember that we want to model the structure and the label jointly. So we put a hidden sequence of binary structure nodes on top of this. So here, this node says whether spans 2 and 3 should be connected or not.
So this is a binary node. Then on top of this we put another sequence of hidden -- a hidden sequence of multinomial relation nodes. So here this node says, if spans 2 and 3 are connected, then what should their relation be? So here you can note that we are modeling the structure and the label jointly. Now this is an undirected graphical model. Now if we model the output variables, that is the structures and the relations, directly, this is basically a dynamic conditional random field, and we are modeling the sequential dependencies here. Now you may be wondering how this model can be applied to obtain the probabilities of different discourse tree constituents. So here it is. We apply this model at different levels and compute the posterior marginals of the relation-structure pairs. So for example, just consider that we have four EDUs in a sentence. Then here's the corresponding CRF at the first level. We apply the CRF and compute the posterior marginal of this constituent, then this constituent, then this constituent. Now at the second level we have three possible sequences. The first sequence is where EDUs 1 and 2 are connected in a larger span. So here EDUs 1 and 2 are connected in a larger span. So here's the corresponding CRF. We compute the posterior marginals of this constituent and this constituent. The other possible sequence is where EDUs 2 and 3 are connected in a larger span. So here EDUs 2 and 3 are connected. Here's the corresponding CRF, and we compute the posterior marginals of this constituent and this constituent. The third possible sequence is this, where EDUs 3 and 4 are connected in a larger span, and the corresponding CRF is this. We compute the posterior marginals of this constituent and this constituent. At the third level we have two possible sequences. One is where EDUs 1 through 3 are connected in a larger span. This is the corresponding CRF. We compute the posterior marginal of this constituent. Then again for EDUs 2 through 4 in a larger span, we compute the posterior marginal of this constituent. So by computing posterior marginals we'll have the probabilities for all the discourse tree constituents. Now these are the features used in our parsing model. Most of the features are from previous work. We have eight organizational features that mainly capture length and position. Then we have eight n-gram features that basically capture the lexical and part-of-speech features. We have five dominance set features, which have been shown to be useful in SPADE. Then we have two contextual features and two substructure features. In the substructure features we're incorporating the head node of the left and right rhetorical subtrees. So by means of this we are actually incorporating the hierarchical dependencies into our model. Now once we know how to derive the probabilities for different discourse tree constituents, the job of the parsing algorithm is to find the optimal tree. So we have implemented a probabilistic CKY-like bottom-up parsing algorithm. Now, for example, if we have four EDUs in a sentence, the dynamic programming table will have four by four entries, and we'll be using just the upper triangular portion of the dynamic programming table. So T(i, j) will [inaudible] the probability of this constituent, where m ranges over the possible structures and r ranges over the relations. So as you can see, we are finding the optimum based on both the structure and the relation. So this approach will find the globally optimal discourse tree.
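Here is a minimal sketch of that kind of probabilistic CKY-style bottom-up search. It assumes a black-box scoring function for joining two adjacent spans, which in the actual system would come from the DCRF posterior marginals; the exact dynamic program in the paper may differ in its details.

```python
def best_discourse_tree(n, span_score):
    """CKY-like bottom-up search over binary discourse trees for n EDUs.

    span_score(i, k, j) is assumed to return (prob, relation): the most
    probable relation label for joining span [i, k] with span [k+1, j]
    and its probability. In the real system these numbers would come
    from the DCRF posterior marginals; here it is just a black box.
    Returns a table best[(i, j)] = (score, (split, relation)).
    """
    best = {(i, i): (1.0, None) for i in range(n)}          # single EDUs
    for length in range(2, n + 1):                          # span length
        for i in range(n - length + 1):
            j = i + length - 1
            candidates = []
            for k in range(i, j):                           # split point
                prob, rel = span_score(i, k, j)
                score = best[(i, k)][0] * best[(k + 1, j)][0] * prob
                candidates.append((score, (k, rel)))
            best[(i, j)] = max(candidates)
    # best[(0, n - 1)] now holds the best full tree; follow the stored
    # (split, relation) backpointers to reconstruct it.
    return best
```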
Now at this point, we have already described the discourse parser, assuming that the text has already been segmented into EDUs. Now let's see our discourse segmenter. So the discourse segmentation problem is: given a text like this, we're going to break the text into a sequence of EDUs like this. And it has been shown that inaccuracy in the segmentation is the primary source of inaccuracy in the discourse analysis pipeline. So we should have a good discourse segmentation model. So we framed this problem as a binary classification problem where for each word, except the last word in a sentence, we have to decide whether there should be a boundary or not. And we're using a logistic regression classifier with L2 regularization. And to deal with the sparse boundary tags, we use a simple bagging technique. Now these are the features used in our discourse segmentation model. You can find the details in the paper. We are using the SPADE features. Then we are using chunk and part-of-speech features. We have some positional features and contextual features. Now let's see the datasets or corpora. To validate the generality of our approach, we have experimented with two different corpora. The first one is the standard RST-DT dataset, which comes with 385 news articles, and it comes with a split of 347 documents for training and 38 documents for testing. In sentences, we have 7,673 sentences for training and 991 sentences for testing. And we are using the 18 [inaudible] relations which have been used in the previous work. And by attaching the nucleus-satellite statuses to these relations, we get 39 distinct discourse relations. Our other corpus is the instructional corpus developed by Subba and Di-Eugenio. It comes with 176 instructional how-to manuals. It has 3,430 sentences, and we are using the same 26 primary relations and treat the reversals of the non-commutative relations as separate relations. So in our case the relation [inaudible] two different relations. And by attaching the nucleus-satellite statuses we get 70 distinct discourse relations. Now to the evaluation metrics. To measure the parsing performance we are using the unlabeled and labeled precision, recall and f-measure as described in Marcu's book. For discourse segmentation we are using the same approach as taken by Soricut and Marcu and Fisher and Roark; that is, we measure the model's ability to find the intra-sentence EDU boundaries. So if a sentence contains three EDUs, that corresponds to two intra-sentence EDU boundaries, and we measure the model's ability to find those two intra-sentence EDU boundaries. Now here are the results. Let's first see how the parser performs when it is given manual segmentation, or gold segmentation. So here's the result. Now one important thing to note is that the previous studies mainly show their results on a specific test set. So to compare our approach with their approach, we have shown our results on that specific test set, but for generality we have also shown our results based on 10-fold cross-validation on the specific corpus. So here you can see on the RST-DT test set our model, the DCRF model, consistently outperforms SPADE. Especially on relation, it outperforms SPADE by a wide margin. And our results are consistent when we do the 10-fold cross-validation. And if we compare with human agreement, our performance in [inaudible] is much closer to the human agreement.
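Stepping back to the segmentation step for a moment, here is a minimal sketch of framing EDU segmentation as per-word binary classification with L2-regularized logistic regression. The features shown are simple placeholders for the SPADE, chunk, part-of-speech, positional and contextual features actually used, and the bagging step is omitted.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def boundary_examples(sentence_tokens):
    """One example per word except the last: does an EDU boundary follow
    this word? The feature names below are placeholders, not the
    SPADE/chunk/POS/positional features used in the real segmenter."""
    for i, tok in enumerate(sentence_tokens[:-1]):
        yield {
            "word": tok["form"].lower(),
            "pos": tok["pos"],
            "next_pos": sentence_tokens[i + 1]["pos"],
            "rel_position": i / len(sentence_tokens),
        }

# L2-regularized logistic regression (scikit-learn's default penalty).
segmenter = make_pipeline(DictVectorizer(),
                          LogisticRegression(penalty="l2", C=1.0))
# segmenter.fit(feature_dicts, boundary_labels); at test time, predict a
# boundary / no-boundary label per word position, then cut the sentence
# into EDUs at the predicted boundaries.
```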
And on the instructional corpus the parsing improvement with gold segmentation is even higher on all three metrics. Now let's see the results for discourse segmentation. So here you can see on the RST-DT test set, our logistic regression-based discourse segmenter outperforms HILDA and SPADE by a wide margin. And our results are comparable to Fisher and Roark's result on the same test set, but we are using fewer features, so it's more efficient in terms of time. And when you look at the 10-fold cross-validation, we get similar results. On the instructional corpus, our model outperforms SPADE by a wide margin. Now if you compare the results between these two corpora, you can see that there's a substantial decrease on the instructional corpus. It may be due to inaccuracies in the syntactic parser that we are using and also maybe the tagger that we are using for features. Now let's see the end-to-end evaluation of our system, that is, where the parser is given the automatic segmentation. So here are the results. On the RST-DT test set our model outperforms SPADE by a wide margin, and we get similar results when we do the 10-fold cross-validation. And on the instructional corpus, there is a substantial decrease. But we cannot compare with Subba and Di-Eugenio because they do not report their results based on automatic segmentation. However, if you compare the results between these two corpora, you can see that the substantial decrease is due to the inaccuracies in the segmentation. So we have also performed an error analysis on relation labeling, which is the hardest task. So here's the confusion matrix for the discourse relations. So the errors can be explained by two [inaudible]. One is that the most frequent relations get confused with the less frequent ones. For example, here elaboration is confused with summary. And the other one is when two discourse relations are semantically similar, they tend to be confused with each other. So we need techniques like bagging to deal with this imbalanced distribution of the relations, and we need a richer semantic representation like subjectivity or compositional semantics. Okay. So to conclude, we have presented a discriminative framework for sentence-level discourse analysis. Our discourse parsing model is discriminative. We capture the structure and label jointly. We capture the sequential and hierarchical dependencies and, furthermore, our model supports an optimal parsing algorithm. And we have shown that our approach outperforms the state-of-the-art by a wide margin. Now in future work we would like to extend this to multi-sentential text, apply it to asynchronous conversations like blogs and e-mails, and we'd like to investigate whether segmentation and parsing can be done jointly or not. So, that's it. Thanks. [ Applause ] >> : Can you go back to your slides, around 12, 13, where you show the parser? >> Shafiq Joty: Yes. >> : Go back, back. So I just want to know. Go back, back, back. Back. Yeah. Right here. So when you're doing the process, when you build up layer by layer, level by level, do you make all possible adjacent... >> Shafiq Joty: Yeah. >> : ...[inaudible]? >> Shafiq Joty: Yeah. >> : And then you compute the score that determines whether you want to merge it or not? >> Shafiq Joty: I take all possible sequences. So here at level 4 we have two possible sequences. This one is... >> : I see. >> Shafiq Joty: ...one. This is the second one. >> : Yeah, but that's [inaudible]. I mean you have one through four, one through five and... >> Shafiq Joty: Yeah, but...
>> : ...[inaudible] determine... >> Shafiq Joty: Yeah, but... >> : ...[inaudible]... >> Shafiq Joty: But the spans can be just -- Only the adjacent spans can be connected. >> : Oh, I see. Okay. >> Shafiq Joty: So here you can see one, two, three. >> : Okay. >> Shafiq Joty: Yeah. >> : Okay, so now the score is based upon the posterior probability... >> Shafiq Joty: Yeah, the posterior probability. >> : ...[inaudible]. So that looks very, very similar to this recent work on parsing [inaudible] done by the Stanford group which is... >> Shafiq Joty: The feature-based CRF of... >> : No, not feature. It's neural network-based. >> Shafiq Joty: Oh. >> : It's [inaudible]. >> Shafiq Joty: Oh. >> : They're actually sure -- I think the difference really lies in how you compute... >> Shafiq Joty: The posterior? >> : ...the score that determines whether you want to merge it or not. And that's actually the most crucial one. >> Shafiq Joty: We were not aware of that work, though. >> : You're not aware of that, okay. >> Shafiq Joty: Yeah. >> : We can follow up with that afterwards. That'd be great. One more quick question? >> : Yeah, so I'm not very familiar with RST and I'm curious how it handles spans that are not continuous. For example, "This talk which covers work on RST was just ended," or something where you have elaboration... >> Shafiq Joty: Which one? [Inaudible]... >> : So I can think of in language... >> Shafiq Joty: Yeah. >> : ...some spans are not continuous. >> Shafiq Joty: Yeah. Yeah. So there is like a graph theory of discourse that posits that the discourse [inaudible] should be a graph, not a tree. So yeah, there is recent work on this. But there is no computational [inaudible] that can like [inaudible]. >> : But in this case the spans will be defined on a... >> Shafiq Joty: Yeah, like here... >> : The EDU's will be defined... >> Shafiq Joty: Yeah, the EDUs can be separated, not necessarily adjacent. Yeah. Thanks. [ Applause ] >> Will Lewis: So for our last speaker, Ryan will tell us about measuring the divergence of dependency structures cross-linguistically to improve syntactic projection algorithms. [Inaudible]... >> Ryan Georgi: All right. So that's a bit of a mouthful. So this is work that I did with Fei Xia and Will Lewis of UW and MSR. We'll just go right into it. So right now there are thousands of languages in the world, and most of those don't have a lot of resources. Most of NLP is focused on a handful that do have large annotated corpora. The cost of creating new corpora for most of the world's languages is a pretty limiting factor. So there is some previous work where we can project annotation from a language that does have the resources available to a language without those. But that annotation is generally of limited quality due to the differences between the languages. And it's not easy to tell beforehand the potential that the projection will have from one language to another without specifically knowing facts about those languages and how they differ. So in this work we want to look at the impact of common divergence types, differences between the languages, how that affects projection accuracy, and hopefully improve it. So I'm going to start with a review of the previous work on projection and linguistic divergence. Then I'm going to take a look at how we detect and measure that divergence, and then the experimental results, and conclude. So we used dependency structures for this task.
They abstract away a little bit from some of the issues that phrase structures might have, where it's order-sensitive and has internal representation nodes that might not map so well. This kind of distills it down to the basic structures so that we can really see the core differences. So just a quick run-through of basic projection algorithms, where we start with a bitext, in this case English and Hindi. First we get the word alignments. Then we go ahead and run a monolingual parser on the English side and get an English dependency structure. Then we just use those word alignments and make a pseudo-Hindi dependency tree, assuming that the structure resembles what we see on the English side. So projection allows us to take annotation from a resource-rich language and project it to one that has none, using the word alignment to bootstrap it. It's got the obvious advantage that we can do this with parallel text if we have the resources available for the language on one side of the bitext. Unfortunately this relies on what I'll call the Direct Correspondence Assumption: that when a word is aligned between one language and the other, essentially they have the same syntactic and semantic meanings. Obviously that's not always the case, so linguistic divergence covers cases where the direct correspondence assumption is violated. So Dorr's '94 paper outlines a number of types of divergence that can occur between languages. I don't have time to go into all of them, so here's just one, demotional divergence. So in this case the "like" that's the head of the English dependency tree here gets demoted as it categorically shifts with the head of the German sentence, which is "eat," and becomes a different category. That noun tag on there is actually kind of wrong; it should be adverbial. And we get a mismatch between the dependency structures. So Dorr's divergence types are handy, but they really require language-dependent knowledge. You really have to know that that's something that occurs in German versus English and write that into whatever you're doing. So we'd like to see if we can discern the frequencies and the types of divergence that happen between language pairs and see if we can do that programmatically, with minimal knowledge about the language pairs. So the first thing that we have to figure out is to define what we mean by similarity. When you're looking at two trees, most tree similarity metrics assume that you're looking at two representations of the same string. When we're looking across languages, we're actually going to have a different number of terminals, so those metrics may not necessarily come out to match even though they're representing the same thing. So in this case we want to count the number of matching edges between trees as made by the word alignments. So the similarity for a tree pair, s and t here, is the percentage of edges in s that match in t, as defined by those aligned words, where the children are aligned and the parents of those children are aligned. The similarity for the whole treebank is defined just as the percentage over all the trees. So we have a number of different alignment types that we see frequently when we're looking through the trees. We have the merged alignment, where multiple words on one side map to the same word on the other. There's a swapped alignment, where a child and a parent on one side map to an inverse relationship, like in the German example.
And then we simply have the case with no alignment, where we may have a word aligned with another word but something, a spontaneous intervening node, breaks what would otherwise be a match. So we defined some operations that correspond to those different alignment types. We have a merge operation where, if we have multiply aligned words, we simply combine them and promote the children of the child node to be children of the new merged node. We have the swap operation, where we swap the child with the parent and move the children along with it. And we have the remove operation, which is essentially like the merge operation except we don't rename the new merged node. So with these operations we can go about calculating the divergence between the treebanks. For each of those alignment types we perform the corresponding operation, and before and after we calculate the edge match percentage to see what impact applying that operation has. And as we go all the way through, we eventually end up with two trees that should resemble each other maximally after taking into account all of those divergences. So just to walk through what that looks like. First we start off with a bitext. In this case it's actually interlinear glossed text. It's not shown here, but there's a one-to-one correspondence between the source language, which is Hindi, and the English gloss line, so each one of these tokens aligns directly with the one beneath it. Then we can get the word alignment by matching the gloss with the word in the translation. So we end up with trees (these are actually the gold trees) that look something like this. As you can see, there's a significant difference between the English "caused" plus "given" and the Hindi "give" with a causative marker on it. So first we identify the words that are spontaneous, that weren't shown in that alignment, and remove those. Then we still have a multiple alignment, the "caused" plus "given" versus the causative "give", and we want to merge that. And in between each one of these steps, as we remove these nodes, the percentage of the match between the trees increases because we no longer have those edges that are unaligned. And as we merge these two nodes we lose an edge that wasn't matching previously, between "book" and "cause" on this side and what would be "book" and "cause" on the other side. So we did this on four language pairs: English-Hindi, English-Swedish, English-German and German-Swedish. We started off with the baseline, before applying any operation, of how many matches there were in the treebank. Then we removed every spontaneous word from the treebank to see what the match looks like then. Then we applied merge, and then swap. One thing I should point out, too, is that there are two numbers listed, one for English to Hindi and one for Hindi to English, because the denominator in the percentage is the number of edges in the tree for that language. So it's actually an asymmetrical measure. They're generally similar-ish, especially after we've removed spontaneous words, but, yeah, it is asymmetrical. They all show pretty similar jumps after removal, after merge, after swap. One interesting thing is that Hindi at the very end is at a 90% match, while all the other language pairs are at 78 to 80%. So this first table here is actually from the Hindi-English treebank, and it was actually guideline sentences.
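A minimal Python sketch of the procedure just described, to make the edge-match measure and the three operations concrete. The tree representation (a dictionary mapping each token index to its head index, with 0 for the root), the alignment format (a set of source-target index pairs), and all function names are assumptions for illustration, not the authors' implementation; in particular, how siblings are handled during a swap is a simplifying guess.

def edge_match(src_heads, tgt_heads, alignment):
    """Fraction of source edges (child -> head) whose aligned counterparts
    also form an edge in the target tree."""
    matched = 0
    for child, head in src_heads.items():
        tgt_children = {t for s, t in alignment if s == child}
        if head == 0:
            # A root attachment matches if some aligned token is the target root.
            if any(tgt_heads.get(c) == 0 for c in tgt_children):
                matched += 1
            continue
        tgt_parents = {t for s, t in alignment if s == head}
        # The edge matches if some aligned child attaches to some aligned head.
        if any(tgt_heads.get(c) in tgt_parents for c in tgt_children):
            matched += 1
    return matched / len(src_heads) if src_heads else 0.0

def remove_node(heads, node):
    """Remove a spontaneous (unaligned) node, reattaching its children
    to the removed node's former head."""
    parent = heads.pop(node)
    for child in heads:
        if heads[child] == node:
            heads[child] = parent

def merge_nodes(heads, child, parent):
    """Collapse a multiply aligned child into its parent. Structurally the
    same as removal (labels, which would be combined, are not modeled here)."""
    assert heads[child] == parent
    remove_node(heads, child)

def swap_nodes(heads, child, parent):
    """Swap a child with its parent: the child takes over the parent's
    attachment and the parent attaches to the child; the child's own
    dependents move with it automatically in this representation."""
    assert heads[child] == parent
    heads[child] = heads[parent]
    heads[parent] = child

# Toy before-and-after measurement: a three-token source tree against a
# two-token target, where source token 3 is spontaneous (unaligned).
src = {1: 2, 2: 0, 3: 2}
tgt = {1: 2, 2: 0}
align = {(1, 1), (2, 2)}
print(edge_match(src, tgt, align))   # 2/3: the unaligned edge cannot match
remove_node(src, 3)                  # drop the spontaneous node
print(edge_match(src, tgt, align))   # 1.0 after removal

Dividing by the number of edges in the source tree is what makes the measure asymmetric, which is why two numbers are reported per language pair; removing or merging nodes also changes that denominator, a point that comes up again in the questions below.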
Those are very clean, clear translations. The other three tables here are from the SMULTRON treebank, which has, again, supervised parallel trees, but the sentences are much looser translations of one another. So after applying our operations, it still seems that there is quite a bit of a gap to be closed. So we drilled down a little deeper, trying to get a breakdown of the operations as they applied to different part-of-speech tags. In English and Hindi we see that verbs collapse with other verbs, like in the "give"-"caused" example, especially with auxiliary verbs, which Hindi tends to have as separate words. In English and German we see that nouns collapse pretty frequently with other nouns, as German does a lot of compounding that English doesn't. For swaps, there's a lot of head-changing in the verbs in Hindi, so the English verbs will often change heads with the aligned verbs in Hindi. And then again in German some of the compounding between the treebanks causes nouns to swap position. And then finally, removals: determiners tended to be the most spontaneous between English and Hindi, English and German had other determiners that just didn't match, and, yeah, there are some interesting German-specific prepositions that got removed in the German case. So moving on. Just to review: we defined three operations that capture common divergence patterns and measured the effect of those operations on the four language pairs. What we want to do as future work is perform these same tests on a much broader selection of languages. Ideally, since we have interlinear text in ODIN, which is a lot like the interlinear text we used for Hindi, we should be able to run this on upwards of a hundred languages. We can try learning the high-frequency transformations and applying those as rewrite rules to projected trees. Or, what we're working on right now is actually using the inferred patterns to inform a dependency parser that is trained across parallel treebanks and then fed parallel text. And, yeah, I've got quite a bit of time for questions. [ Applause ] >> : Hi. It was a good talk, thanks. I'm not [inaudible] to speak of Hindi, but from what I could see it looks a bit like artificial sentences. So... >> Ryan Georgi: Yeah. >> : So I was wondering if the numbers that you showed are indicative of the nature of the corpus? >> Ryan Georgi: Yeah, so these sentences are -- So interlinear text does tend to have a slight bias in that it is usually instructive sentences that are highlighting specific differences between the linguist's native language and the target language. So there can be a bit of a bias in those interlinear examples. So that does seem a little artificial to you? Yeah? I mean, in the ODIN database, for some languages it's got thousands of IGTs; for some it's only got twenty or thirty. We have seen that when you get enough examples the stiltedness of a couple of the sentences will tend to average out as the linguists look at different factors of the language. But, yeah, some of the individual examples can be a little biased for instructive purposes. >> : Okay. But do you know the way this corpus was collected? Like [inaudible]? >> Ryan Georgi: Oh, yeah. So these sentences in the Hindi-English were actually guideline sentences for annotating the Hindi-Urdu treebank. >> : Okay. >> Ryan Georgi: Which, yeah, I believe I have the citation for down here. I think that's still in progress. Yeah, the one at the top there.
So these in particular are going to be super instructive so that the annotators can pick up the right thing to do. >> : Okay. Thanks. >> : So I just have a question about the results that you showed. So the previous slide, yes. After the remove and merge steps, how many edges are you basically knocking out of the tree? >> Ryan Georgi: Yeah. >> : And is there a trivial solution that gives you a tree with one node in it on both sides that gets 100% on all of these [inaudible]? >> Ryan Georgi: Yeah, so no, we never get that bad. The trees are pretty well -- So the alignments that we actually have were done first statistically and then corrected by hand, so they're basically hand alignments. And most of the stuff is in there. I don't have the numbers up here, but I think from beginning to end it was roughly a one-third reduction in the edges. >> : Okay. >> Ryan Georgi: From... >> : And these numbers are -- basically the denominator is changing here, right, when... >> Ryan Georgi: Yeah. >> : ...you remove [inaudible]. Okay. >> Ryan Georgi: Yeah, so it goes from -- Yeah, exactly. >> : Okay. Okay, no, that's fine. >> Ryan Georgi: Yeah, yeah. >> : I got the idea. Thanks. >> Ryan Georgi: Yeah, occasionally there can be some cases where the numerator will change as we merge something that was also aligned, but the denominator will decrease at the same time. >> : Sure. Okay. >> Ryan Georgi: Yeah. >> : I have two quick comments. >> Ryan Georgi: Yeah. >> : First, so I think it's the next slide where you break it down by part of speech. I'd also be interested in a breakdown at the lexical level. There might be specific verbs... >> Ryan Georgi: Yeah. >> : ...like... >> Ryan Georgi: Yeah. >> : ..."like" versus [inaudible]. >> Ryan Georgi: Yeah, yeah. >> : ...that might be accounting for a large percentage of... >> Ryan Georgi: Yeah. >> : ...what actually happens. So it'd be interesting to see whether there are just a few examples, a few specific constructions, that account for everything, or if it's a wider divergence between the two languages. >> Ryan Georgi: Yeah, yeah. So with the Hindi examples in particular -- and, again, I'm not a Hindi speaker, so I didn't know this -- the case markings in a lot of cases are separate words that aren't mapping onto English, and that, I'm guessing, is a pretty closed class where we would probably see the same repetition over and over again. So, yeah, lexically it would probably be an [inaudible]. Yeah, good point. >> : And also, to point you to work by Lori Levin and Alon Lavie -- they didn't do dependency parsing, it was syntactic parsing, but it was very similar, trying to project the two trees onto each other. >> Ryan Georgi: Yeah. >> : So I'll give you that reference later. >> Ryan Georgi: Cool. Thanks. [ Applause ] [ Silence ]