>> Emily Prud'hommeaux: So my name is Emily Prud'hommeaux. I'm a graduate student at the Center for Spoken Language Understanding at Oregon Health and Science University, and I work with Brian Roark. Today I'll be talking about graph-based alignment of narratives for automated neurological assessment.

So to motivate the problem a little bit, I'd like to point out that neuropsychological exams often rely on spoken responses elicited by stimuli. For instance, you might be asked to name as many vegetables as you can in a minute, or you might listen to a narrative and have to retell it. And, of course, these are usually part of normed instruments, so there are standardized scoring procedures that are usually carried out on the responses. But you can also think of the responses as a source of language data that could be analyzed for other kinds of diagnostic markers.

So I'll be talking about narratives today. There are different kinds of narratives that can be elicited: personal narratives; narrative generation, where you narrate a picture book or a story from a series of pictures or a cartoon or video clip; or, one of the most common, the subject listens to a story and has to retell the story to the examiner. That's usually scored in terms of how many story elements were used. So there's information in the original story, and subjects are graded on how well they replicated that information. Now, seniors with Alzheimer's dementia produce fewer of these key story elements in their narrative retellings. And so the question here is whether we can use these narrative scores to detect very early stages of dementia. The earliest stage of dementia that's usually recognized is called mild cognitive impairment; I'll talk about that in a little more detail shortly. So can we identify MCI using these narrative scores, and can we do it automatically, because we're computer scientists and we like to do things automatically?

So the goal is to develop an objective, automated tool for detecting MCI that relies on scores derived from narrative retellings. And the way we're going to do this is to think of retelling as a kind of translation. Imagine that you hear the story and you translate it into your own language. It's the same language, but it's your own special idiolect of that language. So I'm going to use existing word alignment tools from machine translation to create machine-translation-style word alignments between retellings and a source narrative, improve those alignments using a graph-based method, and then extract scoring information from the alignments and use those scores as features for diagnostic classification.

So a little overview of the data. I mentioned mild cognitive impairment. It's characterized by impairments in cognition that are significant but don't interfere with your daily life: you can still drive your car and balance your checkbook, you know who your grandchildren are, and all those things. But it is clinically significant. It's real. It's happening. Your cognition is starting to decline. But because it's so subtle, it's very hard to detect with something like the Mini-Mental State Exam, the instruments that are used to screen for dementia. MCI is diagnosed through a long interview by an expert with the patient and also with someone who can corroborate what the patient says, like a spouse or other family member.
And we're going to use the Clinical Dementia Rating, which is one of the techniques for diagnosing different levels of dementia, and it is one of these interview-based techniques. Typical aging is a Clinical Dementia Rating of zero. MCI is a Clinical Dementia Rating of 0.5; that's how we'll interpret it. Severe dementia would be something like a Clinical Dementia Rating of 3. And I'd like to point out that the diagnosis does not rely on any narrative recall task. So the narrative recall task is completely independent of the process by which MCI is diagnosed.

So this is our data. At OHSU, which is a medical school, we have something called the Layton Center for Aging and Alzheimer's Disease Research, and they have a very long NIA-funded longitudinal study on brain aging where people come in, they're given the Mini-Mental State Exam, and they do a bunch of activities and tasks and interviews. During this exam they're given the interview by which you can diagnose MCI. And so we have 72 subjects with MCI and 163 subjects without. There are no significant differences in age or years of education. There are also an additional 26 subjects who have more advanced dementia, or were too young to be eligible for this study, or had a diagnosis that changed back from MCI to typical aging, and we didn't want to include them in either diagnostic group since it's not clear which one they're actually in.

The narrative task we'll be talking about is the Wechsler Logical Memory subtest of the Wechsler Memory Scale. The Wechsler Memory Scale is a widely used instrument for assessing memory in adults. It's been used for 70 years, and the story that we'll be talking about has actually been used for 70 years as well. So the examiner reads a story to the subject, and the subject has to retell that story immediately and then after a 20-minute delay. The score is how many story elements were used in each retelling. The identities of the story elements are noted, but they're not used as part of the score for the Wechsler Memory Scale. This is the narrative itself, and you can see that the slashes denote the boundaries between elements. So there are 25 story elements. This is a sample retelling, and the underlined items are the correctly recalled elements. So you can see that Ann is a correctly recalled version of Anna, but Taylor is not a correctly recalled version of Thompson. That's because the published scoring guidelines give pretty explicit directions about what are and are not acceptable lexical substitutions and what sorts of paraphrases are acceptable. This person gets a score of 12 out of 25.

So I mentioned that I'm thinking of retelling as a translation and that I want to do word alignment. So instead of translating from German to English, I want to translate from the story to the way someone rendered that story. The way statistical machine translation typically works is you begin with a parallel corpus of sentences, where you have sentences in one language on one side of the corpus and translations of those sentences on the other side in a different language. The idea is that you need to figure out which words are translations of which other words, and the way you do that is through word alignment. This could be easy, where you just monotonically go through and line them all up, or it could be really complicated, where there are lots of word order differences and things like that.
So you can't just use Levenshtein distance or something like that; you have to do something smarter. This is usually done using the IBM models, which were developed in the '90s when people first started getting interested in machine translation again after many years, and which use expectation maximization to figure out which words align with one another. And there are two word alignment packages here. Actually, Giza is widely used and the Berkeley Aligner is not as widely used, but I found that it gives more accurate alignments, and it lets you save out the model and align other data you didn't train on, which, if it's possible with Giza, it's not immediately obvious how you would do it. It also saves out the posterior probabilities for those alignments, which will be important for the graph-based method I'll be discussing. I'll use these alignments to extract scores from narrative retellings, which I'll then use for diagnostic classification.

So I said you needed a parallel corpus, and I have three parallel corpora I've created just from the retelling data that I have, just people retelling the story. The first one is small, the second one is a little bit bigger, and the third one is huge. The first one is the source-to-retelling corpus: all I have on this side is the source narrative, and all I have on that side are the retellings. This is about 500 or so lines long, because that's how many retellings we have. Corpus two is a word identity corpus. We're just saying the word cook should align to the word cook, and because we're doing monolingual alignment, this is a good assumption to make: if you see a word on one side and you see the same word on the other side, there is some high probability that they're going to align to each other. The third corpus, the huge corpus, is a retelling-to-retelling corpus, so every pair of retellings. This is 500 squared, 250,000 lines.

So what I do is I use the Berkeley Aligner to build two models. The small model is built on just corpus one and corpus two, the retelling-to-source and the word identities. The large model is built on all three corpora. The difference is that it takes like two or three minutes to build the first one and like 12 to 24 hours to build the second one. Very, very big difference in the time required to build these kinds of models. So I'm going to test both models on the two retellings for each of the experimental subjects so we can see how well they align to one another. And then I'm going to use both models, because you can save out the models with the Berkeley Aligner, to align every retelling to every other retelling. I'll be using that in my graph-based model.

So these are the -- I'm just looking at time. Okay. So these are the results of the alignment: the precision, recall, and alignment error rate. Alignment error rate is like word error rate; it's sort of a measure of the precision and recall of how many alignment pairs found in the proposed alignment existed in the gold manual alignment. You can see that you get a very large reduction in AER as you move from the small model to the big model, almost four points, which is a very, very large improvement for word alignment. But those numbers are still pretty high. State of the art for the Berkeley Aligner on the Euro [inaudible] corpus is around four, and we're at 20. So we can do a lot better.
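To make the metric concrete, here is a minimal sketch of how precision, recall, and alignment error rate could be computed from proposed and gold alignment links, assuming the simple case where every gold link is a "sure" link (the full AER definition also allows "possible" links). The link pairs below are made up for illustration.

```python
# Minimal sketch: precision, recall, and alignment error rate (AER) for a
# proposed word alignment against a gold alignment.  Assumes every gold link
# is "sure"; with sure/possible links the standard AER formula distinguishes
# the two sets.

def alignment_metrics(proposed, gold):
    """proposed, gold: sets of (source_index, retelling_index) link pairs."""
    hits = proposed & gold
    precision = len(hits) / len(proposed) if proposed else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    # With only sure links, AER reduces to 1 - F1.
    aer = 1.0 - (2 * len(hits)) / (len(proposed) + len(gold))
    return precision, recall, aer

# Toy example with made-up word indices.
proposed = {(0, 0), (1, 2), (3, 3)}
gold = {(0, 0), (1, 1), (3, 3)}
print(alignment_metrics(proposed, gold))   # (0.666..., 0.666..., 0.333...)
```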
So the idea we had was to use a graph-based method that uses random walks on graphs. You probably all know about PageRank, Google's way of ranking Web pages. You build a graph where the nodes are Web pages and the edges are the hyperlinks between those Web pages, and if you were to just walk around on a graph created like that, the nodes you end up on most would be the more prestigious nodes, the more important nodes. LexRank is a similar way of ranking sentences for automatic summarization. For our word alignment, the nodes are words in the retellings and the source, and the edges are the normalized posterior-weighted alignments proposed by the Berkeley Aligner. We know the alignments and what their posterior probabilities were.

Imagine you had these four sentences. The source is on the bottom; these are sentences from our subjects. You can see the boldface words should all be aligning to touched. But suppose they didn't. Suppose that when you did your alignment of just the source to the retellings, you only got moved aligning to touched and sorry aligning to touched. Then you align every retelling to every other retelling, and you uncover the relationship that moved aligns with sympathetic and sorry aligns with sympathetic. You want to get from sympathetic to touched, and you couldn't do that in your original alignment. But once you build this graph and start walking around on it, you can start at sympathetic, go to moved, and from there go to touched. So the idea is that it creates a connection between words that were maybe unaligned in your original alignment, which you can now uncover by virtue of the relationships that word has with other words in other retellings.

So we build a graph using the alignments and posteriors generated by the Berkeley Aligner, and this is the way the walk works. You start at a node, which is a retelling word. With some probability you move to a word in another retelling, and with some other probability you walk to a source word and you stop. You do this a thousand times for every retelling word, and the destination source word you end up at most often is your new alignment. So at the end of a thousand walks you have a distribution over which source word you ended up on, and you pick the most frequent one: that's your new alignment. You can tune the value of this lambda on those 26 ineligible participants; they can serve as a dev set for this.

I do this for the alignments from the small and the large model. We can see here that if you take the small model and apply the graph-based method to the alignments it proposed, you actually get over a four-point reduction in AER. So it's a larger reduction than you get just by moving to the large model, and keep in mind that the large model takes 12 to 24 hours while this graph-based method takes three minutes tops. It's very fast. So you're getting the same benefit while requiring very, very few computational resources. We also see that the graph-based models outperform their correspondingly sized models as well. So this bodes very well for using graph-based models.
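Here is a minimal sketch of that random-walk realignment step, under assumptions of my own: `retell_edges` and `source_edges` are hypothetical dictionaries mapping a retelling-word node to lists of `(neighbor, probability)` pairs built from the aligner's normalized posteriors, and `lam` stands in for the tuned lambda. It illustrates the walk described above rather than reproducing the actual implementation.

```python
import random
from collections import Counter

def realign_word(start, retell_edges, source_edges, lam=0.5,
                 n_walks=1000, max_steps=20):
    """Return the source word reached most often by random walks from `start`."""
    destinations = Counter()
    for _ in range(n_walks):
        node = start
        for _ in range(max_steps):
            if source_edges.get(node) and random.random() > lam:
                # Step to a source word and end the walk there.
                words, probs = zip(*source_edges[node])
                destinations[random.choices(words, weights=probs)[0]] += 1
                break
            if not retell_edges.get(node):
                break  # dead end: no retelling neighbors to walk through
            # Otherwise step to a word in another retelling and keep walking.
            words, probs = zip(*retell_edges[node])
            node = random.choices(words, weights=probs)[0]
    # The most frequently reached source word becomes the new alignment.
    return destinations.most_common(1)[0][0] if destinations else None
```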
So now I want to extract scores from those alignments, and I'll explain how I do it. This is the narrative; the 25 elements are labeled with letters of the alphabet. And this is a word alignment that the Berkeley Aligner proposed. So what we do is look at a word from the retelling -- we got rid of the function words, because who cares about them -- see which source word it aligns to, and then see where that source word appears in the original narrative. It appears in element A, so this person got element A. Taylor aligns to Thompson; where does Thompson appear? In element B. Worked aligns to employed: element E, and so on. That's how we get the score: for every element, we know whether they got it or not.

So I evaluated the scores you can extract from this, and actually under all the models the agreement is very, very high. Here I'm giving you Cohen's kappa, the measure of interannotator agreement, and this is actually within the range of human interannotator agreement on this task. So this is a computer performing as well as a person on the task of scoring this test. In addition, we see that the models with the lower alignment error rate produce the higher scoring agreement, which we're glad to know.

So now we want to use these scores for diagnostic classification. What we do is extract scores from each subject's two retellings, and we use a support vector machine to classify the subjects. We have two feature sets. One feature set is just the summary score: for each retelling, zero to 25, how many elements did they get. This is the score that's reported as part of the Wechsler Memory Scale. The second feature set is the 50-element score: for each retelling there are 25 elements, so we create a vector of 50 scores, each being 0 or 1 depending on whether they got that element correct. And then we evaluate this in terms of area under the receiver operating characteristic curve, using cross-validation. With AUC, 0.5 is chance and 1 is perfect, so anything greater than 0.5 is good, and the closer it is to one the better.

So we can see here that the summary score feature -- the summary score is what's normally reported -- does pretty well; it gets close to about .75 for most of the models. But the element score features are much, much better, and they're getting very high classification accuracy. Another thing to note is that the Clinical Dementia Rating has a reliability at about that level, so you could sort of say that this technique is working as well as humans are at actually distinguishing mild cognitive impairment. And again we see that the large models and the graph-based models are stronger than the ones that are small and not graph-based.
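As a sketch of the scoring and classification steps just described, the following uses scikit-learn with placeholder data: `element_of`, the toy alignment, and the random feature matrix are hypothetical stand-ins, and the real evaluation may use a different cross-validation scheme.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

ELEMENTS = [chr(ord('A') + i) for i in range(25)]   # story elements A..Y

def element_vector(alignment, element_of):
    """alignment: (retelling_word, source_position) links; element_of: position -> element."""
    recalled = {element_of[pos] for _, pos in alignment}
    return np.array([1 if e in recalled else 0 for e in ELEMENTS])

# Tiny made-up example: two aligned words hitting elements A and E.
toy_alignment = [("ann", 0), ("worked", 5)]
toy_elements = {0: 'A', 5: 'E'}
print(element_vector(toy_alignment, toy_elements).sum())   # 2 elements recalled

# Placeholder data standing in for the 50-dimensional element scores
# (immediate + delayed retelling) and the MCI / typical-aging labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(235, 50))
y = rng.integers(0, 2, size=235)
scores = cross_val_predict(SVC(), X, y, cv=10, method='decision_function')
print('AUC:', roc_auc_score(y, scores))
```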
So can I finish? It says stop. All right. Well, you're on your own for the summary, people. The methods outlined here show potential as a screening tool for neurological disorders, because it's not just this test that's widely used: there's another test called the NEPSY Narrative Memory that's used for kids, and there are picture description tasks, so it's easy to adapt to other scenarios. The other good thing was that the graph-based methods yielded large alignment error rate improvements without requiring the extensive computational resources that scaling up to a really large model would. As for other improvements: the graph-based model right now does just one-to-one alignment, because I just pick the single most frequent destination node over the distribution, but there are plenty of one-to-many alignments in our data, so that's something I would like to look into. I'd also like to look into using undirected links and allowing links out of the source, where I feel you could be exploring the graph a little bit more than I am. And I'd like to apply the technique to other tasks, like the NEPSY Narrative Memory task I mentioned and the cookie theft picture description, where you look at a picture of a kid stealing cookies from a cookie jar and have to describe it.

I'm working right now on incorporating speech recognition into the pipeline with Miter, who is going to be talking about something similar in her poster, and I also want to try to apply the graph-based method to multilingual word alignment. [inaudible] has lots and lots of languages; it seems like something like this might be able to be used to improve word alignment in some way. So that's all.

[applause]

>>: Questions?

>> Darlene Wong: Hi. Hello.

>>: Hi. Sorry, so what is the accuracy of the alignments if you just map the same word to the same word?

>> Emily Prud'hommeaux: It's like 40. Maybe 30 or 40. The thing is, it should be good, because the probability that a word aligns to the identical word is quite high, like 60 percent. But the problem is that there are multiple instances of words, so you have to decide which word you're going to align it to. And I think that's where the inaccuracy comes from.

>>: You talked about kids. Were you talking specifically about the Wechsler test for kids or other tests for kids, and what's the neurological impairment, because I'm assuming it's not dementia at that point.

>> Emily Prud'hommeaux: Very early onset dementia, it's a really terrible problem. So the test I was specifically talking about was the NEPSY Narrative Memory test. The NEPSY is a huge battery of things that tests not just memory but language skills and executive function and all of those different things. And actually I've already applied these methods to that, and the results are not quite as compelling, but we don't have as large a dataset, and they're pretty good. But also, if you're familiar with autism at all, there's an instrument called the Autism Diagnostic Observation Schedule, which is a series of semi-structured activities. One is the wordless picture book, where the kid and the examiner together narrate a picture book. This is a technique that could probably be applied to something like that as well. The neurological impairments we're interested in are autism and language impairment; performance on the NEPSY is tied to language impairment.

>>: Hi. One more question. So were you doing the alignment on the entire retelling -- so it wasn't sentence-based and then matched?

>> Emily Prud'hommeaux: It wasn't sentence-based. It was the full retelling to the full source narrative, which is weird. That's not what you would do in machine translation. In machine translation you would have the sentences aligned, but because we don't know in advance which parts of the story the adults are going to remember, you can't do that -- you'd have to do a sentence alignment first. And different story elements might appear in one sentence of the retelling that appeared in two different sentences in the source. So we just put it all together.

>>: Because from the point of view of a source of error, the IBM models are kind of designed to work on sentences, and I'm imagining you're doing an alignment with much longer dependencies.

>> Emily Prud'hommeaux: Yes, that is almost certainly one of the many reasons why the alignment error rates are much higher than they would be for machine translation.

>>: Thank you very much.

[applause]

>> Congle Zhang: Hello, everyone. My name is Congle. This work was done together with Raphael Hoffmann and Dan Weld. Dan is my advisor, and I'm very glad to come here to talk with you about this work.
So as you may know, relation extraction is a very important part of natural language processing and artificial intelligence. I hope this simple example can help you understand the goal of relation extraction. Suppose we have a raw sentence, like "Our captain Jenali Jenkins is a phenomenal athlete, said the Gators' head coach Urban Meyer." After a human being reads this sentence, he can get some interesting facts, like Jenali Jenkins plays for the Gators football team and Urban Meyer coaches that team. So the question is: can machines do the same thing as human beings, get these facts and put them into the machine?

So formally, relation extraction is a problem where we have an ontology with a set of relations with type signatures. For example, an athlete is playing for a team: the relation plays-for-team has the type signature athlete and team. We are interested in the facts of this relation, so we want an extractor whose input is some raw sentences and whose output is tuples satisfying the relations you are interested in. Suppose the instance you extracted from the raw sentence, for example Jenali Jenkins and Urban Meyer, does not exist in your knowledge base. You can add this instance back to your knowledge base, add it back to your ontology, and your ontology becomes larger, better, and more useful. So that's the goal of relation extraction, and that's the first step toward building a better knowledge base to use for other tasks.

Okay. At first glance you may think that supervised learning is the best way to do this task. In order to build a classifier for the extractor, what you need is a lot of training sentences. For each training sentence you need to figure out the entities in the sentence and their relationships. For example, Ray Allen and Doc Rivers satisfy the relation coached-by, so you label that a positive example. And YouTube and Google -- you know that Google acquired YouTube, actually, but a sentence like "YouTube API and Google code" says nothing about the acquiring relationship, so you have to label this a negative example.

Okay. Supervised learning is good, but what's the problem? The problem is it cannot scale easily. Let's see an example: the ACE 2005 dataset contains only about 1500 articles. And the reason supervised learning is hard -- actually, the problem is that it is not only hard but almost impossible -- is that the positive data is very skewed. Most sentences do not contain any interesting relationships from your ontology, from your small set of relations. Here is a dataset where the ratio of positive sentences is less than 2 percent for the top 50 relations in Freebase, which means that if you ask a human labeler to label your sentences, they will meet one positive sentence after 50 negative ones. I don't think many people have that patience, to label enough data for this task.

To avoid this kind of sentence labeling, researchers proposed weak supervision to leverage relation instances. The idea is that it's hard to get labeled sentences, but it's easier to get labeled relation instances. For example, we know that the Gators and Urban Meyer satisfy the relation coached-by, and we know Google and YouTube satisfy the relation acquired. And it's easy to get a list of unlabeled sentences from wherever you can imagine.
The clever part of weak supervision is that you heuristically generate training examples by matching the instances against the unlabeled sentences and treating every sentence that contains such a pair of entities as a training example for your classifier, for your extractor. Of course, this brings some noise into your extractor, but since the number of unlabeled sentences is so huge, you can do a lot of machine learning on this kind of data.

Okay. Life is good, until we ask a question: what if the training instances are also few? The previous work on weak supervision avoided this problem by only looking at relations that already exist in a database, which means they can get large amounts of training instances essentially for free. But what if you want to define a relation by yourself, for some biology task, for question answering, for whatever you want? The relation may be very specific. You only have a few examples for it. You don't know which database to look at. You don't know where to look. So we want to solve this problem.

Our motivation is that there are some very large background ontologies on the Web, and you can leverage them. They contain millions of entities and thousands of relations. If we can build a connection between our target ontology, our target relations, and this background ontology, it's very likely that among these millions of entities a lot of entity pairs satisfy my target relations. If I can dig them out and use them as training instances, I can significantly increase the number of training instances for my task and do weak supervision very well. That's the idea; we call it ontological smoothing. So our goal: I have a relation and want training instances, and I generate those training instances from the background ontology. The method is to do ontology mapping from my target relation to the background knowledge base, to the background ontology.

Here is the overview of our system, which we call Velvet. The first step is to build a mapping from the background ontology to the target ontology. The second step is to generate new training instances and training sentences from the mapping and from the unlabeled sentences, and the third step is to train a relation extractor with this new data.

Okay. So what's the challenge in doing this job? There are two major challenges for our ontological smoothing idea. The first challenge is that there might not be an explicit mapping. You may ask: can I use a very naive approach, automatically pick a relation in the background knowledge, and return all its instances as what I need? The problem is that usually a good mapping is not explicit in your background knowledge. You need some database operators, like join, like union, like selection, to get what you want. Here is a simple example. In your target ontology you have a relation, coached-by, and it has an example: Kobe Bryant and Mike Brown. In your background knowledge there is no direct connection between Kobe Bryant and Mike Brown, but what you do know is that Kobe Bryant plays for the team LA Lakers and the coach of the LA Lakers is Mike Brown. As human beings, we can see that if we join plays-for-team and team-coach, we get a lot of tuples, and if I take the first argument and the third argument of those tuples, I get a lot of good training instances for the relation coached-by. That's what I want. But how can a machine do that?
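A small sketch of the kind of join view being described, with hypothetical relation tables; a machine that can search over such compositions is what the mapping step has to provide.

```python
# Minimal sketch (not the Velvet system itself): compose two background
# relations into candidate instances for the target relation coached-by.
plays_for_team = [("Kobe Bryant", "LA Lakers"), ("Ray Allen", "Boston Celtics")]
team_coach = [("LA Lakers", "Mike Brown"), ("Boston Celtics", "Doc Rivers")]

def join_project(rel_ab, rel_bc):
    """Join on the shared middle argument, keep the first and third arguments."""
    by_b = {}
    for a, b in rel_ab:
        by_b.setdefault(b, []).append(a)
    return {(a, c) for b, c in rel_bc for a in by_b.get(b, [])}

coached_by_candidates = join_project(plays_for_team, team_coach)
print(coached_by_candidates)
# e.g. {('Kobe Bryant', 'Mike Brown'), ('Ray Allen', 'Doc Rivers')}
```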
So besides join, union is also important, because the same relation can be split across different domains in your background knowledge, and the pieces may even have different names, so it's very hard to see their relationship at first glance. For example, team coaches are named head coach in the basketball domain and manager in the baseball domain, and what you want is to put them together in our system. So the output of our system is a view, a database view built with the operators join, union, and selection, and we build mappings from the target relation to a view over the background knowledge base.

The second challenge is that we do need to do the entity, type, and relation mapping as a whole, to do the mapping jointly. The reason is that entities are very ambiguous in the background knowledge. For example, you can see that there are many [inaudible] in the background knowledge: one is a basketball coach, one is a football player, another is a politician. Without context, how could you know which one is the guy your coached-by relation is talking about? But under the condition that coached-by is mapped from plays-for-team joined with team-coach, you have big confidence that the basketball coach Mike Brown is the guy you're looking for.

Okay. We handle these challenges in our system by breaking the problem down into two steps. The first step is to generate mapping candidates, and the second step is to choose among the mapping candidates by joint inference. We first put the background knowledge into a graph, where each node is an entity in the background knowledge and each edge is a relation in the knowledge. Then we look for the instance pairs of our target relation and return the paths between them as the relation mapping candidates, and we return the types of those nodes as the type mapping candidates. This kind of method sometimes has a problem: it's noisy. You can see that Kobe is born in the U.S. and Mike's nationality is also U.S.A. That is a path between the two arguments, so you'll return this noisy path as a candidate. So we need to assign likelihoods to these candidates in order to get the correct ones.
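Here is a rough sketch of that candidate-generation step: treat the background knowledge as a labeled graph and return short edge-label paths between the two arguments of a seed instance as relation-mapping candidates. The graph format, the path-length bound, and the toy facts are assumptions for illustration; note that it also returns the noisy born-in/nationality path, which is exactly what the joint inference described next has to filter out.

```python
from collections import deque

def candidate_paths(graph, pair, max_len=2):
    """graph[entity] -> list of (relation_label, neighbor); pair = (e1, e2)."""
    start, goal = pair
    results, queue = [], deque([(start, [])])
    while queue:
        node, path = queue.popleft()
        if node == goal and path:
            results.append(tuple(path))   # record the edge-label sequence
            continue
        if len(path) == max_len:
            continue
        for label, nbr in graph.get(node, []):
            queue.append((nbr, path + [label]))
    return results

kb = {"Kobe Bryant": [("plays_for_team", "LA Lakers"), ("born_in", "U.S.A.")],
      "LA Lakers": [("team_coach", "Mike Brown")],
      "U.S.A.": [("nationality_of", "Mike Brown")]}
print(candidate_paths(kb, ("Kobe Bryant", "Mike Brown")))
# [('plays_for_team', 'team_coach'), ('born_in', 'nationality_of')]
```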
So what we do is use a Markov logic network model, which is good for joint inference with evidence written in first-order logic. The probability of an event in Markov logic is given by the number of satisfied rules times their weights. In our work we have three kinds of predicates: one for entity mappings, one for relation mappings, and another for the type mappings. We write these predicates together in first-order logic rules; they are effectively features over our observations, and we conduct MAP inference to get truth values for the predicates. A predicate being true means the mapping is good. To do the MAP inference we simply cast the problem as an integer linear program, do LP relaxation, and round to get the result, which is a quite standard textbook method, so you can check it.

Okay. So here we are. We have the ontological mapping, and we have a lot of new training instances and examples. What we need now is a relation extractor, a weakly supervised method to do the relation extraction. We use Mozart [phonetic] in this project. It was developed by Raphael Hoffmann in our group, and to the best of my knowledge it might be the best off-the-shelf relation extractor you can find now. It scales very well, to datasets of millions of examples, so it's a useful tool.

Okay. So here's our experiment. We compare our system, Velvet, against three baselines, formed by taking the joint inference away, taking the complex mappings away, and taking the ontological smoothing away from the system. We run the experiment on two target ontologies, NELL and IC. NELL has 43 relations; IC has nine relations. We use Freebase, which contains something like 100 million facts, as our background ontology. And we do the experiment on unlabeled sentences from the New York Times, which contains millions of articles and something like 50 million sentences.

Here is our performance. You can see that without smoothing it's not surprising that performance is very bad, because the system is trained on only a dozen or so training examples. By putting complex mappings and ontological smoothing into the system the performance improves significantly, and Velvet is the best of all. The last slide showed figures averaged by instances; we can also average by relations. The result in the last slide may be a little too optimistic, because big relations are easier to handle than small relations. This is the average by relations, and we can again see that Velvet is much better than the baselines.

We also compare our method to current state-of-the-art supervised extraction approaches on that dataset, using two state-of-the-art supervised systems as comparison. But in our experiment our system uses very few training instances, only ten training instances per relation, and there's no sentence annotation. It's surprising that we can achieve comparable results to the state-of-the-art supervised methods. It's because we use a large amount of unlabeled data: they are trained on something like a thousand sample sentences, and we are using millions of unlabeled sentences. That's why we can get comparable performance.

Okay. We also evaluated the performance of the ontology mapping itself. We manually labeled some mapping results, and we achieve something like 88 percent accuracy on relation mapping and 93 on entity mapping. The entity mapping result is about 5 percent better than the baseline, which uses the Freebase internal search API to get entity mappings and is at about 88.

Okay. So to sum up: we noticed that the previous weak supervision doesn't scale very well if you have very few training instances. Our solution is to use background knowledge, a background ontology, and generate an ontology mapping that brings in a large number of new instances. It can enhance relation extraction performance very well. Here's some future work. For example, we're planning to bring in multiple ontologies, to do mapping to multiple ontologies. We're also interested in not only binary relations but N-ary relations. And for weak supervision there's a lot of space to improve: the data is very skewed, and the feature space is extremely high-dimensional, which sometimes makes the performance quite hard to improve. So there is some future work there. That's it.

[applause]

>>: Thank you very much. We have some time for questions.

>>: I wanted to ask you a question about N-ary relations, which you put as future work. But I was also thinking, do you really need N-ary relations, because maybe you can do everything with binary relations? So where do you see the benefit of dealing with N-ary relations?
>> Congle Zhang: So for N-ary relations, I think you can sometimes break the N-ary relation into several binary relations around some main entity. That's the simple way to do N-ary relations. But I'm not sure it's good for all situations. Maybe there are some cases where it will not work very well. For example, if many of the arguments are dates or numbers, I'm not sure it will work, because these arguments are related to each other, and if you treat each of them as a binary relation there may not be as much linguistic evidence to recover them. I don't know. Maybe that's the case.

>>: I might have missed this, but did you evaluate the precision of the relations you're generating from the smoothing, or are they sort of 100 percent correct? So you have some seed examples and then you get more from Freebase; are those like 100 percent --

>> Congle Zhang: I know what you mean: what's the precision of the examples? I didn't label that. Yeah, this is a good question. We only evaluated them by using them to train the extractor; I didn't sample the instances to check their quality directly. Good question.

>>: Makes sense.

>> Congle Zhang: We didn't do that.

>>: All right. Thank you very much. We'll move on to our next -- [applause] -- we'll move on to our last oral presentation. Next, Max Whitney.

>> Max Whitney: Okay. So I'm Max, and this is work with my supervisor Anoop Sarkar. The topic of our paper is bootstrapping, which is the case of semi-supervised learning where there's a single domain and a small amount of labeled data or seed rules.

Okay. So bootstrapping. In particular we're looking at the Yarowsky algorithm, which is a simple self-training algorithm with decision lists. We start with some seed decision list, we label the data, we train a new decision list, and repeat. A decision list looks like this. The running example here is word sense disambiguation for the word "sentence": one sense is a piece of text and one sense is a punishment. Here we have two seed rules. A rule is a score, a feature, and a label, a sense. And the decision list works just by choosing the highest-ranked rule that has a feature matching the example. So here, when we apply the seed rules to the data, the first two examples in the data can be labeled, because they have a feature that matches a rule, and the third one can't be, because it has no feature that matches.

So that's the first two steps. In the third step we train a new decision list. The scores come from statistics over the currently labeled data; it's basically co-occurrence with the previous rules. And you can see now we have more rules, so we can label more of the data. But we have a threshold: we only take good enough rules based on the score. So we're never guaranteed to be able to label all the data, just whatever we have features for. And we repeat. At the end of all of this we drop the threshold, make a decision list with no threshold, and then we can label all the data. We do that for testing.

This is sort of similar to EM, or hard EM, at least. But the difference is that we're training this decision list model, which contains type-level information, whereas in EM you have expected counts over the actual instances, at the token level.
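A minimal sketch of this self-training loop, assuming examples are represented as sets of features; the smoothed-precision scoring, the threshold, and the toy seed rules below are illustrative choices rather than the exact decision-list specification used in the paper.

```python
from collections import Counter

def label_with_dl(dl, example):
    """dl: list of (score, feature, label) rules, best first."""
    for score, feat, lab in dl:
        if feat in example:
            return lab
    return None

def train_dl(examples, labels, threshold=0.9, eps=0.1, n_labels=2):
    """Retrain rules from co-occurrence counts over currently labeled data."""
    feat_lab, feat_tot = Counter(), Counter()
    for ex, lab in zip(examples, labels):
        if lab is None:
            continue
        for f in ex:
            feat_lab[(f, lab)] += 1
            feat_tot[f] += 1
    dl = []
    for (f, lab), c in feat_lab.items():
        score = (c + eps) / (feat_tot[f] + n_labels * eps)   # smoothed precision
        if score >= threshold:                                # keep good-enough rules
            dl.append((score, f, lab))
    return sorted(dl, reverse=True)

examples = [{"read", "the", "sentence"}, {"prison", "sentence"}, {"sentence"}]
dl = [(0.99, "read", "text"), (0.99, "prison", "punishment")]   # toy seed rules
for _ in range(5):
    labels = [label_with_dl(dl, ex) for ex in examples]
    dl = train_dl(examples, labels) or dl
print(dl)
```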
So to examine the behavior of this in more detail, we're going to continue with the same example. The current decision list is on that side, and we're just showing the two senses as the two colors. The currently labeled training data is down there, also labeled by color. So in this graph, the left bar represents the decision list, the right bar represents the currently labeled training data, and up here we show the current test accuracy of the decision list. As we proceed, the decision list -- that's the left bar -- grows very rapidly, and we've now labeled, I believe, all of the training data. You can see both are kind of skewed towards the blue label, whichever sense that is. And now that we've labeled all the data, the accuracy kind of plateaus, and we converge there at about 60 percent accuracy.

So the next variation is Yarowsky-cautious, from Collins and Singer. I didn't say that we're using Collins and Singer for our particular specification of Yarowsky and Yarowsky-cautious. Here, being cautious means that when we make a decision list, we take only the top five rules by score, and each iteration we take five more. The decision list is now going to grow linearly, and you can see that whereas in the previous example the axis went up to 1800, now it only goes up to 400, because we're controlling the decision list. So as we run it, the decision list -- the left bar -- grows linearly, and it's balanced between the two labels now. The coverage and the accuracy grow more gradually, and we converge at a higher accuracy. So this is our non-cautious run and this is our cautious run. Both will change when we do a retraining step, where we drop the threshold and drop cautiousness and train a big decision list, but usually we see this behavior: cautious avoids getting stuck at lower accuracy.
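A sketch of that cautious selection step, reusing the `train_dl`, `label_with_dl`, `examples`, and `dl` names from the previous sketch (it is only runnable together with that block): keep just the top-k rules per label, and let k grow by five each iteration so the decision list grows linearly.

```python
def cautious_select(dl, k):
    """Keep only the k highest-scoring rules for each label."""
    kept = []
    for lab in {lab for _, _, lab in dl}:
        rules = sorted([r for r in dl if r[2] == lab], reverse=True)
        kept.extend(rules[:k])
    return sorted(kept, reverse=True)

k = 5
for iteration in range(10):
    labels = [label_with_dl(dl, ex) for ex in examples]
    dl = cautious_select(train_dl(examples, labels) or dl, k)
    k += 5   # the decision list is allowed to grow by 5 rules per label each round
```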
So the reason we're interested in Yarowsky is that it seems to do pretty well. The first four here are our own reimplementation of Collins and Singer, and the last is a version from Abney, which we'll talk about later. You can see here that the co-training algorithm and Yarowsky-cautious are the best of that set. Collins and Singer also have an algorithm called co-boosting, which does comparably in their results; we haven't tested it. And you can see that cautiousness matters: Yarowsky-cautious and co-training, which does something similar, are the only ones that get up to about 90 percent. So cautiousness is important. And the reason we're interested in Yarowsky over the other algorithms is that co-training and co-boosting require two views, and the views are supposed to be independent. So it's nice to be able to drop that limitation and just use self-training.

So that was the upside. The downside is that we don't have very good theoretical analysis for Yarowsky; there are no proven bounds. Abney addressed that in his paper, but only for certain variants, and those variants are not cautious, and he didn't give empirical results; we don't think they would do as well without cautiousness added. Haffari and Sarkar extend this analysis, and they use a bipartite graph representation that we'll see in just a minute, and there are empirical results, but it's not cautious and therefore it does not do as well as what we just saw.

So here's the analysis. We're going to look at two types of distributions. This is the parameters: a distribution for each feature over the labels. And that's the labeling distribution: a distribution for each example over the possible labels, the labels being the senses in the examples we saw. The labeling distribution is uniform when an example is currently unlabeled; otherwise it's a point distribution with its mass on the label. And the parameters are just the decision list scores, except we've normalized them to be a distribution. So a decision list makes its choice just by taking the maximum-scoring feature from an example. Abney introduces an alternate form where we take a sum instead. It's not quite a decision list, but it is easier to analyze.

Okay. Switching topics slightly: Subramanya et al. have a more recent algorithm which they use on a different task. I believe it's part-of-speech tagging, and it's a domain adaptation setting, so a slightly different task and a slightly different algorithm. We're not concerned with the details of their algorithm, but a couple of its properties are interesting. This is self-training with a CRF. You can see it has the same two steps, relabel the data and train, but they've added extra steps: one to get type-level information and one to do graph propagation on top of that. The two things we're interested in here are, first, the overall structure of adding these steps and, second, the particular graph propagation they use.

So our own contributions for this paper: number one, we have a Yarowsky variant with a propagation objective, which we'll see in a minute. This algorithm is cautious, and we can show that it performs well; unlike the previous well-analyzed Yarowsky algorithms, we show that it can do as well as Yarowsky-cautious. Second, we've unified all these different approaches, the different algorithms. And third, we give more evidence that cautiousness is actually important.

So going back to the graph propagation: this is an objective from Haffari and Sarkar for one of Abney's Yarowsky algorithms; it's an upper bound on the algorithm. And this is the objective for the graph propagation of Subramanya et al. It's not the objective for the whole algorithm we just saw, just for the graph propagation. And you can see, in the first equation this is the labeling distribution and that's the parameters again. So if we compare these and plug those distributions in, the first term of each is the distance between the parameters, the decision list scores, and the current labeling, and the second term of each is a regularizer, so they're quite similar. So if we do plug those distributions in, then we can directly optimize that model, the bipartite graph model. Alternatively, we can drop that motivation from Haffari and Sarkar and just do graph propagation over the thetas, the parameters, where we take co-occurrence in an example to be adjacency. It's not as well motivated, but it sort of corresponds to what Subramanya et al. do.

So here's our own algorithm. It's like Yarowsky, but it has this extra step. We train a decision list exactly the way we saw, with cautiousness, and second we do propagation over that: we just take the parameters, make one of those two graphs, and propagate on it. Subramanya et al. give updates to do that propagation. We can do it either on the bipartite or unipartite graphs we saw, or a couple more in the paper. If we use the bipartite one, we only take the parameters, the decision list part of the model, at the end. And because we're doing this step, we know that at each iteration we're optimizing the objective that we saw. We can do cautiousness simply by copying the decisions of the original decision list: that one determines which examples we'll label, and this one determines what the labels actually are.
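Here is a generic sketch of squared-loss propagation over the unipartite feature graph. The update is the standard quadratic-cost form and is meant to illustrate the idea, not to reproduce the exact updates from Subramanya et al. or from the paper; `theta`, `neighbors`, and the toy numbers are assumptions.

```python
def propagate(theta, neighbors, labels, mu=0.5, nu=0.01, iters=10):
    """theta[f]: label -> probability; neighbors[f]: list of (other_feature, weight)."""
    uniform = 1.0 / len(labels)
    q = {f: dict(d) for f, d in theta.items()}
    for _ in range(iters):
        new_q = {}
        for f, d in theta.items():
            nbrs = neighbors.get(f, [])
            w_sum = sum(w for _, w in nbrs)
            new_q[f] = {}
            for y in labels:
                # Pull toward the original scores, the neighbors, and uniform.
                nbr_mass = sum(w * q[g][y] for g, w in nbrs)
                new_q[f][y] = (d[y] + nu * uniform + mu * nbr_mass) / (1 + nu + mu * w_sum)
        q = new_q
    return q

# Toy example: two co-occurring features, one strongly labeled, one uncertain.
theta = {"prison": {"text": 0.1, "punishment": 0.9},
         "sentence": {"text": 0.5, "punishment": 0.5}}
neighbors = {"prison": [("sentence", 1.0)], "sentence": [("prison", 1.0)]}
print(propagate(theta, neighbors, ["text", "punishment"]))
```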
So to look at the behavior of this algorithm, we'll do the same visualization, and again the left bar is the decision list and the right bar is the labeled data. You can see that, like before, cautiousness forces the decision list to grow linearly and stay balanced between the two labels. And now, because of the propagation, we can label more examples sooner, and the accuracy increases sooner. Comparing to what we just saw for Yarowsky-cautious, we're actually doing a bit better in this case. Again, both will change when we do a retraining and a final decision list, but the behavior is that we increase coverage and accuracy quicker.

We can also look at what happens to the objective. This is the objective from the Subramanya graph propagation, and here we've disabled cautiousness, because by changing the input to the graph, cautiousness changes the objective a lot. On the bipartite graph, this is the objective globally, and it's decreasing, and it kind of levels off as the accuracy and the coverage do.

So we have two sets of experiments. The first follows the running example. This is Eisner and Karakos's word sense disambiguation data from the Canadian Hansards: three ambiguous English words, each with two senses and two seed rules. The features are the words adjacent to the word we're looking at and some nearby context words, and we use the [inaudible] and raw forms of each. So this group of algorithms is the ones we've seen. This is Haffari and Sarkar's algorithm based on the bipartite graph, which is a different kind of graph propagation, and you can see it doesn't do well. And these are our algorithms. In this case, for the cautious forms of our algorithms -- this is the bipartite one and this is the unipartite one -- you can see the unipartite one is doing pretty close to the cautious co-training algorithm, beating Yarowsky-cautious, and the bipartite one is doing pretty well too. The data here is a little bit strange; the data sizes are small.

The second task is the named entity classification from Collins and Singer. In this case we have three labels: person, location, and organization. There are seven seed rules, with some for each label, and the features are spelling features, which come from the phrase we're classifying, and context features, which are extracted from a parse tree of the sentence: particular words nearby in the tree and the relative position in the tree. And again, these are the algorithms we've seen, this one, and our algorithms; again, the theta-only unipartite one is coming out quite well, and the bipartite one is not doing as well. In this case our algorithm is at the top. But we're not really trying to show that the algorithm beats Yarowsky-cautious or the co-training here; we're just trying to show that it comes out equivalent in accuracy, and, as we said, it doesn't have the disadvantage of requiring two views. What I didn't say is that we're reporting non-seeded accuracy here. That means we take accuracy only over the examples that the seed rules did not label, so the idea is to measure improvement beyond what we were given. And that's it. We have software online if you want to see it. Thank you.

[applause]

>>: Thank you very much. Sorry about that. So we have time for a few questions. Wait for the mic to come on.

>>: Great talk.
I wasn't able to figure out whether, once you label an example, that example can later escape from the labeling and go back to being unlabeled.

>> Max Whitney: Yeah, the steps are --

>>: I saw the labels are always increasing.

>> Max Whitney: Yes, but we relabel everything at every iteration, so an example can become unlabeled. It doesn't happen very often, but it is possible.

>>: Can you elaborate on the problem of requiring the two views, how bad it is, and in practice what --

>> Max Whitney: I don't know how bad it is in practice. But my understanding is that the theoretical requirement for those algorithms is that the two views are statistically independent, and so you only get the theoretical properties if your features have that property, which is quite unlikely. And so the idea is that we can get the same performance with a much simpler algorithm without having to have that property.

>>: I mean, the performance is on a specific dataset that doesn't have those properties, I guess.

>> Max Whitney: Well, on the datasets we've seen, it's doing comparably to the co-training algorithms. So we don't know that it's better, but it's doing comparably, and we don't have to have the requirement on the views.

>>: Also, we had to try many different ways to split the features for the word sense data, and we picked the best one. But we had to do a search; there was no natural split -- nearby and far away is not necessarily a natural split of features. So it's better -- the machine learning technology should not have to think about it. You plug it in and it should work. So why do this extra work if you don't have to? So never do co-training and always do Yarowsky. [laughter]

>>: Okay. Then that's going to conclude our second oral session.

[applause]