> Lucy Vanderwende: My name is Lucy Vanderwende, and I have the pleasure of introducing
Chris Biemann, and the purpose of the presentation is first and foremost to introduce Chris and
his work to a much wider audience than he would otherwise be able to reach by having one
meeting at a time.
Chris comes to us with Powerset, but he only joined Powerset a couple of months ago, in January. So quite recent, having finished his Ph.D. at the University of Leipzig at the end of last year.
And I'll say that there's another purpose, which would have taken place anyway, because as a
group we've been very much interested in what Chris is doing on unsupervised modeling of
language. And so this talk would have been -- we would have hosted Chris no matter where he
was. It just happens that he's now a member of Microsoft.
So that's fun. But when I heard that he had gone to Powerset, I'm like, that's too bad, I didn't know Chris was looking around, because I was going to talk to him about maybe doing a postdoc here. So here he is, and we're very happy to welcome Chris to Microsoft.
>> Chris Biemann: Thank you very much, Lucy. Thank you for the introduction. And thank you
for coming in so large numbers, which I would not have expected, frankly.
I'm here today to talk about some recent work our group in Powerset has been doing on word
sense induction and word sense disambiguation. And the purpose of this talk is two-fold. First of
all, I want to tell you all these nice things that we pursued and the nice ideas we've had and how
we tried to evaluate them. But the other purpose is to get you excited about this new platform
you just bought at Microsoft which is the Powerset Natural Language Search platform, which
could be used to plug in many things you probably might not have dreamed of, given the situation
before you bought Powerset.
So I hope that this will trigger a lot of discussions and talks in the remainder of the day and the
weeks to come. So this is the outline of my talk. I want to make a case for word sense
disambiguation in search. And I want to stress the necessity of this natural language processing
step that has been neglected in search, by telling you why I think, why we at Powerset think, this is important, especially if you're in the situation of semantic search as opposed to keyword search.
Then I'll sketch out a proposal which consists of two basic steps. One basic step is to induce a word sense inventory, so to take the senses and the model of senses from the corpus you actually apply your methods to. In this case, in our case, it's Wikipedia; in the broader sense it would be the web or some subdomain or whatever you want to apply this to.
This will involve graph clustering on a distributional thesaurus. Once we have that inventory -- and this inventory will also give us the clues for how to disambiguate things in context -- we will be able to build a word sense disambiguation component that is able to assign the correct sense out of that inventory to actual occurrences in the text.
What we did in the first place was to set up a bag of words based system, very simplistic, and to
evaluate that. I'm going to show you these results since they were very encouraging.
We set out to build a variety of things. We decided to use the grammatical relations we get from parsing the whole Wikipedia, and to use these to build a distributional thesaurus, to get better distinctions and also to shrink the amount of data we need to get nice results.
We can always do more NLP and have less data, or we can always increase the size of the data and do less NLP.
I'm going to talk about how to build an evaluation corpus. It's not that there aren't any out there. But we work on Wikipedia, not on the Wall Street Journal, not on news, so we don't want to use SemCor, we want to use our own corpus.
And we found a pretty easy and cheap way to build an evaluation corpus for word sense disambiguation, and in the end I'm going to give some outlook and further ideas to conclude.
So why isn't there anything like a word sense disambiguation component in a standard state-of-the-art search engine? First of all, what is the problem? We have a bunch of words that are ambiguous, in every language. There's various ways they can be ambiguous, but one way is semantic ambiguity. So some words have multiple senses. In the case of case, case could be a container. It could be a lawsuit. It could be grammatical, syntactic case, whatever kind of case.
And so if we see that word, we don't know what it means. And mostly what we observe is that
the frequency distribution of the senses is highly skewed, both in our collection and in query logs
if you use it in the search engine.
And the skew doesn't necessarily match. So sometimes people overwhelmingly ask for one sense that is very rare in the index. Another problem is that it's almost impossible to determine the sense the user [inaudible] when he's querying the index, from that very short keyword query.
But what's most compelling and most deceiving -- the reason why this never really got successfully implemented -- is what [inaudible] called the query word collocation effect. And that is: the longer your query is, the more you have the effect that ambiguous words disambiguate each other.
So if you query for case, you get this part of the document space that contains the word case. So
it's a very simplistic view. But if you query for court case, you get the overlap between court and case here. And this is definitely not the overlap you would get for plastic case, like a case to carry things in.
So while it's easy to get ambiguous mixed results by entering one word, it's pretty hard to get
these mixed results with three or four words. So if a user sees in the result that it's too mixed,
they just enter more words and disambiguate the query. But this does not work if we are in the semantic search world, and especially in the semantic search we're doing at Powerset, which is expanding the index.
So, a very rough overview: what we do at Powerset is we don't only store the keywords, we also extract predicate-argument structures and we also expand them semantically with synonyms in various ways.
So it's very nice that you can get results like this for the question who did Microsoft acquire: like, something was purchased by Microsoft, somebody sold something to Microsoft. And these give recall gains, though not as much as we would like. But, of course, if we don't do any disambiguation we might get spurious matches. If we ask in today's system who studied in prison, we get results about colleges, because some obscure sense in the inventory, WordNet 2.1, uses college as British slang for prison; you always have that effect no matter what inventory you use.
So if you draw a picture here -- and this is a little bit constructed -- what you get for a query of carry case is: case would not only get you the documents that contain "case" but also other meanings like "suitcase" and "lawsuit." And "carry" would also be soliciting documents that contain "expect," because of the obscure meaning of being pregnant with somebody that these two words share.
If you put in carry case and you get back lawsuit, users won't understand this. And this is partly the reason why our ranking function does not score semantic features as highly as we would like.
So let's see how we can tackle this problem, how we can repair that. And what I'm proposing here is something that is not quite the standard approach to word sense disambiguation. What you usually see is: there's an inventory, usually WordNet; there's a training corpus, usually SemCor; and we want to make a system that assigns senses from that inventory to words in the text.
And there's various ways to do it. And if you evaluate that, the best systems are in the ballpark of 75 percent, which is clearly not enough to sort out these kinds of problems.
And part of the reason is that this inventory is not meant for word sense disambiguation; WordNet was not constructed for search.
Another reason is the structure of WordNet itself; it's mostly too fine-grained. Another reason is that the resource WordNet does not necessarily match your underlying corpus. So if you go to a biomedical domain with a general-purpose inventory, you're going to get horrible results if you go for a most [inaudible] baseline or something like that.
So we want to first induce the senses from our collection, and then disambiguate with them. And the nice thing is that this is language and domain independent, as long as we have the NLP machinery to provide the features we need here.
We don't have that inventory mismatch, because the induction is performed on the same corpus. So you would restart that process whenever you have a new domain.
And you have very little or almost no manual work on lexical resources and inventories, which makes this method really cheap and quickly applicable once you have set it up. And setting it up is not that trivial. But we're working on it.
So what we put together early on was the following prototype. I'm sorry, this is very texty here, but I'll give you kind of an intuitive overview. For our collection we computed significant co-occurrences: words that come up together in sentences more often than you would expect if you assumed they were independent. That uses their frequency counts and some significance measure, in this case likelihood, with some frequency threshold, so you know exactly what I'm talking about and why you need a frequency threshold here. Once you have all these co-occurrences, you can view this as a co-occurrence graph where every word is a node and they're connected by a weight which is the significance of their co-occurrence, and some words will not be connected.
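(To make this step concrete, here is a minimal sketch of how such a significance-weighted co-occurrence graph could be built, assuming sentence-level counts are already available. The log-likelihood ratio is used here as one common choice for the significance measure; the thresholds and function names are illustrative, not Powerset's actual code.)

```python
import math

def llr(k11, k12, k21, k22):
    """Dunning-style log-likelihood ratio for a 2x2 contingency table:
    k11 = sentences containing both words, k12 = only the first word,
    k21 = only the second word, k22 = neither."""
    def h(*ks):
        n = sum(ks)
        return sum(k * math.log(k / n) for k in ks if k > 0)
    return 2 * (h(k11, k12, k21, k22)
                - h(k11 + k12, k21 + k22)
                - h(k11 + k21, k12 + k22))

def cooccurrence_graph(pair_counts, word_counts, n_sentences,
                       min_freq=5, min_sig=6.63):
    """Build the co-occurrence graph: every word is a node, and two words
    are connected by the significance of their co-occurrence, if they pass
    a frequency threshold and a significance threshold (~chi^2 at p=0.01).
    pair_counts: {(a, b): joint sentence count}, word_counts: {w: count}."""
    graph = {}
    for (a, b), k11 in pair_counts.items():
        if k11 < min_freq:
            continue                        # the frequency threshold from the talk
        k12 = word_counts[a] - k11
        k21 = word_counts[b] - k11
        k22 = n_sentences - k11 - k12 - k21
        score = llr(k11, k12, k21, k22)
        if score >= min_sig:
            graph.setdefault(a, {})[b] = score
            graph.setdefault(b, {})[a] = score
    return graph
```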
And now if we have a word we want to induce an inventory of senses for, we locate this word in the graph, take the whole neighborhood plus the connections within that neighborhood, apply a graph clustering method, and hope that these clusters represent different usages of the word. I wouldn't call them senses. I have an example on the next slide to show you what the difference between senses and usages is.
So once we have these clusters of words that co-occur with this word in different usages, we can
use these clusters to determine what sense is present in a given context, in a given sentence, by just comparing the actual context to this global sense context.
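(A minimal sketch of that induction and assignment step, assuming the co-occurrence graph from above and some graph clustering function handed in as a parameter -- the clustering algorithm itself comes up later in the talk. Names are illustrative.)

```python
def induce_usages(word, graph, cluster_graph):
    """Locate the word in the co-occurrence graph, take its neighborhood
    plus the edges among the neighbors (but not the word itself), and
    cluster that subgraph.  Each cluster of neighbors is one usage."""
    neighbors = set(graph.get(word, {}))
    subgraph = {u: {v: w for v, w in graph[u].items() if v in neighbors}
                for u in neighbors}
    return cluster_graph(subgraph)      # assumed to return a list of word sets

def assign_usage(context_words, usage_clusters):
    """Bag-of-words disambiguation: pick the cluster that overlaps most
    with the words actually seen around the target occurrence."""
    if not usage_clusters:
        return None
    scores = [len(set(context_words) & cluster) for cluster in usage_clusters]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] > 0 else None   # None = no evidence
```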
This is an example from the British National Corpus for hip: the co-occurrences of hip, where these colors are the classes as returned by the graph clustering algorithm. And what we see is, okay, there's two big meanings of hip. One is the body part hip and one is the music kind of hip, so here we have words like hip hop, reggae, loop, mainstream, album. Here's hip hip hooray; that's my favorite one.
And these things fall into the body part meaning: this is a clothing thing, this is more medical, and this is some boxing position. So what we see here is very fine-grained. It's like different fields where hip is used. But it's still the same hip, you might think.
Once we have this and sentences are coming in, we can compare these sentences to these clusters and have some kind of trigger words that put these sentences into a class. So this sentence, "A hip holds a pistol," and we know this sentence belongs to this cluster. Sometimes we have sentences that could belong to two or more, and these are actually sentences with high overlaps; so here we find this and jazz and reggae. So this is a very simple view of how one could approach this and how we did it.
That would not tell us yet what words we are able to substitute with hip. So this is not the full
story here.
But it's at least a step towards it. Because if we don't know what the distinctions are, we can't decide what to substitute, and when to substitute with different things.
And it's not trivial, because hip in the medical sense is probably replaceable by its Latin term, but for hip in a clothing usage that's probably not appropriate.
So we set out to evaluate this, because colorful clusters are nice but numbers are better for that. We used Amazon Mechanical Turk. It came up in an earlier meeting that not everyone knows what that is, so let me explain it to you. That's a service by Amazon where you can put up human intelligence tasks that would otherwise have to be solved by artificial intelligence, which basically means humans performing stupid tasks which could probably be performed by computers.
And you give them something like this. So that's what they see. This is a task. Here's some
instructions. And we ask them a question like: Are these words used for the same meaning or
not. We give them a pair of sentences that contains the same word. I know you can't read it.
And we let them decide whether this word has been used in the same sense, in a similar sense,
in a different sense, or maybe they cannot tell. And we did that in groups of five, so each question is asked of five different people. That makes sense because people get paid for that -- not really much; it's ridiculous how little they get paid for that -- but still, people tend to pick randomly, so you want to level that out. It's not high-quality annotation, but you can always get good mileage by giving it out to more people.
And what we found is: if the clustering said it was the same usage, Turkers came back with over 93 percent saying, yeah, it was identical or similar usage. Whereas if the clustering said it was different, it could just as well have been identical, so the clustering splits too much.
That's not critical as long as we get our substitutions right later. It would be much more critical if things that are not similar ended up in the same group. And this might give you an idea of how good single Turkers are: this is for all judgments and this is for the majority vote.
And the signal for the majority vote is much better than for the single judgments; that's what we use in the Amazon Turk approach. I'm going to show you some more of these tasks later.
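(Since each pair is judged by five Turkers, the aggregation is a simple majority vote; a minimal sketch, with the label names as assumptions.)

```python
from collections import Counter

def majority_vote(judgments):
    """Aggregate the five redundant Turker judgments for one sentence pair,
    e.g. ["identical", "similar", "different", "identical", "cannot tell"]."""
    label, votes = Counter(judgments).most_common(1)[0]
    return label if votes > len(judgments) / 2 else "no majority"
```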
So we were encouraged by that, and we decided to start a project we call the Sensinator -- it's kind of a Powerset naming scheme to call everything -inators; we have a volutionator and whatever other -inators.
And this is like the two-sentence big picture, and what you see in green here are the components we use for that. What we set out to do is to use grammar features, because we have this nice XLE parser. And we distinguish word usages, which we do by this graph clustering.
And we don't do it over co-occurrences but over words that are contextually similar to other words, so that have a high distributional similarity. And that's the distributional thesaurus. Once we have this clustering, we compute global features for these clusters, and these characteristics we use for disambiguation, like measuring the overlap between cluster and document, or a larger chunk.
So we don't necessarily want to go for single sentences; we can go for several sentences and just disambiguate if we have enough evidence. So there's a one-sense-per-discourse assumption hidden here.
And since colorful clusters are good but control is better, we want to measure the quality against an evaluation set which we obtain by collaborative tagging. So we're going to use Amazon Turk to get an annotation corpus like that.
In the remainder of the talk I'll dive into all these points. And it's a pity you didn't buy us a little later, because that's not quite finished yet. But we're working on that.
So, first of all, what is our document representation? We don't do bag of words anymore; we have these grammar features. Now we do tuples, and a document is represented by glued-together tuples extracted from grammatical relations. We have relations between words of the same part of speech and across parts of speech. We treat verbs in different subcategorization frames as different words, which makes sense when trying to disambiguate words, because there's a high correlation between subcategorization and meaning. And what would that look like technically?
So this is a sentence, "While leaving the facility briefly she quickly returned on February 22nd," and I just highlighted the tuples for these three words: leaving, facility and she. What we get back from this -- this is an article about Britney Spears, it was sentence No. 126 -- is, for the word facility as a noun (please ignore these funny characters), a feature: this word has been the object of "to leave" in a verb/subject/object frame, with a frequency of one in that sentence.
So you could think about aggregating over articles and having higher frequencies here. Vice versa, the word "to leave" in that subcat frame has a feature that "facility" was in the object position and "she" in the subject position, and vice versa for "she" we have that feature.
So this is how we transform our documents in the first place. And this, of course, gives us a much more fine-grained view of language than we would have obtained with bag of words at the sentence level, where we could not distinguish between she and facility with respect to leaving.
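(A minimal sketch of how such per-word features could be derived from extracted grammatical relations; the tuple format and feature encoding here are illustrative assumptions, not the actual Powerset representation.)

```python
from collections import Counter

def features_from_relations(relations):
    """relations: (verb, frame, slot, argument) tuples from one parsed
    sentence, e.g. ("leave", "V/SUBJ/OBJ", "OBJ", "facility").
    Each relation yields features in both directions, as in the talk:
    the argument records which slot of which verb frame it filled, and
    the frame-specific verb records which word filled each slot."""
    feats = Counter()
    for verb, frame, slot, arg in relations:
        feats[(arg, f"{slot}-of:{verb}#{frame}")] += 1
        feats[(f"{verb}#{frame}", f"{slot}:{arg}")] += 1
    return feats

# "While leaving the facility briefly she quickly returned on February 22nd"
sentence_feats = features_from_relations([
    ("leave", "V/SUBJ/OBJ", "OBJ", "facility"),
    ("leave", "V/SUBJ/OBJ", "SUBJ", "she"),
])
# Aggregating these counters over all sentences of all articles gives the
# higher frequencies mentioned above.
```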
And from that we compute the distributional thesaurus, which is pretty much a standard
component that you compute from the distributional similarity statistics.
So you compare words along common contexts. And unlike an ordinary thesaurus, you have for each word a ranked list of entries. So it's not just an entry populated by a bunch of words, as in the Microsoft Word thesaurus; it looks much more like this. For example, the entry for "meeting" would have "gathering" with a score of 56 and "seminar" with 49, and going down like this.
And you should cut off somewhere, maybe. There's a lot of parameter space to explore. And these words are not synonyms, and these words are not necessarily substitutable -- well, sometimes they are; meeting and gathering is a good one.
But if you query for PowerPoint, you don't want to get Excel, you want to get PowerPoint. So these are words that behave similarly. What we see in the PowerPoint entry is something maybe unexpected: there's a couple of file formats here, because you could save something in the PowerPoint format.
So that's a different usage. So PowerPoint is ambiguous -- who would have thought that? And this is how we do it. This is how we arrive, over a couple of steps, from text broken down into these kinds of tuples to a distributional thesaurus; there's a bunch of parameters here. We compute significance between word-feature pairs, because we don't want features that apply to everything, like "large" as an adjective. Many things can be large.
There's some pruning. And there's a trick here that avoids comparing each word to each other word. You might think that if you have a million words, we have to compare a million words to a million words, and that matrix is too big even for parallel computing.
But we shortcut here a little bit by using the fact that this similarity matrix will be sparse, and we just compare things that have at least one feature in common. And all these boxes, all these steps, are implemented as MapReduce jobs -- a little bit like Cosmos and SCOPE, in Microsoft lingo. So that's parallelized, and we compute the distributional thesaurus in maybe four hours for Wikipedia on 60 cores. So that might easily scale up to large parts of, say, the English web.
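(A minimal single-machine sketch of that comparison step, assuming each word already has a pruned set of significant features; the inverted index realizes the "only compare words that share at least one feature" shortcut. In the real system this runs as parallel MapReduce jobs; names here are illustrative.)

```python
from collections import defaultdict, Counter

def distributional_thesaurus(word_features, top_n=10):
    """word_features: {word: set of significant, already-pruned features}.
    Similarity = number of shared features.  The inverted index ensures we
    only ever compare word pairs that share at least one feature, which is
    the sparsity shortcut described above."""
    inverted = defaultdict(list)                 # feature -> words carrying it
    for word, feats in word_features.items():
        for f in feats:
            inverted[f].append(word)

    similar = defaultdict(Counter)
    for f, words in inverted.items():            # features that apply to everything
        for i, a in enumerate(words):            # should already be pruned away
            for b in words[i + 1:]:
                similar[a][b] += 1
                similar[b][a] += 1

    # Ranked entry per word, e.g. "meeting" -> [("gathering", 56), ("seminar", 49), ...]
    return {w: c.most_common(top_n) for w, c in similar.items()}
```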
Okay. So now we have this distributional thesaurus, but since it's unsupervised, like clustering, it's hard to evaluate and assess quality.
Since we didn't know how to evaluate this thing, we set out to invent a couple of evaluation
methods and check whether they agree.
>: Speaking of evaluation, on the previous slide, did you evaluate each step? Like, for example, do you have statistics about how many tuples were extracted correctly versus incorrectly, or did you just evaluate [inaudible]?
>> Chris Biemann: In this case we evaluate the whole end-to-end thing. Of course it's not the case that all tuples are correct. Other evaluations showed -- please correct me, Dick, if I'm wrong -- somewhere in the ballpark of 80, 90 percent. It depends on what tuples you go for: it could be subject-object relations and noun coordination and compounding, which we all use.
Distinguishing obliques and adjuncts is kind of hard, and we don't know whether there's a difference at all. So it's kind of a hard question. There are evaluations like that at Powerset, but not for that specific purpose.
So we drummed up four different evaluations for this thesaurus. One is a visual evaluation: give team members a couple of entries from a couple of distributional thesauri and check which one looks best. Very informal.
We also measured WordNet overlap. So for a large number of nouns and verbs that have entries in the distributional thesaurus, we checked the WordNet distance. We used the Jiang-Conrath measure for that, since it's been proven useful in a number of tasks, and checked, for the top, say, 10 entries, what the average distance in WordNet is.
And we would like to get the distributional thesaurus with the shortest distance, with the highest similarity. We also used the Turkonym rank. What is a Turkonym? A Turkonym is a synonym elicited via Amazon Turk. Very simple task: you give the word and ask Amazon Turk to return a list of synonyms, so people just enter synonyms. And it's pretty good what comes out. We have like 250 pages. Very small experiment.
And we check how often we can find them in the distributional thesaurus and how highly they are ranked, because we want similar words, and synonyms are clearly similar words.
And we also implemented the most frequent sense finding method of McCarthy et al. This is basically taking the distributional thesaurus entry and checking in WordNet which sense is most frequent, by checking overlaps with the different regions of WordNet where you find senses of the word and trying to find the highest one.
We also labeled a couple of sentences, a random sample, to get an idea of what the most dominant sense is, and tried to check whether we can find it with that distributional thesaurus. What we found, not surprisingly, is that all four methods highly agreed. So even though they're kind of different -- I mean, Turkonym rank and WordNet overlap are a little similar, in that the thesaurus entries also contain synonyms.
What we found is that we can safely use the automatic methods, which would be WordNet overlap and Turkonym rank, and skip the visual evaluation and the sense distribution labeling, since they give the same results.
This is unfortunately only evaluating the distributional thesaurus and not the whole system. But since we had too many parameters for that thesaurus, we tried to step back and do that first to get some rough idea of what could be good.
So what came out of that was: if you want to compute a distributional thesaurus for that purpose, per word only use the 200 most significant features. You could use 300, but don't use much more. And the more frequent a feature is, the less weight it should have.
So a really rare feature that applies only to a handful of words gives you much more information than a very frequent feature like "she" as a subject or "large" as an adjective modifier, stuff like that.
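(A minimal sketch of those two findings -- keep only the top-200 most significant features per word, and down-weight features by how many words they apply to. The exact weighting function is an assumption; the talk only says that more frequent features should count less.)

```python
import math

def prune_and_weight(word_features, feature_spread, max_features=200):
    """word_features: {word: {feature: significance score}}.
    feature_spread: {feature: number of distinct words it applies to}.
    Keep only the most significant features per word, and give widely
    applicable features less weight (1/log here, purely as one plausible
    realization of 'more frequent feature, less weight')."""
    pruned = {}
    for word, feats in word_features.items():
        top = sorted(feats.items(), key=lambda kv: kv[1], reverse=True)[:max_features]
        pruned[word] = {f: 1.0 / math.log(1.0 + feature_spread[f]) for f, _ in top}
    return pruned
```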
So now we have an idea of what distributional thesaurus to use, and we can go ahead and cluster it. Just as we saw with that hip example, the idea is that similar words within a distributional thesaurus entry are clustered together, because they are contained in each other's entries and have a similarity score.
And just like in the previous examples, the clusters will serve as sense representations and will be used to build a context model. So we can collect the features of the words we find in a cluster, aggregate them up, and use these in the context-based disambiguation.
For clustering, we use Chinese Whispers graph clustering, which has a bunch of nice properties. First and foremost, it finds the number of clusters automatically.
And this is crucial for word sense induction. You can't just set out and say, okay, everything has two meanings, when you have a word like father which is super frequent but basically only has one meaning.
Another thing is it's very efficient, so we can easily do that for a wide range of the frequency spectrum. And it's proven useful for natural language data in general, because graphs induced from natural language usually have the properties found in such networks, like scale-free, small-world structure and clique-ish neighborhoods, and that comes in handy.
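(Chinese Whispers itself is simple enough to sketch in a few lines: every node starts in its own class, and nodes repeatedly adopt the strongest class among their neighbors, so the number of clusters emerges on its own. A minimal sketch, not the production implementation.)

```python
import random
from collections import defaultdict

def chinese_whispers(graph, iterations=20, seed=0):
    """graph: {node: {neighbor: edge weight}}.  Returns {node: cluster label}.
    Every node starts as its own cluster; in randomized order, each node
    adopts the label with the highest total edge weight among its neighbors.
    The number of clusters is not fixed in advance."""
    rng = random.Random(seed)
    label = {node: node for node in graph}
    nodes = list(graph)
    for _ in range(iterations):
        rng.shuffle(nodes)
        changed = False
        for node in nodes:
            votes = defaultdict(float)
            for neighbor, weight in graph[node].items():
                votes[label[neighbor]] += weight
            if votes:
                best = max(votes, key=votes.get)
                if best != label[node]:
                    label[node] = best
                    changed = True
        if not changed:          # converged
            break
    return label
```

Grouping the nodes by their final label then gives the clusters that serve as usages.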
So let me show you an example from our best-found distributional thesaurus. We clustered the entry of arm as a noun, and we used the first 100 entries for that. And what we found is we get three clusters here. One is mostly body part; if you look closely, you could probably find things like armchair. And, yeah, since it's 100 words, of course, some of them are quite far away from arm, like mirror.
But anyway, another cluster here is much smaller; it's firearm. And the smallest is branch, as in the arm or the branch of the company.
>: Suggest that you --
>> Chris Biemann: Huh?
>: An adjustable rate mortgage. It's an acronym.
>: Financial term.
>> Chris Biemann: I didn't know that. That's good. I was wondering.
>: Room is the outlier.
>> Chris Biemann: Room is the outlier. [Laughter] Bear in mind this is Wikipedia, so all those things -- we've seen that a lot recently in the news -- don't affect this.
>: But I suspect Hammer is Arm and Hammer, the company.
>> Chris Biemann: That may be a named entity recognition issue. Unfortunately, that is also algorithmic, so I cannot blame anybody else for that.
>: So I have a question about -- you mentioned this distinction, in the distributional similarities and the ambiguity, between things that are synonyms and things that are similarly distributed but are not substitutable for each other, like body parts or colors.
>> Chris Biemann: Right. We cannot distinguish that.
>: That's an interesting problem. I was wondering if you had any --
>> Chris Biemann: I have something in the further ideas section about that. But it's like a crucial core problem for everything like ontology learning and taxonomy induction.
There's a huge body of research about that, and nobody really came up with a really good solution, I guess.
>: On a practical level, [inaudible] has this problem as well, because they do a lot of prediction on queries and logs for suggestions and a whole number of things. [Inaudible] things that the distribution turns up are not synonyms.
>> Chris Biemann: Right. In the query operation project, for example: it might be just two companies manufacturing the same thing, but clearly, if you look for one you don't want the other. It's like this PowerPoint/Excel example.
Okay. So let's look at the features we get from these clusters. The body part features look like: you could attach something to it; it's coordinated with leg, as in arm and leg or neck and leg; you can break it; it's coordinated with body; it can be long; it's round. So these are the most frequent ones, and here are the frequency counts for that.
For firearm, they look entirely different. And it's also striking that the kind of feature looks
different. So the first part is always the kind of feature we have. So these are more the
coordinations with other nouns, like arms and explosives or arms and ammunition, and you can
grab it. You can use it. You can hold it.
You could equip something with it. And for that branch or what was that mortgage --
>: Mortgaging is another one.
>> Chris Biemann: But that looks more like the charitable arm of the company, right.
>: Yeah.
>> Chris Biemann: We are going to have the problem that these kinds of minor senses might get conflated together. But let's skip that; on the other hand, they're minor senses, so we have very low counts here, which means we have very low confidence here. That means for that low-confidence stuff we'd just better leave it out and not expand. Maybe we start with the low-hanging fruit of the big senses first and just have a method for detecting whether or not we should do it.
If you have a constructed sentence like this, like the commercial arm of the company produces
arms and guns for people who broke their arms, then it's pretty clear that you could use this
grammatical context here to disambiguate.
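(A minimal sketch of that kind of context-based disambiguation against aggregated cluster features, including the abstain-when-unsure behavior just described; the feature encoding, the threshold and the names are illustrative assumptions.)

```python
def disambiguate(context_features, sense_features, min_score=2.0):
    """context_features: features observed around one occurrence of the
    target word, e.g. {"coord:ammunition", "obj-of:grab"}.
    sense_features: {sense_id: {feature: aggregated count}} collected from
    the words in each cluster.  Returns the best-scoring sense, or None if
    the evidence is too weak -- in which case we simply do not expand."""
    best_sense, best_score = None, 0.0
    for sense, feats in sense_features.items():
        score = sum(feats.get(f, 0) for f in context_features)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense if best_score >= min_score else None
```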
All right, this is a slide that illustrates how we arrive from these clusters at these features.
So imagine we have that in place; it's currently being coded. Now we're going to, again, evaluate it. And for that we set out to build a sense-labeled evaluation corpus. And it's a different kind of corpus than what we find in repositories in the literature, because basically what we want is sentences containing a target word, which is marked up as such.
And we want to know whether a group of sentences contains the word in the same meaning. So we're not interested in a positive definition of that meaning, or a gloss, or a definition; we just want to distinguish these meanings in the first place. And what we'd like to have for our search application of semantic expansion in the index is to substitute these target words per group, per meaning. So what's a viable substitution, given that we know the meaning?
What we'd like to have is that the distribution and the coverage of senses should be the same as in our target collection, so it should match the underlying corpus we draw these things from. This is another argument against SemCor. And the senses should be grouped by the same substitutes, and not by the same entities or whatever else you might think of as viable distinctions on a semantic level -- which is of course a question of granularity.
So coming back to that hip example, probably we do not want to group together the clothing hip
and the medical hip. Different definitions. Probably we won't. I'm not sure about that yet.
So this is a Turkonym bootstrap cycle that allows us to get an evaluation corpus.
We actually executed that for a bunch of high-frequency words as a pilot. So we start here: each word runs through that cycle until it finally finishes. For each word we draw a couple of sentences that contain that word from our corpus, and we try to get different meanings, so we can use any word sense induction and disambiguation system we have in place.
In this case we used that bag-of-words one. That's not crucial, but it speeds up the process a little. Then we have a low number of sentences, say 10 sentences, and we get Turkonyms in context; that is, we present a sentence with a marked-up word to Turkers and ask them to supply synonyms in that context.
Then once we have these sentences with Turkonyms, we can cluster the sentences according to their Turkonyms, putting the sentences that get the same substitutions into the same cluster, and for each cluster we take the most prototypical example, however we get it.
So we aim at one sentence per usage. Then our corpus comes in again: we select, say, 100 random sentences from the corpus and have Turkers match the meaning against this one sentence per usage. And we also give them an exit: it's impossible to tell, because that sentence is really too short. Or we give them an exit saying it's not covered in these sentences.
So the new sentence that comes in is not in the sense model we already have. If we can reliably assign most of these sentences, then we assume that we captured most of the senses in the corpus and we're done.
If you have a lot uncovered, you send them up here and repeat, and go on like this. So this is what the Turk tasks look like. Again, a little small, but it's a simple task. Like here it's "find substitutable words": the sentence with the bolded word, and they can enter substitutes.
And we would pay them one cent for completing this. For "match the meaning" there are much more instructions, and then we have a sentence with a marked-up word, our prototypical sentences, plus the two exits: not covered, and impossible to tell. And you would put 10 of these sentences with the same target word in one HIT and pay them three cents. So this is really, really cheap.
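(The sentence-grouping step of that cycle can be sketched as grouping sentences whose elicited substitutes overlap; the overlap criterion and the greedy merging are assumptions, just to make the idea concrete.)

```python
def group_by_turkonyms(sentence_subs, min_overlap=1):
    """sentence_subs: {sentence_id: set of substitutes supplied by Turkers}.
    Greedily merge sentences whose substitute sets overlap; each resulting
    group is treated as one usage of the target word, and one prototypical
    sentence per group goes into the 'match the meaning' task."""
    groups = []                         # list of (sentence ids, pooled substitutes)
    for sid, subs in sentence_subs.items():
        for ids, pool in groups:
            if len(subs & pool) >= min_overlap:
                ids.add(sid)
                pool |= subs
                break
        else:
            groups.append(({sid}, set(subs)))
    return [ids for ids, _ in groups]
```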
But still, it's quite reliable. So for case, it came back at some point with two senses; if you iterate longer, you get more, maybe. That was involved in that case, the restriction case. If you look at the cases that have been aligned here, they are all clearly the lawsuit case.
Whereas this one is an interesting one: for "case of fabricated evidence," a bag-of-words model would probably assign lawsuit there, because it's fabricated evidence, which you can find in the courtroom. But this is of course the case as in situation.
And this is all reliably marked up here. And we want to use this as an evaluation corpus first and foremost. But we are, of course, looking into using that for a supervised disambiguation system and training. But you're going to see the results of our other system first. So what we currently have is data for about 100 words: sentences from about 15,000 articles, about 70,000 sentences, which total about 110,000 interesting tuples -- tuples that contain our target word.
And that cost us about 300 bucks. So this is much, much less than went into SemCor and other kinds of things. It's not that elaborate. But for our purposes that's pretty much what we need.
And these things we're going to use for evaluation. I'm slowly coming to the later stages, so right
now I'm talking more about things that we want to do as opposed to things that we already did.
And this is how to tackle the substitutes. From this process we get all these substitutes; we could use them, but we only have them for the evaluation corpus.
What we want is substitutes for all words. And what we can do, in a first system, is to map that to WordNet. And that might even be desired, because there are a lot of NLP pipeline steps and other resources, like FrameNet and SUMO and [inaudible], that are mapped to WordNet. Sometimes it might be advisable not to construct everything from scratch but to take what's there and align to it.
There's this method that I mentioned earlier, by McCarthy et al., that has been used to determine the most frequent WordNet sense using a distributional thesaurus, by looking at the hierarchy and seeing where the words of the entry come up and what is closest.
And once we cluster the distributional thesaurus entry, we could do that not for the full entry but for the single clusters, and maybe get some idea of how to map WordNet senses to cluster senses. This will not be at all trivial, because the granularities do not match and the distributions do not match. But it's at least some path to it.
So for our arm example, there's this part of WordNet that has arm as branch and subdivision, which is kind of a division, and there's arm as a weapon, which has hypernyms like weapon and instrument. If you overlay that with the cluster result, it's pretty clear which sense is meant here.
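(A minimal sketch of such an overlay, in the spirit of the McCarthy et al. idea, using NLTK's WordNet interface: score each WordNet sense of the target by its similarity to the cluster members and pick the best one. The choice of path similarity, and NLTK itself, are assumptions made for illustration.)

```python
# Requires NLTK with its WordNet data installed (nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

def map_cluster_to_synset(target, cluster_words, pos=wn.NOUN):
    """Score each WordNet sense of `target` by its similarity to the words
    of one induced cluster and return the best-scoring synset, e.g. the
    {weapon, ammunition, explosive, ...} cluster of 'arm' should land on
    the weapon sense."""
    best_synset, best_score = None, 0.0
    for synset in wn.synsets(target, pos=pos):
        score = 0.0
        for word in cluster_words:
            sims = [synset.path_similarity(s) or 0.0
                    for s in wn.synsets(word, pos=pos)]
            score += max(sims, default=0.0)
        if score > best_score:
            best_synset, best_score = synset, score
    return best_synset
```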
And once we do that, we could use the substitutions that WordNet gives us. So arm-3 would be substitutable with weapon, and arm-4 would be substitutable with branch and subdivision. And when we match query and passages, we might even use hypernyms and all the things we currently do in the Powerset search engine.
>: Do you have cases where you would have hypernyms that are not included in WordNet? A typical example: a tulip, it's not a flower; a rose is a bush, it's not a flower.
>> Chris Biemann: Lots of [inaudible] things.
>: How do you handle those cases? Because they must be --
>> Chris Biemann: In our first system for the substitutions we don't handle them. What I would hope is to find a way to get all these substitutable things from the corpus, either entirely unsupervised, or weakly supervised by training that up with existing things, or by having Turkers tick off the good things and throw out the bad things.
And I think this is the only way to go in the long run, because we want to adapt that to other domains, other languages and so forth, and also make it extensible to new lingo and terminology, which is clearly not possible if you have a fixed inventory.
But we're not concerned with that right now; we keep it in mind. And one thing is to probably use several data sources.
So in the meeting before, we saw these interesting similarity lists, like paraphrase lists coming from machine translation, which looked pretty similar to these distributional thesaurus entries anyway. And you could think about pattern-based methods, like [inaudible] patterns, like looking for "X is a Y," and combining these with resources, because very shallow pattern-based things hit a couple of good things and also a lot of noise. But the more sources you combine, the better you get at the real picture, we would hope. Question?
>: How much do you -- it's more of a suggestion, actually.
>> Chris Biemann: We're getting very close. Maybe we could do that in discussion.
>: Sure.
>> Chris Biemann: We're already in the further ideas. So the pattern thing is up here. Another thing up here is a confidence-first strategy, like disambiguating only the really safe bets. So we have a lot of trigger words where we know exactly that this must be that sense, and once we have that we can learn new trigger words from these occurrences to self-train the model, and even use other triggers, like domain-based triggers, or just bag-of-words features along these lines.
And we can also use Turkers to align WordNet with our sense inventory if we want to, and we can also use Turkers to join senses in WordNet, because WordNet is kind of too splitty; it makes too many distinctions for IR.
So immediate next steps would be to evaluate how much we gain from the grammar features. We want to assess coverage. We want to directly evaluate the WSI and WSD -- so, induction and disambiguation -- with the Turkers, and eventually do relevance testing. For that we might need targeted passage-query pairs.
And on the front of feature engineering, we want to experiment with more and more precise
grammar features. And maybe include these topical features, trigger words, or even like domain
information.
So to conclude, I showed you a data-driven approach to word sense induction and disambiguation, which is characterized by minimal manual development, because the inventory is induced, and it needs only part of the pipeline; even if you don't have a parser, you can always fall back to bag of words. It's possible to map that to existing inventories. I showed you some evaluation strategies and the next steps. So this is the end of my talk. I want to thank you very much for your attention. I'm open to questions and comments, and thank you very much.
>> Chris Biemann: So -- there was that suggestion.
>: So if there's a concern over finding words in this list that are safely substitutable -- you mentioned one reasonable option is to try to find overlap with patterns that also indicate substitutability. An idea I think that sometimes gets lost is that there are actually very clear patterns that indicate the opposite of substitutability. And you can mine for those in text and do the sorts of tricks like bootstrapping from a few words that aren't substitutable: you can find where they occur, the types of patterns they occur in. Excel and PowerPoint actually co-occur all the time, in patterns that will tell you you should never substitute them for each other.
>: Like what kind of patterns?
>: "And."
>: "And" other than is and such as.
>> Chris Biemann: That would qualify them as siblings, right? There's even work that boils that down to: if you have a high-frequency word between X and Y, and you see them also in the other direction, then they are siblings. That's an ACL 2006 paper; I was impressed that it works for at least 20 languages. That's a good one. So we can learn a lot from all that ontology induction literature, even just trying to get the relations right, because if we know that, we might mine for what the common hypernym is.
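(A minimal sketch of that kind of sibling mining over plain text: count "X and Y" coordinations and, when a pair shows up often enough in both directions, flag the two words as siblings that should not be substituted for each other. The regex tokenization and the thresholds are simplifying assumptions.)

```python
import re
from collections import Counter

def sibling_pairs(sentences, min_count=2):
    """Mine 'X and Y' coordinations from raw sentences; if a pair is seen
    often enough in both directions, treat the two words as siblings
    (co-hyponyms) that should NOT be substituted for each other,
    e.g. Excel and PowerPoint."""
    counts = Counter()
    pattern = re.compile(r"\b(\w+) and (\w+)\b")
    for sentence in sentences:
        for x, y in pattern.findall(sentence.lower()):
            counts[(x, y)] += 1
    return {frozenset((x, y)) for (x, y), c in counts.items()
            if c >= min_count and counts[(y, x)] >= min_count}
```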
>: I know that a lot of people in this audience are interested in ontologies. Is there a follow-on
story for ontology induction?
>> Chris Biemann: I dug into ontology induction a couple of years ago. I bailed out and thought it was not worth the trouble, because in the strict sense of ontology, as a formal ontology with all that entails, you want to do inference on it, and a lot of things. And what I saw so far is that there are these top-level ontologies that come from the top down, and these kinds of methods come from the bottom up. And there's a gap, and this gap is not easily bridged. So you could probably do something like ontology population.
So if you have quite a good skeleton, a good taxonomy, you could find, say, instances; you could put named entities in the various places. I think these kinds of things would work.
But constructing that bottom up from scratch will not lead you to the top level. Because these
things are not expressed at all. And I think humans do not learn that by language but by
interaction with the world.
If you look in corpora, that's the part we're missing. That was a very high-level ontology view.
>: Ontology induction, as in using natural language processing to build an ontology.
>> Chris Biemann: Well, it depends on what you -- yeah.
>: The induction.
>> Chris Biemann: Yeah, I was speaking about first getting that taxonomic backbone that always underlies an ontology, by looking at text and applying NLP methods to get at this. But that doesn't mean that these are not right.
>: You can use ontologies in other ways. You can use them to represent, to extract information into ontology structures so you can query on things.
>> Chris Biemann: Not explicitly. What we do is we query Freebase and extract a lot of facts from Freebase, and Freebase is kind of a big database of, say, named entities and a lot of data on them. So we have a component that tries to translate a query into a Freebase query; it's kind of a natural language interface to a database, so to speak. And we would get this, what you call an ontology, from Freebase.
We don't maintain our own ontology. What we do is something like instance classification into the WordNet hierarchy, and currently we're using mostly the WordNet hierarchy.
> Lucy Vanderwende: Thank you very much, Chris. I think everybody knows the e-mail address.
So please continue to ask questions. Thank you very much. I think that unsupervised methods
and induction of word sense is a very -- it's one of the most interesting new directions of the field.
So thank you.
[Applause]