> Lucy Vanderwende: My name is Lucy Vanderwende, and I have the pleasure of introducing
Chris Biemann, and the purpose of the presentation is first and foremost to introduce Chris and
his work to a much wider audience than he would otherwise be able to reach by having one
meeting at a time.
Chris comes to us with Powerset, but he only joined Powerset a couple of months ago, in January. So quite recent, having finished his Ph.D. at the University of Leipzig at the end of last year.
And I'll say that there's another purpose, which would have taken place anyway, because as a
group we've been very much interested in what Chris is doing on unsupervised modeling of
language. And so this talk would have been -- we would have hosted Chris no matter where he
was. It just happens that he's now a member of Microsoft.
So that's fun. But when I heard that he had gone to Powerset, I'm like, that's too bad, I didn't know Chris was looking around, because I was going to talk to him about maybe doing a postdoc here. So here he is, and we're very happy to welcome Chris to Microsoft.
>> Chris Biemann: Thank you very much, Lucy. Thank you for the introduction. And thank you
for coming in so large numbers, which I would not have expected, frankly.
I'm here today to talk about some recent work our group in Powerset has been doing on word
sense induction and word sense disambiguation. And the purpose of this talk is two-fold. First of
all, I want to tell you all these nice things that we pursued and the nice ideas we've had and how
we tried to evaluate them. But the other purpose is to get you excited about this new platform
you just bought at Microsoft which is the Powerset Natural Language Search platform, which
could be used to plug in many things you probably might not have dreamed of, given the situation
before you bought Powerset.
So I hope that this will trigger a lot of discussions and talks in the remainder of the day and the
weeks to come. So this is the outline of my talk. I want to make a case for word sense
disambiguation in search. And I want to stress the necessity of this natural language processing
step that has been neglected in search, by telling you why I think, why we at Powerset think, this is important, especially if you're in the situation of semantic search as opposed to keyword search.
Then I'll sketch out a proposal which consists of two basic steps. One basic step is to induce a word sense inventory, so to take the senses and the model of senses from the corpus you actually apply your methods to. In this case, in our case, it's Wikipedia; in the broader sense it would be the web or some subdomain or whatever you want to apply this to.
This will involve graph clustering on a distributional thesaurus. Once we have that inventory -- and this inventory will also give us the clues for how to disambiguate things in context -- we will be able to build a word sense disambiguation component that is able to assign the correct sense out of that inventory to actual occurrences in the text.
What we did in the first place was to set up a bag of words based system, very simplistic, and to
evaluate that. I'm going to show you these results since they were very encouraging.
We set out to build a variety of things. We decided to use the grammatical relations we get from parsing the whole Wikipedia, and to use these to build a distributional thesaurus, to get better distinctions and also to shrink the amount of data we need to get nice results.
We can always do more NLP and have less data, or we can always increase the size of the data and do less NLP.
I'm going to talk about how to build an evaluation corpus. It's not that there aren't any out there. But we work on Wikipedia, not on the Wall Street Journal, not on news, so we don't want to use SemCor, we want to use our own corpus.
And we found a pretty easy and cheap way to build an evaluation corpus for word sense disambiguation, and in the end I'm going to give some outlook and further ideas to conclude.
So why isn't there anything like a word sense disambiguation component in a standard state-of-the-art search engine? First of all, what is the problem? We have a bunch of words that are ambiguous, in every language. There's various ways they can be ambiguous, but one way is semantic ambiguity. So some words have multiple senses. In the case of case, case could be a container. It could be a lawsuit. It could be grammatical, syntactic case, whatever kind of case.
And so if we see that word, we don't know what it means. And mostly what we observe is that
the frequency distribution of the senses is highly skewed, both in our collection and in query logs
if you use it in the search engine.
And the skew doesn't necessarily match. So sometimes people overwhelmingly ask for one sense that is very rare in the index. Another problem is that it's almost impossible to determine the sense the user [inaudible] when he's querying the index, from that very short keyword query.
But what's most compelling and most deceiving -- the reason why this never really got successfully implemented -- is what [inaudible] called the query word collocation effect. And that is: the longer your query is, the more you have the effect that ambiguous words disambiguate each other.
So if you query for case, you get this part of the document space that contains the word case. So
it's a very simplistic view. But if you query for court case, you get the overlap between court and case here. And this is definitely not the overlap you would get for plastic case, like a case to carry things in.
So while it's easy to get ambiguous mixed results by entering one word, it's pretty hard to get
these mixed results with three or four words. So if a user sees in the result that it's too mixed,
they just enter more words and disambiguate the query. But this does not work if we are in the semantic search world, and especially in the semantic search we're doing at Powerset, which is expanding the index.
So, a very rough overview: what we do at Powerset is we don't only store the keywords, we also extract predicate-argument structures and we also expand them semantically with synonyms in various ways.
So it's very nice that you can get results like this for the question who did Microsoft acquire: like, something was purchased by Microsoft, somebody sold something to Microsoft. And these give recall gains, though not as much as we would like. But, of course, if we don't do any disambiguation we might get spurious matches. If we ask in today's system who studied in prison, we get results about colleges, because some obscure sense in the inventory, WordNet 2.1, uses college as British slang for prison; you always have that effect no matter what inventory you use.
So if you draw a picture here -- and this is a little bit constructed -- what you get for a query of carry case is: case would not only get you the documents that contain "case" but also other meanings like "suitcase" and "lawsuit." And "carry" would also be soliciting documents that contain "expect," because of the obscure meaning of being pregnant with somebody that these two words share.
If you put in carry case and you get back lawsuit, users won't understand this. And this is partly the reason why our ranking function does not score semantic features as highly as we would like.
So let's see how we can tackle this problem, how we can repair that. And what I'm proposing here is something that is not quite the standard approach to word sense disambiguation. What you usually see is: there's an inventory, usually WordNet; there's a training corpus, usually SemCor; and we want to make a system that assigns senses from that inventory to words in the text.
And there's various ways to do it. And if you evaluate that, the best systems are in the ballpark of 75 percent, which is clearly not enough to sort out these kinds of problems.
And part of the reason is that this inventory is not meant for word sense disambiguation; WordNet was not constructed for search.
Another reason is the structure of WordNet itself; it's mostly too fine-grained. Another reason is that the resource WordNet does not necessarily match your underlying corpus. So if you go to a biomedical domain with a general-purpose inventory, you're going to get horrible results if you go for a most [inaudible] baseline or something like that.
So we want to first induce the senses from our collection, and then disambiguate with them. And the nice thing is that this is language and domain independent, as long as we have the NLP machinery to provide the features we need here.
We don't have that inventory mismatch, because the induction is performed on the same corpus. So you would restart that process whenever you have a new domain.
And you have very little or almost no manual work on lexical resources and inventories, which makes this method really cheap and quickly applicable once you have set it up. And setting it up is not that trivial. But we're working on it.
So what we put together early on was the following prototype. I'm sorry, this is very texty here, but I'll give you kind of an intuitive overview. For our collection we computed significant co-occurrences: words that come up together in sentences more often than you would expect if you assumed they were independent. That uses their frequency counts and some significance measure, in this case likelihood, with some frequency threshold, so you know exactly what I'm talking about and why you need a frequency threshold here. Once you have all these co-occurrences, you can view this as a co-occurrence graph where every word is a node and they're connected by a weight which is the significance of their co-occurrence, and some words will not be connected.
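(To make this step concrete, here is a minimal sketch of how such a significance-weighted co-occurrence graph could be built, assuming sentence-level counts are already available. The log-likelihood ratio is used here as one common choice for the significance measure; the thresholds and function names are illustrative, not Powerset's actual code.)

```python
import math

def llr(k11, k12, k21, k22):
    """Dunning-style log-likelihood ratio for a 2x2 contingency table:
    k11 = sentences containing both words, k12 = only the first word,
    k21 = only the second word, k22 = neither."""
    def h(*ks):
        n = sum(ks)
        return sum(k * math.log(k / n) for k in ks if k > 0)
    return 2 * (h(k11, k12, k21, k22)
                - h(k11 + k12, k21 + k22)
                - h(k11 + k21, k12 + k22))

def cooccurrence_graph(pair_counts, word_counts, n_sentences,
                       min_freq=5, min_sig=6.63):
    """Build the co-occurrence graph: every word is a node, and two words
    are connected by the significance of their co-occurrence, if they pass
    a frequency threshold and a significance threshold (~chi^2 at p=0.01).
    pair_counts: {(a, b): joint sentence count}, word_counts: {w: count}."""
    graph = {}
    for (a, b), k11 in pair_counts.items():
        if k11 < min_freq:
            continue                        # the frequency threshold from the talk
        k12 = word_counts[a] - k11
        k21 = word_counts[b] - k11
        k22 = n_sentences - k11 - k12 - k21
        score = llr(k11, k12, k21, k22)
        if score >= min_sig:
            graph.setdefault(a, {})[b] = score
            graph.setdefault(b, {})[a] = score
    return graph
```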
And now if we have a word we want to induce an inventory of senses for, we locate this word in the graph, take the whole neighborhood plus the connections within that neighborhood, apply a graph clustering method, and hope that these clusters represent different usages of the word. I wouldn't call them senses. I have an example on the next slide to show you what the difference between senses and usages is.
So once we have these clusters of words that co-occur with this word in different usages, we can
use these clusters to determine what sense is present in a given context, in a given sentence, by just comparing the actual context to this global sense context.
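(A minimal sketch of that induction and assignment step, assuming the co-occurrence graph from above and some graph clustering function handed in as a parameter -- the clustering algorithm itself comes up later in the talk. Names are illustrative.)

```python
def induce_usages(word, graph, cluster_graph):
    """Locate the word in the co-occurrence graph, take its neighborhood
    plus the edges among the neighbors (but not the word itself), and
    cluster that subgraph.  Each cluster of neighbors is one usage."""
    neighbors = set(graph.get(word, {}))
    subgraph = {u: {v: w for v, w in graph[u].items() if v in neighbors}
                for u in neighbors}
    return cluster_graph(subgraph)      # assumed to return a list of word sets

def assign_usage(context_words, usage_clusters):
    """Bag-of-words disambiguation: pick the cluster that overlaps most
    with the words actually seen around the target occurrence."""
    if not usage_clusters:
        return None
    scores = [len(set(context_words) & cluster) for cluster in usage_clusters]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] > 0 else None   # None = no evidence
```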
This is an example from the British National Corpus for hip: the co-occurrences of hip, where these colors are the classes as returned by the graph clustering algorithm. And what we see is, okay, there's two big meanings of hip. One is the body part hip and one is the music kind of hip, so here we have words like hip hop, reggae, loop, mainstream, album. Here's hip hip hooray; that's my favorite one.
And these things fall into the body part meaning: this is a clothing thing, this is more medical, and this is some boxing position. So what we see here is very fine-grained. It's like different fields where hip is used. But it's still the same hip, you might think.
Once we have this and sentences are coming in, we can compare these sentences to these clusters and have some kind of trigger words that put these sentences into a class. So this sentence, "A hip holds a pistol," and we know this sentence belongs to this cluster. Sometimes we have sentences that could belong to two or more, and these are actually sentences with high overlaps; so here we find this and jazz and reggae. So this is a very simple view of how one could approach this and how we did it.
That would not tell us yet what words we are able to substitute with hip. So this is not the full
story here.
But it's at least a step towards it. Because if we don't know what the distinctions are, we can't decide what to substitute, and when to substitute with different things.
And it's not trivial, because hip in the medical sense is probably replaceable by its Latin term, but for hip in a clothing usage that's probably not appropriate.
So we set out to evaluate this, because colorful clusters are nice but numbers are better for that. We used Amazon Mechanical Turk. It came up in an earlier meeting that not everyone knows what that is, so let me explain it to you. That's a service by Amazon where you can put up human intelligence tasks that would otherwise have to be solved by artificial intelligence, which basically means humans performing stupid tasks which could probably be performed by computers.
And you give them something like this. So that's what they see. This is a task. Here's some
instructions. And we ask them a question like: Are these words used for the same meaning or
not. We give them a pair of sentences that contains the same word. I know you can't read it.
And we let them decide whether this word has been used in the same sense, in a similar sense,
in a different sense, or maybe they cannot tell. And we did that in groups of five, so each question is asked of five different people. That makes sense because people get paid for that -- not really much; it's ridiculous how little they get paid for that -- but still, people tend to pick randomly, so you want to level that out. It's not high-quality annotation, but you can always get good mileage by giving it out to more people.
And what we found is: if the clustering said it was the same usage, Turkers came back with over 93 percent saying, yeah, it was identical or similar usage. Whereas if the clustering said it was different, it could just as well have been identical, so the clustering splits too much.
That's not critical as long as we get our substitutions right later. It would be much more critical if things that are not similar ended up in the same group. And this might give you an idea of how good single Turkers are: this is for all judgments and this is for the majority vote.
And the signal for the majority vote is much better than for the single judgments; that's what we use in the Amazon Turk approach. I'm going to show you some more of these tasks later.
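(Since each pair is judged by five Turkers, the aggregation is a simple majority vote; a minimal sketch, with the label names as assumptions.)

```python
from collections import Counter

def majority_vote(judgments):
    """Aggregate the five redundant Turker judgments for one sentence pair,
    e.g. ["identical", "similar", "different", "identical", "cannot tell"]."""
    label, votes = Counter(judgments).most_common(1)[0]
    return label if votes > len(judgments) / 2 else "no majority"
```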
So we were encouraged by that, and we decided to start a project we call the Sensinator -- it's kind of a Powerset naming scheme to call everything -inators; we have a volutionator and whatever other -inators.
And this is like the two-sentence big picture, and what you see in green here are the components we use for that. What we set out to do is to use grammar features, because we have this nice XLE parser. And we distinguish word usages, which we do by this graph clustering.
And we don't do it over co-occurrences but over words that are contextually similar to other words, so that have a high distributional similarity. And that's the distributional thesaurus. Once we have this clustering, we compute global features for these clusters, and these characteristics we use for disambiguation, like measuring the overlap between cluster and document, or a larger chunk.
So we don't necessarily want to go for single sentences; we can go for several sentences and just disambiguate if we have enough evidence. So there's a one-sense-per-discourse assumption hidden here.
And since colorful clusters are good but control is better, we want to measure the quality against an evaluation set which we obtain by collaborative tagging. So we're going to use Amazon Turk to get an annotation corpus like that.
In the remainder of the talk I'll dive into all these points. And it's a pity you didn't buy us a little later, because that's not quite finished yet. But we're working on that.
So, first of all, what is our document representation? We don't do bag of words anymore; we have these grammar features. Now we do tuples, and a document is represented by glued-together tuples extracted from grammatical relations. We have relations between words of the same part of speech and across parts of speech. We treat verbs in different subcategorization frames as different words, which makes sense when trying to disambiguate words, because there's a high correlation between subcategorization and meaning. And what would that look like technically?
So this is a sentence, "While leaving the facility briefly she quickly returned on February 22nd," and I just highlighted the tuples for these three words: leaving, facility and she. What we get back from this -- this is an article about Britney Spears, it was sentence No. 126 -- is, for the word facility as a noun (please ignore these funny characters), a feature: this word has been the object of "to leave" in a verb/subject/object frame, with a frequency of one in that sentence.
So you could think about aggregating over articles and having higher frequencies here. Vice versa, the word "to leave" in that subcat frame has a feature that "facility" was in the object position and "she" in the subject position, and vice versa for "she" we have that feature.
So this is how we transform our documents in the first place. And this, of course, gives us a much more fine-grained view of language than we would have obtained with bag of words at the sentence level, where we could not distinguish between she and facility with respect to leaving.
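(A minimal sketch of how such per-word features could be derived from extracted grammatical relations; the tuple format and feature encoding here are illustrative assumptions, not the actual Powerset representation.)

```python
from collections import Counter

def features_from_relations(relations):
    """relations: (verb, frame, slot, argument) tuples from one parsed
    sentence, e.g. ("leave", "V/SUBJ/OBJ", "OBJ", "facility").
    Each relation yields features in both directions, as in the talk:
    the argument records which slot of which verb frame it filled, and
    the frame-specific verb records which word filled each slot."""
    feats = Counter()
    for verb, frame, slot, arg in relations:
        feats[(arg, f"{slot}-of:{verb}#{frame}")] += 1
        feats[(f"{verb}#{frame}", f"{slot}:{arg}")] += 1
    return feats

# "While leaving the facility briefly she quickly returned on February 22nd"
sentence_feats = features_from_relations([
    ("leave", "V/SUBJ/OBJ", "OBJ", "facility"),
    ("leave", "V/SUBJ/OBJ", "SUBJ", "she"),
])
# Aggregating these counters over all sentences of all articles gives the
# higher frequencies mentioned above.
```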
And from that we compute the distributional thesaurus, which is pretty much a standard
component that you compute from the distributional similarity statistics.
So you compare words along common contexts. And unlike an ordinary thesaurus, you have for each word a ranked list of entries. So it's not just an entry populated by a bunch of words, as in the Microsoft Word thesaurus; it looks much more like this. For example, the entry for "meeting" would have "gathering" with a score of 56 and "seminar" with 49, and going down like this.
And you should cut off somewhere, maybe. There's a lot of parameter space to explore. And these words are not synonyms, and these words are not necessarily substitutable -- well, sometimes they are; meeting and gathering is a good one.
But if you query for PowerPoint, you don't want to get Excel, you want to get PowerPoint. So these are words that behave similarly. What we see in the PowerPoint entry is something maybe unexpected: there's a couple of file formats here, because you could save something in the PowerPoint format.
So that's a different usage. So PowerPoint is ambiguous -- who would have thought that? And this is how we do it. This is how we arrive, over a couple of steps, from text broken down into these kinds of tuples to a distributional thesaurus; there's a bunch of parameters here. We compute significance between word-feature pairs, because we don't want features that apply to everything, like "large" as an adjective. Many things can be large.
There's some pruning. And there's a trick here that avoids comparing each word to each other word. You might think that if you have a million words, we have to compare a million words to a million words, and that matrix is too big even for parallel computing.
But we shortcut here a little bit by using the fact that this similarity matrix will be sparse, and we just compare things that have at least one feature in common. And all these boxes, all these steps, are implemented as MapReduce jobs -- a little bit like Cosmos and SCOPE, in Microsoft lingo. So that's parallelized, and we compute the distributional thesaurus in maybe four hours for Wikipedia on 60 cores. So that might easily scale up to large parts of, say, the English web.
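(A minimal single-machine sketch of that comparison step, assuming each word already has a pruned set of significant features; the inverted index realizes the "only compare words that share at least one feature" shortcut. In the real system this runs as parallel MapReduce jobs; names here are illustrative.)

```python
from collections import defaultdict, Counter

def distributional_thesaurus(word_features, top_n=10):
    """word_features: {word: set of significant, already-pruned features}.
    Similarity = number of shared features.  The inverted index ensures we
    only ever compare word pairs that share at least one feature, which is
    the sparsity shortcut described above."""
    inverted = defaultdict(list)                 # feature -> words carrying it
    for word, feats in word_features.items():
        for f in feats:
            inverted[f].append(word)

    similar = defaultdict(Counter)
    for f, words in inverted.items():            # features that apply to everything
        for i, a in enumerate(words):            # should already be pruned away
            for b in words[i + 1:]:
                similar[a][b] += 1
                similar[b][a] += 1

    # Ranked entry per word, e.g. "meeting" -> [("gathering", 56), ("seminar", 49), ...]
    return {w: c.most_common(top_n) for w, c in similar.items()}
```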
Okay. So now we have this distributional thesaurus, but since it's unsupervised, like clustering, it's hard to evaluate and assess quality.
Since we didn't know how to evaluate this thing, we set out to invent a couple of evaluation
methods and check whether they agree.
>: Speaking of evaluation, on the previous slide, did you evaluate each step? Like, for example, do you have statistics about how many tuples were extracted correctly versus incorrectly, or did you just evaluate [inaudible]?
>> Chris Biemann: In this case we evaluate the whole end-to-end thing. Of course it's not the case that all tuples are correct. Other evaluations showed -- please correct me, Dick, if I'm wrong -- somewhere in the ballpark of 80, 90 percent. It depends on what tuples you go for: it could be subject-object relations and noun coordination and compounding, which we all use.
Distinguishing obliques and adjuncts is kind of hard, and we don't know whether there's a difference at all. So it's kind of a hard question. There are evaluations like that at Powerset, but not for that specific purpose.
So we drummed up four different evaluations for this thesaurus. One is a visual evaluation: give team members a couple of entries from a couple of distributional thesauri and check which one looks best. Very informal.
We also measured WordNet overlap. So for a large number of nouns and verbs that have entries in the distributional thesaurus, we checked the WordNet distance. We used the Jiang-Conrath measure for that, since it's been proven useful in a number of tasks, and checked, for the top, say, 10 entries, what the average distance in WordNet is.
And we would like to get the distributional thesaurus with the shortest distance, with the highest similarity. We also used the Turkonym rank. What is a Turkonym? A Turkonym is a synonym elicited via Amazon Turk. Very simple task: you give the word and ask Amazon Turk to return a list of synonyms, so people just enter synonyms. And it's pretty good what comes out. We have like 250 pages. Very small experiment.
And we check how often we can find them in the distributional thesaurus and how highly they are ranked, because we want similar words, and synonyms are clearly similar words.
And we also implemented the most frequent sense finding method of McCarthy et al. This is basically taking the distributional thesaurus entry and checking in WordNet which sense is most frequent, by checking overlaps with the different regions of WordNet where you find senses of the word and trying to find the highest one.
We also labeled a couple of sentences, a random sample, to get an idea of what the most dominant sense is, and tried to check whether we can find it with that distributional thesaurus. What we found, not surprisingly, is that all four methods highly agreed. So even though they're kind of different -- I mean, Turkonym rank and WordNet overlap are a little similar, in that the thesaurus entries also contain synonyms.
What we found is that we can safely use the automatic methods, which would be WordNet overlap and Turkonym rank, and skip the visual evaluation and the sense distribution labeling, since they give the same results.
This is unfortunately only evaluating the distributional thesaurus and not the whole system. But since we had too many parameters for that thesaurus, we tried to step back and do that first to get some rough idea of what could be good.
So what came out of that was: if you want to compute a distributional thesaurus for that purpose, per word only use the 200 most significant features. You could use 300, but don't use much more. And the more frequent a feature is, the less weight it should have.
So a really rare feature that applies only to a handful of words gives you much more information than a very frequent feature like "she" as a subject or "large" as an adjective modifier, stuff like that.
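(A minimal sketch of those two findings -- keep only the top-200 most significant features per word, and down-weight features by how many words they apply to. The exact weighting function is an assumption; the talk only says that more frequent features should count less.)

```python
import math

def prune_and_weight(word_features, feature_spread, max_features=200):
    """word_features: {word: {feature: significance score}}.
    feature_spread: {feature: number of distinct words it applies to}.
    Keep only the most significant features per word, and give widely
    applicable features less weight (1/log here, purely as one plausible
    realization of 'more frequent feature, less weight')."""
    pruned = {}
    for word, feats in word_features.items():
        top = sorted(feats.items(), key=lambda kv: kv[1], reverse=True)[:max_features]
        pruned[word] = {f: 1.0 / math.log(1.0 + feature_spread[f]) for f, _ in top}
    return pruned
```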
So now we have an idea of what distributional thesaurus to use, and we can go ahead and cluster it. Just as we saw with that hip example, the idea is that similar words within a distributional thesaurus entry are clustered together, because they are contained in each other's entries and have a similarity score.
And just like in the previous examples, the clusters will serve as sense representations and will be used to build a context model. So we can collect the features of the words we find in a cluster, aggregate them up, and use these in the context-based disambiguation.
For clustering, we use Chinese Whispers graph clustering, which has a bunch of nice properties. First and foremost, it finds the number of clusters automatically.
And this is crucial for word sense induction. You can't just set out and say, okay, everything has two meanings, when you have a word like father which is super frequent but basically only has one meaning.
Another thing is it's very efficient, so we can easily do that for a wide range of the frequency spectrum. And it's proven useful for natural language data in general, because graphs induced from natural language usually have the properties found in such networks, like scale-free, small-world structure and clique-ish neighborhoods, and that comes in handy.
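(Chinese Whispers itself is simple enough to sketch in a few lines: every node starts in its own class, and nodes repeatedly adopt the strongest class among their neighbors, so the number of clusters emerges on its own. A minimal sketch, not the production implementation.)

```python
import random
from collections import defaultdict

def chinese_whispers(graph, iterations=20, seed=0):
    """graph: {node: {neighbor: edge weight}}.  Returns {node: cluster label}.
    Every node starts as its own cluster; in randomized order, each node
    adopts the label with the highest total edge weight among its neighbors.
    The number of clusters is not fixed in advance."""
    rng = random.Random(seed)
    label = {node: node for node in graph}
    nodes = list(graph)
    for _ in range(iterations):
        rng.shuffle(nodes)
        changed = False
        for node in nodes:
            votes = defaultdict(float)
            for neighbor, weight in graph[node].items():
                votes[label[neighbor]] += weight
            if votes:
                best = max(votes, key=votes.get)
                if best != label[node]:
                    label[node] = best
                    changed = True
        if not changed:          # converged
            break
    return label
```

Grouping the nodes by their final label then gives the clusters that serve as usages.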
So let me show you an example from our best-found distributional thesaurus. We clustered the entry of arm as a noun, and we used the first 100 entries for that. And what we found is we get three clusters here. One is mostly body part; if you look closely, you could probably find things like armchair. And, yeah, since it's 100 words, of course, some of them are quite far away from arm, like mirror.
But anyway, another cluster here is much smaller; it's firearm. And the smallest is branch, as in the arm or the branch of the company.
>: Suggest that you --
>> Chris Biemann: Huh?
>: An adjustable rate mortgage. It's an acronym.
>: Financial term.
>> Chris Biemann: I didn't know that. That's good. I was wondering.
>: Room is the outlier.
>> Chris Biemann: Room is the outlier. [Laughter] Bear in mind this is Wikipedia, so all those things -- we've seen that a lot recently in the news -- don't affect this.
>: But I suspect Hammer is Arm and Hammer, the company.
>> Chris Biemann: That may be a named entity recognition issue. Unfortunately, that is also algorithmic, so I cannot blame anybody else for that.
>: So I have a question about -- you mentioned this distinction, in the distributional similarities and the ambiguity, between things that are synonyms and things that are similarly distributed but are not substitutable for each other, like body parts or colors.
>> Chris Biemann: Right. We cannot distinguish that.
>: That's an interesting problem. I was wondering if you had any --
>> Chris Biemann: I have something in the further ideas section about that. But it's like a crucial core problem for everything like ontology learning and taxonomy induction.
There's a huge body of research about that, and nobody really came up with a really good solution, I guess.
>: On a practical level, [inaudible] has this problem as well, because they do a lot of prediction on queries and logs for suggestions and a whole number of things. [Inaudible] things that the distribution turns up are not synonyms.
>> Chris Biemann: Right. In the query operation project, for example: it might be just two companies manufacturing the same thing, but clearly, if you look for one you don't want the other. It's like this PowerPoint/Excel example.
Okay. So let's look at the features we get from these clusters. The body part features look like: you could attach something to it; it's coordinated with leg, as in arm and leg or neck and leg; you can break it; it's coordinated with body; it can be long; it's round. So these are the most frequent ones, and here are the frequency counts for that.
For firearm, they look entirely different. And it's also striking that the kind of feature looks
different. So the first part is always the kind of feature we have. So these are more the
coordinations with other nouns, like arms and explosives or arms and ammunition, and you can
grab it. You can use it. You can hold it.
You could equip something with it. And for that branch or what was that mortgage --
>: Mortgaging is another one.
>> Chris Biemann: But that looks more like the charitable arm of the company, right.
>: Yeah.
>> Chris Biemann: We are going to have the problem that these kinds of minor senses might get conflated together. But let's skip that; on the other hand, they're minor senses, so we have very low counts here, which means we have very low confidence here. That means for that low-confidence stuff we'd just better leave it out and not expand. Maybe we start with the low-hanging fruit of the big senses first and just have a method for detecting whether or not we should do it.
If you have a constructed sentence like this, like the commercial arm of the company produces
arms and guns for people who broke their arms, then it's pretty clear that you could use this
grammatical context here to disambiguate.
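(A minimal sketch of that kind of context-based disambiguation against aggregated cluster features, including the abstain-when-unsure behavior just described; the feature encoding, the threshold and the names are illustrative assumptions.)

```python
def disambiguate(context_features, sense_features, min_score=2.0):
    """context_features: features observed around one occurrence of the
    target word, e.g. {"coord:ammunition", "obj-of:grab"}.
    sense_features: {sense_id: {feature: aggregated count}} collected from
    the words in each cluster.  Returns the best-scoring sense, or None if
    the evidence is too weak -- in which case we simply do not expand."""
    best_sense, best_score = None, 0.0
    for sense, feats in sense_features.items():
        score = sum(feats.get(f, 0) for f in context_features)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense if best_score >= min_score else None
```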
All right, this is a slide that illustrates how we arrive from these clusters at these features.
So imagine we have that in place; it's currently being coded. Now we're going to, again, evaluate it. And for that we set out to build a sense-labeled evaluation corpus. And it's a different kind of corpus than what we find in repositories in the literature, because basically what we want is sentences containing a target word, which is marked up as such.
And we want to know whether a group of sentences contains the word in the same meaning. So we're not interested in a positive definition of that meaning, or a gloss, or a definition; we just want to distinguish these meanings in the first place. And what we'd like to have for our search application of semantic expansion in the index is to substitute these target words per group, per meaning. So what's a viable substitution, given that we know the meaning?
What we'd like to have is that the distribution and the coverage of senses should be the same as in our target collection, so it should match the underlying corpus we draw these things from. This is another argument against SemCor. And the senses should be grouped by the same substitutes, and not by the same entities or whatever else you might think of as viable distinctions on a semantic level -- which is of course a question of granularity.
So coming back to that hip example, probably we do not want to group together the clothing hip
and the medical hip. Different definitions. Probably we won't. I'm not sure about that yet.
So this is a Turkonym bootstrap cycle that allows us to get an evaluation corpus.
We actually executed that for a bunch of high-frequency words as a pilot. So we start here: each word runs through that cycle until it finally finishes. For each word we draw a couple of sentences that contain that word from our corpus, and we try to get different meanings, so we can use any word sense induction and disambiguation system we have in place.
In this case we used that bag-of-words one. That's not crucial, but it speeds up the process a little. Then we have a low number of sentences, say 10 sentences, and we get Turkonyms in context; that is, we present a sentence with a marked-up word to Turkers and ask them to supply synonyms in that context.
Then once we have these sentences with Turkonyms, we can cluster the sentences according to their Turkonyms, putting the sentences that get the same substitutions into the same cluster, and for each cluster we take the most prototypical example, however we get it.
So we aim at one sentence per usage. Then our corpus comes in again: we select, say, 100 random sentences from the corpus and have Turkers match the meaning against this one sentence per usage. And we also give them an exit: it's impossible to tell, because that sentence is really too short. Or we give them an exit saying it's not covered in these sentences.
So the new sentence that comes in is not in the sense model we already have. If we can reliably assign most of these sentences, then we assume that we captured most of the senses in the corpus and we're done.
If you have a lot uncovered, you send them up here and repeat, and go on like this. So this is what the Turk tasks look like. Again, a little small, but it's a simple task. Like here it's "find substitutable words": the sentence with the bolded word, and they can enter substitutes.
And we would pay them one cent for completing this. For "match the meaning" there are much more instructions, and then we have a sentence with a marked-up word, our prototypical sentences, plus the two exits: not covered, and impossible to tell. And you would put 10 of these sentences with the same target word in one HIT and pay them three cents. So this is really, really cheap.
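(The sentence-grouping step of that cycle can be sketched as grouping sentences whose elicited substitutes overlap; the overlap criterion and the greedy merging are assumptions, just to make the idea concrete.)

```python
def group_by_turkonyms(sentence_subs, min_overlap=1):
    """sentence_subs: {sentence_id: set of substitutes supplied by Turkers}.
    Greedily merge sentences whose substitute sets overlap; each resulting
    group is treated as one usage of the target word, and one prototypical
    sentence per group goes into the 'match the meaning' task."""
    groups = []                         # list of (sentence ids, pooled substitutes)
    for sid, subs in sentence_subs.items():
        for ids, pool in groups:
            if len(subs & pool) >= min_overlap:
                ids.add(sid)
                pool |= subs
                break
        else:
            groups.append(({sid}, set(subs)))
    return [ids for ids, _ in groups]
```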
But still, it's quite reliable. So for case, it came back at some point with two senses; if you iterate longer, you get more, maybe. That was involved in that case, the restriction case. If you look at the cases that have been aligned here, they are all clearly the lawsuit case.
Whereas this one is an interesting one: for "case of fabricated evidence," a bag-of-words model would probably assign lawsuit there, because it's fabricated evidence, which you can find in the courtroom. But this is of course the case as in situation.
And this is all reliably marked up here. And we want to use this as an evaluation corpus first and foremost. But we are, of course, looking into using that for a supervised disambiguation system and training. But you're going to see the results of our other system first. So what we currently have is data for about 100 words: sentences from about 15,000 articles, about 70,000 sentences, which total about 110,000 interesting tuples -- tuples that contain our target word.
And that cost us about 300 bucks. So this is much, much less than went into SemCor and other kinds of things. It's not that elaborate. But for our purposes that's pretty much what we need.
And these things we're going to use for evaluation. I'm slowly coming to the later stages, so right
now I'm talking more about things that we want to do as opposed to things that we already did.
And this is how to tackle the substitutes. From this process we get all these substitutes; we could use them, but we only have them for the evaluation corpus.
What we want is substitutes for all words. And what we can do, in a first system, is to map that to WordNet. And that might even be desired, because there are a lot of NLP pipeline steps and other resources, like FrameNet and SUMO and [inaudible], that are mapped to WordNet. Sometimes it might be advisable not to construct everything from scratch but to take what's there and align to it.
There's this method that I mentioned earlier, by McCarthy et al., that has been used to determine the most frequent WordNet sense using a distributional thesaurus, by looking at the hierarchy and seeing where the words of the entry come up and what is closest.
And once we cluster the distributional thesaurus entry, we could do that not for the full entry but for the single clusters, and maybe get some idea of how to map WordNet senses to cluster senses. This will not be at all trivial, because the granularities do not match and the distributions do not match. But it's at least some path to it.
So for our arm example, there's this part of WordNet that has arm as branch and subdivision, which is kind of a division, and there's arm as a weapon, which has hypernyms like weapon and instrument. If you overlay that with the cluster result, it's pretty clear which sense is meant here.
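(A minimal sketch of such an overlay, in the spirit of the McCarthy et al. idea, using NLTK's WordNet interface: score each WordNet sense of the target by its similarity to the cluster members and pick the best one. The choice of path similarity, and NLTK itself, are assumptions made for illustration.)

```python
# Requires NLTK with its WordNet data installed (nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

def map_cluster_to_synset(target, cluster_words, pos=wn.NOUN):
    """Score each WordNet sense of `target` by its similarity to the words
    of one induced cluster and return the best-scoring synset, e.g. the
    {weapon, ammunition, explosive, ...} cluster of 'arm' should land on
    the weapon sense."""
    best_synset, best_score = None, 0.0
    for synset in wn.synsets(target, pos=pos):
        score = 0.0
        for word in cluster_words:
            sims = [synset.path_similarity(s) or 0.0
                    for s in wn.synsets(word, pos=pos)]
            score += max(sims, default=0.0)
        if score > best_score:
            best_synset, best_score = synset, score
    return best_synset
```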
And once we do that, we could use the substitutions that WordNet gives us. So arm-3 would be substitutable with weapon, and arm-4 would be substitutable with branch and subdivision. And when we match query and passages, we might even use hypernyms and all the things we currently do in the Powerset search engine.
>: Do you have cases where you would have hypernyms that are not included in WordNet? A typical example: a tulip, it's not a flower; a rose is a bush, it's not a flower.
>> Chris Biemann: Lots of [inaudible] things.
>: How do you handle those cases? Because they must be --
>> Chris Biemann: In our first system for the substitutions we don't handle them. What I would hope is to find a way to get all these substitutable things from the corpus, either entirely unsupervised, or weakly supervised by training that up with existing things, or by having Turkers tick off the good things and throw out the bad things.
And I think this is the only way to go in the long run, because we want to adapt that to other domains, other languages and so forth, and also make it extensible to new lingo and terminology, which is clearly not possible if you have a fixed inventory.
But we're not concerned with that right now; we keep it in mind. And one thing is to probably use several data sources.
So in the meeting before, we saw these interesting similarity lists, like paraphrase lists coming from machine translation, which looked pretty similar to these distributional thesaurus entries anyway. And you could think about pattern-based methods, like [inaudible] patterns, like looking for "X is a Y," and combining these with resources, because very shallow pattern-based things hit a couple of good things and also a lot of noise. But the more sources you combine, the better you get at the real picture, we would hope. Question?
>: How much do you -- it's more of a suggestion, actually.
>> Chris Biemann: We're getting very close. Maybe we could do that in discussion.
>: Sure.
>> Chris Biemann: We're already in the further ideas. So the pattern thing is up here. Another thing up here is a confidence-first strategy, like disambiguating only the really safe bets. So we have a lot of trigger words where we know exactly that this must be that sense, and once we have that we can learn new trigger words from these occurrences to self-train the model, and even use other triggers, like domain-based triggers, or just bag-of-words features along these lines.
And we can also use Turkers to align WordNet with our sense inventory if we want to, and we can also use Turkers to join senses in WordNet, because WordNet is kind of too splitty; it makes too many distinctions for IR.
So immediate next steps would be to evaluate how much we gain from the grammar features. We want to assess coverage. We want to directly evaluate the WSI and WSD -- so, induction and disambiguation -- with the Turkers, and eventually do relevance testing. For that we might need targeted passage-query pairs.
And on the front of feature engineering, we want to experiment with more and more precise
grammar features. And maybe include these topical features, trigger words, or even like domain
information.
So to conclude, I showed you a data-driven approach to word sense induction and disambiguation, which is characterized by minimal manual development, because the inventory is induced, and it needs only part of the pipeline; even if you don't have a parser, you can always fall back to bag of words. It's possible to map that to existing inventories. I showed you some evaluation strategies and the next steps. So this is the end of my talk. I want to thank you very much for your attention. I'm open to questions and comments, and thank you very much.
>> Chris Biemann: So -- there was that suggestion.
>: So if there's a concern over finding words in this list that are safely substitutable -- you mentioned one reasonable option is to try to find overlap with patterns that also indicate substitutability. An idea I think that sometimes gets lost is that there are actually very clear patterns that indicate the opposite of substitutability. And you can mine for those in text and do the sorts of tricks like bootstrapping from a few words that aren't substitutable: you can find where they occur, the types of patterns they occur in. Excel and PowerPoint actually co-occur all the time, in patterns that will tell you you should never substitute them for each other.
>: Like what kind of patterns?
>: "And."
>: "And" other than is and such as.
>> Chris Biemann: That would qualify them as siblings, right? There's even work that boils that down to: if you have a high-frequency word between X and Y, and you see them also in the other direction, then they are siblings. That's an ACL 2006 paper; I was impressed that it works for at least 20 languages. That's a good one. So we can learn a lot from all that ontology induction literature, even just trying to get the relations right, because if we know that, we might mine for what the common hypernym is.
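(A minimal sketch of that kind of sibling mining over plain text: count "X and Y" coordinations and, when a pair shows up often enough in both directions, flag the two words as siblings that should not be substituted for each other. The regex tokenization and the thresholds are simplifying assumptions.)

```python
import re
from collections import Counter

def sibling_pairs(sentences, min_count=2):
    """Mine 'X and Y' coordinations from raw sentences; if a pair is seen
    often enough in both directions, treat the two words as siblings
    (co-hyponyms) that should NOT be substituted for each other,
    e.g. Excel and PowerPoint."""
    counts = Counter()
    pattern = re.compile(r"\b(\w+) and (\w+)\b")
    for sentence in sentences:
        for x, y in pattern.findall(sentence.lower()):
            counts[(x, y)] += 1
    return {frozenset((x, y)) for (x, y), c in counts.items()
            if c >= min_count and counts[(y, x)] >= min_count}
```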
>: I know that a lot of people in this audience are interested in ontologies. Is there a follow-on
story for ontology induction?
>> Chris Biemann: I dug into ontology induction a couple of years ago. I bailed out and thought it was not worth the trouble, because in the strict sense of ontology, as a formal ontology with all that entails, you want to do inference on it, and a lot of things. And what I saw so far is that there are these top-level ontologies that come from the top down, and these kinds of methods come from the bottom up. And there's a gap, and this gap is not easily bridged. So you could probably do something like ontology population.
So if you have quite a good skeleton, a good taxonomy, you could find, say, instances; you could put named entities in the various places. I think these kinds of things would work.
But constructing that bottom up from scratch will not lead you to the top level. Because these
things are not expressed at all. And I think humans do not learn that by language but by
interaction with the world.
If you look in corpora, that's the part we're missing. That was a very high-level ontology view.
>: Ontology induction, as in using natural language processing to build an ontology.
>> Chris Biemann: Well, it depends on what you -- yeah.
>: The induction.
>> Chris Biemann: Yeah, I was speaking about first getting that taxonomic backbone that always underlies an ontology, by looking at text and applying NLP methods to get at this. But that doesn't mean that these are not right.
>: You can use ontologies in other ways. You can use them to represent, to extract information into ontology structures so you can query on things.
>> Chris Biemann: Not explicitly. What we do is we query Freebase and extract a lot of facts from Freebase, and Freebase is kind of a big database of, say, named entities and a lot of data on them. So we have a component that tries to translate a query into a Freebase query; it's kind of a natural language interface to a database, so to speak. And we would get this, what you call an ontology, from Freebase.
We don't maintain our own ontology. What we do is something like instance classification into the WordNet hierarchy, and currently we're using mostly the WordNet hierarchy.
> Lucy Vanderwende: Thank you very much, Chris. I think everybody knows the e-mail address.
So please continue to ask questions. Thank you very much. I think that unsupervised methods
and induction of word sense is a very -- it's one of the most interesting new directions of the field.
So thank you.
[Applause]