>> Emily Prud'hommeaux: So my name is Emily Prud'hommeaux. I'm a
graduate student at the Center for Spoken Language Understanding at
Oregon Health and Science University. And I work with Brian Roark.
And today I'll be talking about graph-based alignment of narratives for
automated neurological assessment.
So to motivate the problem a little bit, I'd like to point out that
neuropsychological exams often rely on spoken responses elicited by
different types of stimuli. For instance, you might be asked to name
as many vegetables as you can in a minute, or you might listen to a
narrative and have to retell it.
And, of course, these are usually part of normed instruments, so there
are standardized scoring procedures that are usually carried out on
the responses, but you can also think of the responses as a source of
language data that could be analyzed for other kinds of diagnostic
markers.
So I'll be talking about narratives today. There are different kinds
of narratives that can be elicited. Personal narratives. You can do
narrative generation, where you narrate a picture book or a story from
a series of pictures or a cartoon or video clip. Or -- and this is one
of the most common ones -- the subject listens to a story and has to
retell the story to the examiner.
And that's usually scored in terms of how many story elements were
used. So there's information in the original story. And they're
graded on how well they replicated that information.
Now, seniors with Alzheimer dementia produce fewer of these key story
elements in their narrative retellings. And so the question here is
whether we can use these narrative scores to detect very early stages
of dementia.
The earliest stage of dementia that's usually recognized is called mild
cognitive impairment. I'll talk a little bit about that in detail
shortly.
So can we identify it? Can we identify MCI using these narrative
scores, and also can we do it automatically, because we're computer
scientists and we like to do things automatically.
So the goal is to develop an objective, automated tool for detecting
MCI that relies on scores derived from narrative retellings.
And the way we're going to do this is to think about retelling as kind
of a translation. So imagine that the story, you hear it and you
translate it into your own language. It's the same language, but it's
your own special idiolect of that language. So I'm going to use
existing word alignment tools that are used for machine translation to
create machine translation style word alignments between retellings and
a source narrative and improve those alignments using a graph-based
method.
And then I'm going to extract scoring information from the alignments
and use those scores as features for diagnostic classification.
So a little overview of the data. I mentioned mild cognitive
impairment. It's characterized by impairments in cognition that are
significant but don't interfere with your daily life so you can still
drive your car and balance your checkbook. You know who your
grandchildren are and all those things. But it is clinically
significant. It's real. It's happening. Your cognition is starting
to decline.
But because it's so subtle, it's very hard to detect with something
like the Mini-Mental State Exam or other instruments like that that are
used to screen for dementia. MCI is diagnosed through a long interview
by an expert with the patient and also with someone who can corroborate
what the patient says, like a spouse or other family member.
And we're going to use the clinical dementia rating, which is one of
the techniques for diagnosing different levels of dementia. And it is
one of these interview-based techniques. Typical aging is a clinical
dementia rating of zero. MCI is a clinical dementia rating of 0.5.
This is how we'll interpret it.
Severe dementia would be like a clinical dementia rating of 3. And I'd
like to point out the diagnosis does not rely on any narrative recall
task. So the narrative recall task is completely independent of the
process by which MCI is diagnosed.
So this is our data. We have, at OHSU we're a medical school, we have
something called the Layton Center for Aging and Alzheimer's Disease
Research. And they have this very long NIA-funded longitudinal study
on brain imaging where people come in, they give them the mini mental
state exam, have them do a bunch of activities and tasks in their
interviews.
And during this exam they're given the interview by which
you can diagnose MCI. And so we have 72 subjects with MCI and 163
subjects without. There are no significant differences in age or years
of education.
And there's also additional 26 subjects who have more advanced dementia
or were too young to be eligible for this study or had a diagnosis that
changed back from MCI to typical aging, and so we didn't want to
include them in either diagnostic group since it's not clear which one
they're actually in.
The narrative task we'll be talking about is the Logical Memory subtest
of the Wechsler Memory Scale. The Wechsler Memory Scale is a widely
used instrument for assessing memory in adults. It's been used for 70
years, and the story that we'll be talking about has actually been used
for 70 years as well.
So the examiner reads a story to the subject, and the subject has to
retell that story immediately and then after a 20-minute delay. The
score is how many story elements were used in each
retelling. The identities of the story elements are noted. But
they're not used as part of the score for the Wechsler memory scale.
This is the narrative itself. And you can see that the slashes denote
the boundaries between elements. So there are 25 story elements. This
is a sample retelling.
And the underlined items are the recalled, correctly recalled elements.
So you can see that Ann is a correctly recalled version of Anna. But
Taylor is not a correctly recalled version of Thompson. And that's
because the published scoring guidelines give pretty explicit
directions about what are acceptable lexical substitutions and what are
not, what sorts of paraphrases would be acceptable.
This person gets a score of 12 out of 25. So I mentioned that I'm
thinking of retelling as a translation and that I want to do word
alignment. So instead of translating from German to English, I want to
translate from the story to the way someone rendered that story.
And so the way statistical machine translation typically works is you
begin with a parallel corpus of sentences, where you have sentences in
one language on one side of the corpus and translations of those
sentences on the other side of the corpus in a different language.
And the idea is that you need to figure out which words are
translations of which other words. And the way you do that is through
word alignment. And this could be easy. It could just be like you
monotonically go through and line them all up, or it could be really
complicated where there's lots of word order differences and things
like that.
So you can't just use Levenshtein distance or something like that. You
have to do something smarter. This is usually done using the IBM
models, which were developed in the '90s when people first started
getting interested in machine translation again after many years. I'm
going to use expectation maximization to figure out which words align
with one another.
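As a rough illustration of the EM-based alignment idea mentioned here, below is a minimal sketch of IBM Model 1 style training in Python. It is not the speaker's actual setup (the talk uses the Berkeley Aligner and GIZA++); the corpus format and function names are illustrative assumptions.

```python
from collections import defaultdict

def ibm_model1(parallel_corpus, iterations=10):
    """Minimal IBM Model 1: learn p(target_word | source_word) with EM.

    parallel_corpus: list of (source_tokens, target_tokens) pairs (an assumed format).
    """
    # Collect the target vocabulary and initialize the translation table uniformly.
    target_vocab = {w for _, tgt in parallel_corpus for w in tgt}
    t = defaultdict(lambda: 1.0 / len(target_vocab))  # t[(tgt_word, src_word)]

    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(tgt, src)
        total = defaultdict(float)   # expected counts c(src)
        for src, tgt in parallel_corpus:
            for tw in tgt:
                # E-step: distribute each target word's mass over source words.
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: renormalize the expected counts into probabilities.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return t

def align(src, tgt, t):
    """Greedy alignment: each target word picks its most likely source word."""
    return [(i, max(range(len(src)), key=lambda j: t[(tw, src[j])]))
            for i, tw in enumerate(tgt)]
```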
And there are two widely used word alignment packages. Actually, GIZA++
is widely used; the Berkeley Aligner is not as widely used, but I found
that it gives more accurate alignments, and it lets you save out the
model and align other data you didn't train on, which, if it's possible
with GIZA++, it's not immediately obvious how you would do it.
It also saves out the posterior probabilities for those alignments.
That will be important for the graph-based method I'll be discussing.
I'll use these alignments to extract scores from narrative retellings,
which I'll then use for diagnostic classification.
So I said you needed a parallel corpus. And I have three parallel
corpora I've created just from the retelling data that I have, just
people retelling the story.
So the first one is small. The second one is a little bit bigger and
the third one is huge. So the first one is the source to retelling
corpus.
All I have on this side is the source narrative, and all I have on that
side are the retellings. This is about 500 or so lines long because
that's how many retellings we have. Corpus two is a word identity
corpus. We're just saying the word cook should align to the word cook.
And because we're doing monolingual alignment, this is a good
assumption to make.
If you see a word on one side and you see the word on the other side,
there is some high probability that they're going to align to each
other.
The third corpus, the huge corpus, is a retelling-to-retelling corpus.
So every pair of retellings. So this is 500 squared, 250,000 lines.
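A minimal sketch of how the three parallel corpora described above might be assembled; the token-list data format and the function name are illustrative assumptions, not the speaker's actual code.

```python
from itertools import permutations

def build_corpora(source_tokens, retellings):
    """Assemble the three parallel corpora described in the talk.

    Each corpus is a list of (side_a_tokens, side_b_tokens) pairs (assumed format).
    """
    # Corpus 1: source narrative paired with each retelling (~500 lines).
    corpus1 = [(source_tokens, r) for r in retellings]
    # Corpus 2: word-identity corpus -- each vocabulary word paired with itself,
    # encouraging identical words to align in this monolingual setting.
    vocab = sorted(set(source_tokens) | {w for r in retellings for w in r})
    corpus2 = [([w], [w]) for w in vocab]
    # Corpus 3: every ordered pair of distinct retellings (~500 squared lines).
    corpus3 = [(a, b) for a, b in permutations(retellings, 2)]
    return corpus1, corpus2, corpus3
```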
So what I do is I use the Berkeley Aligner to build two models. The
small model is just built on corpus one and corpus two, just the
retelling-to-source corpus and the word identities.
The large model is built on all three corpora. The difference is that
it takes like two or three minutes to build the first one and like 12
to 24 hours to build the second one. Very, very big difference in time
required to build these kinds of models.
So I'm going to test both models on the two retellings for each of the
experimental subjects so we can see how well they align to one another.
And then I'm going to use both models, because you can save out the
models with the Berkeley Aligner to align every retelling to every
other retelling. I'll be using that in my graph-based model.
So these are the -- I'm just looking at the time. Okay. So these are
the results of the alignment: the precision, recall, and alignment
error rate. Alignment error rate is like word error rate. It's sort of
a measure of the precision and recall of how many alignment pairs found
in the proposed alignment existed in the gold manual alignment.
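For concreteness, here is a small sketch of how precision, recall, and alignment error rate can be computed in the usual way from sure/possible gold links; the exact gold annotation format used in the talk is an assumption.

```python
def alignment_error_rate(proposed, sure, possible):
    """Precision, recall, and AER (Och & Ney style), given as a sketch.

    Assumes gold alignments are split into 'sure' and 'possible' link sets
    (sure is a subset of possible). Each argument is a set of
    (source_index, target_index) pairs.
    """
    a, s = set(proposed), set(sure)
    p = set(possible) | s  # possible links always include the sure links
    precision = len(a & p) / len(a) if a else 0.0
    recall = len(a & s) / len(s) if s else 0.0
    aer = 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))
    return precision, recall, aer
```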
You can see that you get a very large reduction in AER as you move from
the small model to the big model. Almost four points. Which is very,
very large improvement for word alignment. But those are still pretty
high.
State of the art on the Euro [inaudible] corpus with the Berkeley
Aligner is four, and we're at 20. So we can do a lot better.
So the idea we had was to use a graph-based method that uses random
walks on graphs. So you probably all know about PageRank, Google's way
of ranking Web pages. The idea is that you build this graph where the
nodes are Web pages and the edges are the hyperlinks between those Web
pages.
And if you were to just walk around on the graph you created like that,
the nodes that you end up on would be the more prestigious nodes, the
more important nodes.
LexRank is a way of ranking sentences for automatic summarization. So
for word alignment, the nodes are words in the retellings and in the
source, and the edges are the normalized posterior-weighted alignments
proposed by the Berkeley Aligner. We know the alignments and what
their posterior probabilities were.
Imagine if you had these four sentences. The source is on the bottom.
These are sentences from our subjects. You can see the boldface words
should all be aligning to touched. But suppose they didn't. Suppose
that when you did your alignment of just the source to the retellings,
you only got moved aligning to touched and sorry aligning to touched.
So then you align every other retelling, every retelling to every other
retelling, and you uncover the relationship that moved aligns with
sympathetic, and sorry aligns with sympathetic. You want to get from
sympathetic to touch. You couldn't do that in your original alignment.
Once you build this graph and start walking around on it, you can start
at sympathetic and go to moved and from there you can go to touched.
So the idea is that it creates a connection between words that maybe
were unaligned in your original alignment that you can now uncover by
virtue of the relationships that word has with other words in other
retellings.
So we build a graph using the alignments and posteriors generated by
the Berkeley Aligner. This is the way the graph works. You start at a
node, which is a retelling word. With some probability you move to a
word in another retelling. And with some other probability you walk to
a source word and you stop. And you do this a thousand times for every
retelling word. The destination source word you end up at most often is
your new alignment. So at the end of a thousand walks you have a
distribution over which source word you ended up on, and you pick the
most frequent one. That's your new alignment.
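Below is a rough sketch of the random walk just described: with some probability the walk jumps to a source word and stops, otherwise it moves to an aligned word in another retelling. Function names, edge formats, the lambda value, and the max_steps cutoff are illustrative assumptions rather than the talk's exact implementation.

```python
import random
from collections import Counter

def realign_by_random_walk(word, retell_edges, source_edges,
                           lam=0.5, walks=1000, max_steps=50):
    """Graph-based realignment sketch (names and parameters are illustrative).

    retell_edges[w] -> list of (word_in_another_retelling, weight) edges
    source_edges[w] -> list of (source_word, weight) edges
    Both are assumed to come from posterior-weighted Berkeley Aligner alignments.
    """
    def sample(weighted_edges):
        # Sample a neighbor proportionally to its edge weight.
        total = sum(w for _, w in weighted_edges)
        r = random.uniform(0, total)
        for node, w in weighted_edges:
            r -= w
            if r <= 0:
                return node
        return weighted_edges[-1][0]

    destinations = Counter()
    for _ in range(walks):
        current = word
        for _ in range(max_steps):
            # With probability lam, jump to a source word and stop this walk.
            if source_edges.get(current) and random.random() < lam:
                destinations[sample(source_edges[current])] += 1
                break
            # Otherwise move to an aligned word in another retelling.
            if not retell_edges.get(current):
                break
            current = sample(retell_edges[current])
    # The most frequently reached source word becomes the new alignment.
    return destinations.most_common(1)[0][0] if destinations else None
```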
You can tune the value of this lambda on those 26 ineligible
participants; they can serve as a dev set for this. I do this for the
alignments from both the small and large models.
We can see here that if you take the small model and apply the
graph-based method to the alignments it proposed, you actually get over
a four-point reduction. So it's a larger reduction than you get just by
moving to the large model, and keep in mind that the large model takes
12 to 24 hours. This graph-based model takes three minutes tops.
It's very fast. So you're getting the same benefit while requiring
very, very few computational resources. We also see that the
graph-based models outperformed their correspondingly sized models as
well.
So this bodes very well for using graph-based models. So now I want to
extract scores from those alignments. And I'll explain how I do it.
This is the narrative. The elements are labeled with letters of the
alphabet. The 25 elements. And this is a word alignment that the
Berkeley Aligner proposed. So what we do is we look at the word from
the retelling. We got rid of the function words because who cares
about them.
And we look at a word in retelling and it aligns to a source word, and
we see where does that source word appear in the original narrative.
Appears in element A. This person got element A. Taylor aligns to
Thompson, where does Thompson appear? In element B. Worked, employed,
element E and so on. That's how we get the score. We know for every
element did they get it or not.
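A minimal sketch of this scoring step: each retelling word's aligned source word is looked up in the story element it belongs to. The data structures and names are assumptions for illustration.

```python
def score_retelling(retelling_alignment, source_word_to_element, num_elements=25):
    """Turn a word alignment into per-element recall scores (sketch).

    retelling_alignment: list of (retelling_word, aligned_source_word) pairs,
    with function words already removed.
    source_word_to_element: maps each source word to the story element label
    (e.g. 'A'..'Y') it belongs to.
    Returns a dict: element label -> 0/1 recalled indicator.
    """
    elements = "ABCDEFGHIJKLMNOPQRSTUVWXY"[:num_elements]
    recalled = {e: 0 for e in elements}
    for _, source_word in retelling_alignment:
        element = source_word_to_element.get(source_word)
        if element is not None:
            recalled[element] = 1  # the subject produced a word aligned into this element
    return recalled

# The summary score (0-25) is just the number of recalled elements:
# summary = sum(score_retelling(alignment, mapping).values())
```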
So I evaluated the scores that you can extract from this and actually
under all the models the S measure is very, very high. In fact, I'm
giving you the Cohen's kappa. That's the measure of the interannotator
agreement. This is actually within the range of human interannotator
agreement on this task. So this is a computer performing as well as a
person on the task of scoring this test.
And in addition we see that the models with the lower alignment error
rate produce the higher scoring measures which we're glad to know. So
now we want to use these scores for diagnostic classification. So what
we do is extract scores from each subject's two retellings, and we use
a support vector machine to classify the subjects. We have two
feature sets. One feature set is just the summary score. So for each
retelling, zero to 25, how many elements did they get. This is the
score that's reported as part of the Wechsler memory scale.
The second feature set is the 50 element score. So for each retelling
there are 25 elements. So we create a vector of 50 scores each being 0
or 1 depending on whether they've got that element correct.
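A sketch of this classification setup with the two feature sets; scikit-learn, the linear kernel, and plain k-fold cross-validation are assumptions here (the talk only mentions a support vector machine and a pairwise validation scheme).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

def classify_subjects(elements_immediate, elements_delayed, labels):
    """Diagnostic classification sketch with the two feature sets.

    elements_*: arrays of shape (n_subjects, 25) with 0/1 element scores for the
    immediate and delayed retellings. labels: 1 for MCI, 0 for typical aging.
    Returns a dict of AUC values, one per feature set.
    """
    # Feature set 1: two summary scores (0-25 per retelling).
    summary = np.stack([elements_immediate.sum(axis=1),
                        elements_delayed.sum(axis=1)], axis=1)
    # Feature set 2: 50 per-element indicators (25 per retelling).
    per_element = np.hstack([elements_immediate, elements_delayed])

    results = {}
    for name, X in [("summary", summary), ("element", per_element)]:
        clf = SVC(kernel="linear", probability=True)
        # Cross-validated probability estimates, then area under the ROC curve.
        scores = cross_val_predict(clf, X, labels, cv=5,
                                   method="predict_proba")[:, 1]
        results[name] = roc_auc_score(labels, scores)
    return results
```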
And then we're going to evaluate this in terms of area under the
receiver operating characteristic curve using the pair validation. So
with AUC, .5 is chance and 1 is perfect. So anything greater than .5
is good, and the closer it is to one the better it is. So we can see
here that the summary score feature -- the summary score is what's
normally reported -- does pretty well. It gets closest to about .75
for most of them. But actually the element score features are much,
much better.
And they're getting very high classification accuracy. And another
thing to note is that the clinical dementia rating has a reliability
at about that level.
So you could sort of say that this technique is working as well as
humans are at actually distinguishing mild cognitive impairment. And
again we see the large models and the graph-based models are stronger
than the ones that are small and not graph-based.
So can I finish? It says stop. All right. Well, you're on your own
for the summary, people. So the methods outlined here show potential
as a screening tool for neurological disorders, because it's not just
this test that's widely used. There's another test called the NEPSY
narrative memory test that's used for kids, and there are description
tasks, so it's easy to adapt for other scenarios.
And the other thing that was good was the graph-based methods yielded
large alignment error rate improvements without requiring the extensive
computational resources that scaling up to a really large model would.
So, other improvements: the graph-based model right now does just
one-to-one alignment, because I just pick the single most frequent
destination node over the distribution. But there are plenty of
one-to-many alignments in our data. That's something I would like
to look into. I'd also like to look into using undirected links and
allowing links out of the source, where I feel you could be exploring
the graph a little bit more than I am. I'd like to apply the technique
to other tasks like I said the NEPSY narrative memory task and cookie
theft picture description. You look at a picture of a kid stealing
cookies from a cookie jar and have to describe it. I'm working right
now in incorporating speech recognition into the pipeline with Miter,
who is going to be talking about something similar in her poster, and I
also want to try to apply the graph-based method to multi-lingual word
alignment. [inaudible] has lots and lots of languages, seems like
something like this might be able to be used to improve word alignment
in some way. So that's all. [applause].
>>: Questions?
>> Darlene Wong: Hi. Hello.
>>: Hi. Sorry so what is the accuracy of the alignments if you just
map same word to same word?
>> Emily Prud'hommeaux: It's like 40. Maybe 30 or 40. So it
should -- the thing is like it should be good because the probability
that a word aligns to that identical word is quite high. It's like
60 percent. But the problem is that there are multiple instances of
words.
So you have to decide which word you're going to align it to. And I
think that's where the inaccuracy comes from.
>>: You talked about kids. Were you talking specifically about the
Wechsler test for kids or other tests for kids and what's the
neurological impairment, because I'm assuming it's not dementia at that
point.
>> Emily Prud'hommeaux: Very early onset dementia, it's a really
terrible problem. So the test specifically I was talking about was the
NEPSY narrative memory test. The NEPSY is like a huge battery of
things that test not just memory but language skills and executive
function and all of those different things. And actually I've already
applied these methods to that and the results -- they're not quite as
compelling but we don't have as large a dataset but they're pretty
good.
But also, if you're familiar with autism at all, there's an instrument
called the Autism Diagnostic Observation Schedule, which is a series of
semi-structured activities. One is called the wordless picture book,
where the kid and the examiner together narrate a picture book. And
this is a technique that could probably be applied to something like
that as well. The neurological impairments we're interested in are
autism and language impairment. Performance on the NEPSY is tied to
language impairment.
>>: Hi. One more question. So were you doing the alignment on the
entire, like -- so it wasn't like sentence-based and then match?
>> Emily Prud'hommeaux: So it wasn't sentence-based. It was the full
retelling to the full source narrative. Which is weird. That's not
what you would do in machine translation. In machine translation you
would have the sentences aligned, but because we don't know in advance
which parts of the story the adults are going to remember, you
can't do -- you'd have to do a sentence alignment first. And the
different story elements might appear in one sentence in the retelling
that appeared in two different sentences in the source. So we just put
it all together.
>>: Because from the point of view of a source of error, the IBM
models are kind of designed to work on sentences, and I'm kind of
imagining that you're doing an alignment with much longer dependencies.
>> Emily Prud'hommeaux: Yes, that is almost certainly why, one of the
many reasons why the alignment error rates are much higher than they
would be for machine translation.
>>: Thank you very much.
[applause]
>> Congle Zhang: Hello, everyone. My name is Congle. And this work
is done together with Raphael Hoffmann and Dan Weld. Dan is my advisor,
and I'm very glad to come here to talk with you about this work.
So as you may know, relation extraction is a very important part of
natural language processing and artificial intelligence. I hope this
simple example can help you understand the goal of relation
extraction.
Suppose we have a raw sentence, like "Our captain Jenali Jenkins is a
phenomenal athlete," said the Gators' head coach Urban Meyer.
So after a human being reads this sentence, he can get some interesting
facts, like that Jenali Jenkins plays for the Gators football team and
that Urban Meyer is a college football coach. So the question is, can
the machines do the same thing as human beings: get these facts and
input them into the machines?
And so, formally, relation extraction is a problem where we have an
ontology with a set of relations with type signatures. For example, an
athlete is playing for a team: we have the type signature athlete and
team for plays for team. So we are interested in the facts of this
relation. So we want an extractor: the input of this extractor is some
raw sentences, and the output of the extractor is some tuples
satisfying the relations you are interested in.
So suppose the relation instance you extracted from the raw sentence,
for example Jenali Jenkins and Urban Meyer, does not exist in your
knowledge base. You can add this instance back to your knowledge base,
add this instance back to your ontology, and your ontology becomes
larger, better and more useful. So that's the goal of relation
extraction, and that's the first step to build a better knowledge base
to use for other tasks.
Okay. At first glance you may think that supervised learning is the
best way to do this task. So in order to build a classifier for the
extractor, what you need is a lot of training sentences. For each
training sentence you need to figure out the entities in the sentence
and their relationships.
For example, Ray Allen and Doc Rivers satisfy the relationship coached
by, so you label it as a positive example. And YouTube and Google --
you know that Google acquired YouTube, actually, but this sentence
about the YouTube API and Google Code says nothing about the acquiring
relationship, so you have to label this as a negative example.
Okay. Supervised learning is good, but what's the problem? The problem
is it cannot scale easily. For example, the ACE 2005 dataset contains
only 1500 articles. So the reason supervised learning is hard -- the
problem is that it is not only hard but it's almost impossible, because
the positive data is very skewed. Most sentences actually do not
contain any interesting relationships in your ontology, in your small
set of relations. So here is the dataset: the ratio of positive
sentences is actually less than 2 percent for the top 50 relations in
Freebase, which means that if you ask a human labeler to label your
sentences, they will meet one positive sentence after 50 negative ones.
So I don't think many people have the patience to label enough data
for this task.
So researchers want to avoid this kind of sentence labeling, so they
propose weak supervision to leverage relation instances.
The idea is that, okay, it's hard to get labeled sentences, but it's
easier to get labeled relation instances. For example, we know that the
Gators and Urban Meyer satisfy the relation coached by. And we know
Google and YouTube satisfy the relation acquired. And it's easy to get
a list of unlabeled sentences from whatever source you can imagine.
The clever part of weak supervision is you heuristically generate the
training examples by matching the instances against the unlabeled
sentences and then returning all sentences that contain a pair of
entities as training examples for your classifier, for your extractor.
Of course, it brings in some noise to your extractor, but since the
number of unlabeled sentences is so huge, you can do a lot of machine
learning on this kind of interesting data.
Okay. Life is so good, until we ask a question: what if the training
instances here are also few? Previous work doing weak supervision
tried to avoid this problem by only looking at relations that already
exist in the database, which means they can get large amounts of
training instances almost for free.
But what if you want to define a relation by yourself, for some bio
task, for some question answering task, whatever you want? The
relation may be very specific. You only have some examples for it.
You don't know which database to look at. You don't know where to
look.
So we want to solve this problem. Our motivation is that there are
some very large background ontologies on the Web; you can leverage
these background ontologies. They contain like millions of entities
and thousands of relations.
So what if we can build a connection between our target ontology, our
target relations, and this background ontology? It's very likely that
among these millions of entities a lot of entity pairs satisfy my
target relations. If I can dig them out and use them as training
instances, I can significantly increase the number of training
instances for my task and I can do weak supervision very well.
So that's the idea. We call it ontological smoothing. So our goal is:
I have a relation and want training instances, and I generate the
training instances from the background ontology. The method is we do
ontology mapping from my target relation to the background knowledge
base, to the background ontology.
Here is the overview of our system. We call it velvet. The first step
is to build a mapping from the background ontology to the target
ontology. The second step is to generate new training instances and
training sentences from the mapping and from the unlabeled sentences
and the third step is to train relation extractor with this new data.
Okay. So what's the challenge in doing this job? There are two major
challenges for our ontological smoothing idea. The first challenge is
that there might not be an explicit mapping. You may ask, can I use a
very naive way: pick a matching relation in the background knowledge
and return all its instances as the training instances I need? The
problem is that usually the good mapping is not explicit in your
background knowledge. You need some database operators like join, like
union, like selection
to get what you want. Here is a simple example. In your target
ontology you have a relation, coached by, and it has an example: Kobe
Bryant and Mike Brown. In your background knowledge, there is no direct
connection between Kobe Bryant and Mike Brown, but what you do know is
that Kobe Bryant plays for the team LA Lakers and the coach of the team
is Mike Brown. For human beings, we can see that if we join plays for
team and team coach, we can get a lot of tuples.
If I take the first argument and the third argument of these tuples, I
get a lot of good training instances for the relation coached by.
That's what I want.
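A small sketch of the kind of join view just described, composing two background relations into candidate instances for the target relation; the relation names and data structures are illustrative assumptions.

```python
def join_views(plays_for_team, team_coach):
    """Compose background relations into a view for 'coached by' (sketch).

    plays_for_team: set of (athlete, team) tuples
    team_coach:     set of (team, coach) tuples
    Returns (athlete, coach) tuples obtained by joining on the shared team argument.
    """
    coach_of = {}
    for team, coach in team_coach:
        coach_of.setdefault(team, set()).add(coach)
    coached_by = {(athlete, coach)
                  for athlete, team in plays_for_team
                  for coach in coach_of.get(team, ())}
    return coached_by

# A union handles the same relation split across domains, for example:
# coached_by = join_views(plays_for, head_coach) | join_views(plays_for, manager)
```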
But how can a machine do that? Besides join, union is also important,
because the same relation can be split across different domains in your
background knowledge, and they may even have different names, so it's
very hard to see their relationship at first glance. For example, the
team coaches are named head coach in the basketball domain and manager
in the baseball domain. What you want is to put them together in our
system.
So the output of our system is a view, a database view with the
operators join, union and selection. We build mappings from the target
relation to a view over the background knowledge base.
The second challenge is that we do need to do the entity, type and
relation mapping as a whole, jointly. The reason is that entities are
very ambiguous in the background knowledge. For example, you can see
that there are many Mike Browns in the background knowledge: one is a
basketball coach, one is a football player, another is a politician.
So without context, how could you know which one is the guy you are
talking about, your target relation? But under the condition that the
coached by relation is mapped from plays for team join team coach, you
have big confidence that the basketball coach Mike Brown is the guy
you're looking for.
Okay. So we handle these difficulties, these challenges, in our system
by breaking the problem down into two steps. The first step is to
generate mapping candidates, and the second step is to choose among the
mapping candidates by joint inference.
So we first put the background knowledge into a graph, where each node
is an entity in the background knowledge and each edge is a relation in
the knowledge.
So we look for the instance pairs of our target relation. We return
the paths between them as the relation mapping candidates. And then we
return the types of those nodes as the type mapping candidates.
Such a method sometimes has a problem: it's noisy. You can see that
Kobe is born in the U.S. and Mike's nationality is also U.S.A. This is
a path between the two arguments, and you'll return this noisy path as
a candidate. So we need to assign likelihoods to these mapping
candidates in order to get the correct ones.
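A rough sketch of this candidate-generation step: a bounded search for relation paths connecting a seed pair in the background knowledge graph. The graph format and the length cutoff are assumptions.

```python
from collections import deque

def relation_path_candidates(graph, head, tail, max_length=2):
    """Generate relation-mapping candidates as short paths between a seed pair.

    graph[entity] -> list of (relation_name, neighbor_entity) edges (assumed format).
    Returns a set of relation-name paths, e.g. ('plays_for_team', 'team_coach'),
    which may include noisy paths such as ('born_in', 'nationality').
    """
    candidates = set()
    queue = deque([(head, ())])
    while queue:
        node, path = queue.popleft()
        if len(path) >= max_length:
            continue
        for relation, neighbor in graph.get(node, ()):
            new_path = path + (relation,)
            if neighbor == tail:
                candidates.add(new_path)  # a path connecting the two seed arguments
            queue.append((neighbor, new_path))
    return candidates
```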
So what we do is use a Markov logic network model, which is good for
doing joint inference with evidence written in first-order logic. The
probability of an event in Markov logic is given by the number of
satisfied rules times their weights. In our work we have three kinds of
predicates: one for entity mappings, one for relation mappings and
another for the type mappings. We write these predicates together in
first-order logic rules. Actually they are features for our
observations, and we conduct MAP inference to get truth values for the
predicates; a predicate being true means the mapping is good.
Okay. To do the MAP inference we simply cast the problem as an integer
linear program and do LP relaxation and rounding to get the result,
which is a quite standard method from the textbook. So you can check
it. Okay. So here we are.
Where we are is that we have the ontological mapping. We have a lot of
new training instances and examples. What we need is a relation
extractor, a weakly supervised method to do the relation extraction.
We use MultiR in this project. It was developed by Raphael Hoffmann in
our group. To the best of my knowledge it might be the best
off-the-shelf relation extractor you can find now. And it scales very
well to datasets of like millions of examples. So it's a useful tool.
Okay. So here's our experiment. We compare our system Velvet against
three baselines: by taking the joint inference away, by taking the
complex mappings away, and by taking the ontological smoothing away
from the system. We do the experiment on two target ontologies, NELL
and IC. NELL has 43 relations; IC has nine relations. We use Freebase,
which contains like 100 million facts, as our background ontology.
We do the experiment on unlabeled sentences from the New York Times,
which contains almost millions of articles and like 50 million
sentences.
Here is our performance. You can see that without smoothing, it's not
surprising that the performance is very bad, because the system is
trained with only like a dozen training examples. By putting complex
mapping and ontological smoothing into the system, the performance is
improved significantly, and Velvet is the best of all. The last slide
showed the figures averaged by instances. We can also average by
relations.
The result in the last slide may be a little too optimistic, because
big relations are easier to do than small relations. This is averaged
by relations. We can also see that Velvet is much better than the
baselines.
We also compare our method to the current state-of-the-art supervised
extractors, kernel-based ones, and we use two state-of-the-art
supervised approaches as the comparison. But in our experiment our
system uses very few training instances, only ten instances per
relation, and there's no sentence annotation. It's surprising that we
can achieve comparable results to the state-of-the-art supervised
methods. It's because we use a large number of unlabeled sentences.
They are trained on like a thousand sample sentences; we are using like
millions of unlabeled sentences. That's why we can get comparable
performance.
Okay. We also evaluated the performance of the ontology mapping
itself. We manually labeled some mapping results, and we achieve like
88 percent accuracy on relation mapping and 93 percent on entity
mapping. The entity mapping result is like 5 percent better than the
baseline, which uses the Freebase internal search API to get entity
mappings and is about 88 percent.
Okay. So our system -- so we noticed that previous weak supervision
doesn't work very well if you have very few training instances. Our
solution is to use background knowledge, a background ontology, to
generate an ontology mapping that brings you a large number of new
instances. It can enhance relation extraction performance very well.
Well, here's some future work. For example, we're planning to bring in
multiple ontologies, to do mapping to multiple ontologies. We're
interested in not only binary relations but [inaudible] relations. For
weak supervision there's a lot of space to improve: the data is very
skewed, and the feature space is extremely high-dimensional, which
sometimes makes the performance quite hard to improve. So there is some
future work to do there.
There it is.
[applause]
>>: Thank you very much.
We have some time for questions.
>>: I wanted to ask you a question about N-ary relations, since you
put it as future work. But I was also thinking, do you really need
N-ary relations, because maybe you can do everything with binary
relations?
So where do you see the benefit of dealing with N-ary relations?
>> Congle Zhang: So I think for N-ary relations, you can sometimes
break the N-ary relation into several binary relations around some
major entity. That's the simple way to do N-ary relations. But I'm not
sure if it's good for all situations. Maybe there are some cases where
it will not work very well. For example, if many of the arguments are
dates or numbers, I'm not sure it will work, because these arguments
are related to each other. If you treat each of them as binary there
may not be as much linguistic evidence to get them. I don't know.
Maybe that's the case.
>>: I might have missed this, but did you evaluate the precision of
the relations you're generating from the smoothing, or are they sort of
100 percent correct? So you have some seed relations and then you get
more, or some seed examples and you get more from Freebase -- are those
like 100 percent --
>> Congle Zhang: I know what you mean -- what's the precision of the
examples? I didn't label that. Yeah, this is a good question. We only
evaluate that by using them to train the extractor. But I didn't
sample the instances to see the precision. Good question.
>>: Makes sense.
>> Congle Zhang: We didn't do that.
>>: All right. Thank you very much. We'll move on to our next --
[applause]
-- we'll move on to our last oral presentation. Next, Max Whitney.
>> Max Whitney: Okay. So I'm Max, and this is work with my supervisor
Anoop Sarkar. The topic of our paper is bootstrapping, which is the
case of semi-supervised learning where there's a single domain and a
small amount of labeled data or seed rules.
Okay. So bootstrapping. In particular we're looking at the Yarowsky
algorithm, which is a simple self-training algorithm with decision
lists. So we start with some seed decision list, we label the data, we
train a new decision list, and repeat. And a decision list looks like
this. The running example here is word sense disambiguation for the
word sentence. So one sense is a piece of text and one sense is a
punishment.
So here we have two seed rules. A rule is a score and a feature and a
label, a sense. And the decision list works just by choosing the
highest ranked rule, which had the feature matching the example.
So here when we apply the seed rules to the data, the first two
examples in the data get labeled, because they have a feature that
matches a rule. And the third one can't be, because it has no feature
that matches.
So that's the first two steps. And the third step we train a new
decision list. And the scores are coming from statistics over the
currently labeled data. It's basically co-occurrence with the previous
rules.
And you can see now we have more rules. We can label more of the data.
But we have a threshold. We only take good enough rules based on the
score.
So we're never guaranteed to be able to label all the data, just
whatever we have features for. And we repeat. At the end of all of
this we'll drop the threshold, make a decision list with no threshold,
and then we can label all the data. So we do that for testing.
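Below is a minimal sketch of the Yarowsky-style self-training loop just described, with a crude stand-in for cautiousness; the threshold, smoothing, and tie-breaking details are assumptions and differ from the exact specification used in the paper.

```python
from collections import defaultdict

def yarowsky(examples, seed_rules, iterations=20, threshold=0.95, cautious_k=None):
    """Yarowsky-style self-training with a decision list (sketch).

    examples: list of feature sets (each example is a set of features).
    seed_rules: dict feature -> (label, score); returned rules use the same format.
    cautious_k: if set, keep only the (iteration+1)*k best new rules per label,
    a rough stand-in for the cautious variant.
    """
    rules = dict(seed_rules)
    for it in range(iterations):
        # Step 1: label the data with the current decision list, i.e. the
        # highest-scoring rule whose feature appears in the example.
        labels = []
        for feats in examples:
            matches = [(score, label) for f in feats if f in rules
                       for label, score in [rules[f]]]
            labels.append(max(matches)[1] if matches else None)

        # Step 2: retrain the decision list from co-occurrence counts
        # over the currently labeled data.
        counts = defaultdict(lambda: defaultdict(float))
        for feats, label in zip(examples, labels):
            if label is not None:
                for f in feats:
                    counts[f][label] += 1.0
        new_rules = {}
        for f, by_label in counts.items():
            label, c = max(by_label.items(), key=lambda kv: kv[1])
            score = c / sum(by_label.values())  # would be smoothed in practice
            if score >= threshold:              # only keep good-enough rules
                new_rules[f] = (label, score)

        # Cautiousness: grow the list by only a few rules per label per iteration.
        if cautious_k is not None:
            kept = {}
            for lbl in {lab for lab, _ in new_rules.values()}:
                ranked = sorted((f for f in new_rules if new_rules[f][0] == lbl),
                                key=lambda f: -new_rules[f][1])
                kept.update({f: new_rules[f] for f in ranked[:(it + 1) * cautious_k]})
            new_rules = kept

        rules = {**new_rules, **dict(seed_rules)}  # seed rules always stay in the list
    return rules
```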
This is sort of similar to EM, or hard EM, at least. But the
difference is that we're training this decision list model, which
contains type-level information. Whereas with EM, you have expected
counts over the actual instances, at the token level.
So to examine the behavior of this in more detail, we're going to
continue with the same example. Now the current decision list is on
that side, and we're just showing the two senses, as the colors. And
the currently labeled training data is going to be down there, also
labeled by color.
So in this graph, the left bar is going to represent the decision list
and the right bar is going to represent the currently labeled training
data. And up here we're going to show accuracy. The current test
accuracy with the decision list.
So as we proceed, the decision list, that's the left bar, is growing
very rapidly. And we've now labeled, I believe, all of the training
data.
And you can see both are kind of skewed towards the blue label,
whichever sense that is. And now that we've labeled all the data, the
accuracy kind of plateaus. And we converge there, about 60 percent
accuracy.
So the next variation is Yarowsky-cautious from Collins and Singer. I
didn't say that we're using Collins and Singer for our particular
specification of Yarowsky and Yarowsky-cautious.
And here cautious, when we make a decision list, we take the top five
rules by score, and each iteration we take five more. The decision
list is now going to grow linearly. And you can see that whereas in the
previous example the axis went up to 1800, now it's only up to 400
because we're controlling the decision list.
So as we run it, the decision list, that's the left bar, is growing
linearly, and it's balanced between the two labels now. And the
coverage and the accuracy are going to grow more gradually, and we
converge at a higher accuracy. So now this is our noncautious and this
is our cautious.
Both will change when we do a retraining step and we drop the threshold
and we drop cautiousness and train big decision lists but usually we
see this behavior that cautious avoids getting stuck at lower accuracy.
So the reason we're interested in Yarowsky is that it seems to do
pretty well. The first four here are our own reimplementation of
Collins and Singer. And the last is a version from Abney, which we'll
talk about later. And you can see here that the co-training algorithm
and Yarowsky-cautious are the best of that set.
And Collins and Singer also have an algorithm called co-boosting, which
does comparably in their results. We haven't tested it.
And you can also see that cautiousness matters. Yarowsky-cautious and
co-training, which does something similar, are the only ones that get
up to about 90 percent.
So cautious is important. And the reason we're interested in Yarowsky
over the other algorithms is that co-training and co-boosting require
two views, and the views are supposed to be independent. So it's nice
to be able to drop that limitation and just use self-training.
So that was the upside. The downside is we don't have very good
theoretical analysis for Yarowsky; there are no proven bounds. Abney in
this paper addressed that, but only for certain variants, and those
variants are not cautious. And he didn't give empirical results. We
don't think they would do as well without cautiousness added.
Haffari and Sarkar extend this analysis. And they use a bipartite
graph representation that we'll see in just a minute and we do have
empirical results but it's not cautious and therefore it does not do as
well as what we just saw.
So here's the analysis. We're going to look at two types of
distributions. These are the parameters: a distribution for each
feature over the labels. And that's the labeling distribution: a
distribution for each example over the possible labels.
The labels are the senses in the examples we saw. So the labeling
distribution is uniform when an example is currently unlabeled;
otherwise it's a point distribution with mass on the label.
And the parameters are just the decision list scores, except we've
normalized them to be a distribution. So a decision list makes its
choice by just taking the maximum scoring feature from an example.
And Abney introduces this alternate form where we take a sum instead.
It's not quite a decision list. But it is easier to analyze.
Okay. So switching topics slightly. Subramanya, et al. have a more
recent algorithm which they use on a different task. It's
part-of-speech tagging, and it's a domain adaptation setting. So it's a
slightly different task and a slightly different algorithm. So we're
not concerned with the details of their algorithm, but with a couple of
interesting properties.
And this is self-training with a CRF. So you can see it has the same
two steps, relabel data and train, but they've added extra steps: one
to get type-level information and one to do graph propagation on top of
that.
The two things we're interested in here are, first, the overall
structure of adding these steps, and second, the particular graph
propagation they use.
So our own contributions for this paper: number one, we have a Yarowsky
variant with a propagation objective, which we'll see in a minute. This
algorithm is cautious, and we can show that it performs well. Unlike
the previous well-analyzed Yarowsky algorithms, we show that it can do
as well as Yarowsky-cautious.
Second, we've unified all these different approaches, the different
algorithms. And third, we give more evidence that cautiousness is
actually important. So going back to the graph propagation, this is an
objective from Haffari and Sarkar for one of Abney's Yarowsky
algorithms.
It's an upper bound on the algorithm. And this is the objective for
the graph propagation of Subramanya. It's not the objective for the
whole algorithm we just saw, but just for the graph propagation.
And you can see in the first equation, this is the labeling
distribution, and that's the parameters again. So if we compare these,
the first term of each is going to be a distance. If we plug those
distributions into this, it's going to be the distance between the
parameters, the decision list scores, and the current labeling.
And the second term is a Laplacian regularizer, so they're quite
similar. So if we do plug those distributions into this, then we can
directly optimize that model, the bipartite graph model.
Alternatively, we can set aside that motivation from Haffari and
Sarkar and just do graph propagation over the thetas, the parameters,
where we take co-occurrence in an example to be adjacency.
So it's not as well motivated, but it sort of corresponds to what
Subramanya, et al. do.
So here's our own algorithm. It's like Yarowsky, but it has this extra
step. We train a decision list exactly the way we saw, with
cautiousness. And second, we do propagation over that.
So we just take the parameters, we make one of those two graphs, and we
propagate over it. And Subramanya, et al. give updates to do that
propagation.
So we can do it either on the bipartite or unipartite graphs we saw, or
a couple more given in the paper. If we use the bipartite one, we only
take the parameters at the end, the decision list part of the model.
And so because we're doing this step, we know that at each iteration
we're optimizing the objective that we saw. And we can do cautiousness
simply by copying the decisions of this decision list, the original
decision list. So this one determines which examples we'll label and
this one determines what the labels actually are.
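Here is a rough sketch of the propagation step over the unipartite feature graph: each feature's label distribution is smoothed toward its co-occurrence neighbors and the uniform distribution. This is a simple iterative update standing in for the exact Subramanya et al. updates, which are not reproduced here; parameter names and values are illustrative assumptions.

```python
import numpy as np

def propagate_theta(theta, adjacency, mu=0.5, uniform_weight=0.01, iterations=30):
    """Smooth decision-list parameters over a feature co-occurrence graph (sketch).

    theta: array (n_features, n_labels), each row a distribution over labels.
    adjacency: array (n_features, n_features) of co-occurrence edge weights.
    A Jacobi-style update pulls each row toward its weighted neighbors and the
    uniform distribution, then renormalizes.
    """
    n_features, n_labels = theta.shape
    uniform = np.full(n_labels, 1.0 / n_labels)
    q = theta.copy()
    for _ in range(iterations):
        neighbor_sum = adjacency @ q                       # weighted neighbor mass
        degree = adjacency.sum(axis=1, keepdims=True)
        q = (theta + mu * neighbor_sum + uniform_weight * uniform) / (
            1.0 + mu * degree + uniform_weight)
        q = q / q.sum(axis=1, keepdims=True)               # renormalize rows
    return q
```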
So to look at the behavior of this algorithm, we'll do the same
visualization. And again the left bar is going to be the decision list
and the right the labeled data. So you can see that, like before,
cautiousness is forcing the decision list to grow linearly and stay
balanced between the two labels.
And now, because of the propagation, we can label more examples sooner.
And the accuracy is increasing sooner. And comparing to what we just
saw for Yarowsky-cautious, we're actually doing a bit better in this
case. Again, they'll both change when we do the retraining of a
decision list, but in terms of behavior, we increase coverage and
accuracy quicker.
We can also look at what happens to the objective. This is the
objective from the Subramanya graph propagation, and here we've
disabled cautiousness, because cautiousness changes the input to the
graph, which changes the objective a lot. On the bipartite graph, this
is the objective globally, and it's decreasing, and it kind of levels
off as the accuracy and the coverage level off.
So we have two sets of experiments. The first follows the running
example. This is Eisner and Karakos's word sense disambiguation data
from the Canadian Hansards: three words, each of which is English and
not French, with two senses and two seed rules each. The features are
the words adjacent to the word we're looking at and some nearby context
words, and we use the [inaudible] raw forms of each.
So this group of algorithms is the ones we've seen. This is Haffari
and Sarkar's algorithm based on the bipartite graph, and that's a
different kind of graph propagation. And you can see it doesn't do
well. And these are our algorithms.
So in this case, for the cautious forms of our algorithms, this is the
bipartite one and this is the unipartite one. You can see the
unipartite one is doing pretty close to the cautious co-training,
beating Yarowsky-cautious, and the bipartite one is doing pretty well,
too.
The data here is a little bit strange; the data sizes are small. The
second task is the named entity classification task from Collins and
Singer. In this case we have three labels: person, location and
organization. There are seven seed rules, with some for each label, and
the features are spelling features, which come from the phrase we're
classifying, and context features, which are extracted from a parse
tree of the sentence: particular words nearby in the tree and the
relative position in the tree.
And, again, these are the algorithms we've seen, this one, and our
algorithms. And again the theta-only unipartite one is coming out
quite well, and the bipartite one is not doing quite as well.
In this case, our algorithm is at the top. But we're not really trying
to show that the algorithm is beating Yarowsky-cautious or the
co-training here; we're just trying to show that it's coming out
equivalent in accuracy. But as we said, it doesn't have the
disadvantage of requiring two views.
What I didn't say is that we're reporting non-seeded accuracy here.
That means we take accuracy only over the examples that the seed rules
did not label, so the idea is to measure improvement over what we were
given.
And that's it. We have software online if you want to see. Thank you.
[applause]
>>: Thank you very much. Sorry about that. So we have time for a few
questions. Wait for the mic to come on.
>>: Great talk. I wasn't able to figure out if the algorithms allow,
once you label an example, for that example to later escape from the
labeling and go back to being unlabeled.
>> Max Whitney: Yeah, the steps are --
>>: I saw the labels are always increasing.
>> Max Whitney: Yes, but we relabel at every iteration, so a label can
become unlabeled. It doesn't happen very often, but it is possible.
>>: Can you elaborate on the problem of requiring the two views, how
bad it is in practice?
>> Max Whitney: I don't know how bad it is in practice. But my
understanding is that the theoretical requirements for those algorithms
are that the two views are statistically independent. And so you only
get the theoretical properties if your features have that property,
which is quite unlikely.
And so the idea is that we can get the same performance with a much
simpler algorithm without having to have that property.
>>: I mean, the performance is on a specific dataset that doesn't have
those properties, I guess.
>> Max Whitney: Well, on the datasets we've seen it's doing comparably
to the co-training algorithms. So we don't know that it's better but
it's doing comparably and we don't have to have the requirement on the
views.
>>: Also, we had to try many different ways to split the features for
the word sense data, and we picked the best one. We had to do a
search; there was no natural split -- nearby and far away is not
necessarily a natural split of features. So it's better -- the machine
learning technology should not have to think about it. You plug it in
and it should work. So why do this extra work if you don't have to?
So never do co-training and always do Yarowsky. [laughter]
>>: Okay. Then that's going to conclude our second oral session. [applause]