>> Michael Gamon: Thanks, thanks for coming. I think we should get started,
even though we're only five minutes late. But we have a little bit of a different
format today. We have three talks. And they're all joint projects by University of
Washington interns and Microsoft Research mentors. So the interns are going to
present. And it's pretty packed. I mean we're going to try to stick to 20 minutes
each talk with five minutes questions, which should still give us a little bit of time
afterwards. But I mean I know what these guys have been working on, and
there's plenty more to talk about than just 20 minutes. So we'll see how we can
stick with the timeframe.
For those of you who don't know the event, this is like the 19th in a series. So
the next one is the big one, the 20th anniversary. And it's basically just a place to
connect research from the UW with research from MSR and sort of you know
make personal connections, meet people and sort of socialize a little bit and keep
the connection open between one side of the lake and the other.
Normally we have two talks, one from each UW and MSR, but again, you know,
this time we're experimenting a little bit, and depending on how it goes, you
know, if anybody has suggestions about, you know, different kinds of formats that
we could use, that would be also very welcome.
So for the three talks, actually only one of the mentors could make it; the other
two are not here today. One is actually in Ottawa -- Colin Cherry accepted a job
at the National Research Council up there. So he's not in our group anymore.
The three interns are Stanley Kok, Hoifung Poon and Alan Ritter back there. The three
talks -- so Stanley is going to start with hitting the right paraphrases in good time.
It's joint work with Chris Brockett.
Then the second one is a really nice title, toward the Turing test, conversation
modeling using Twitter. That's Alan's talk and that's joint work with Colin Cherry.
And then Hoifung is going to talk about joint inference for knowledge extraction
from biomedical literature, and that was joint work with Lucy.
So again, you know, we'll try and stick to 20 minutes and then five minutes
questions.
>> Stanley Kok: I'll try my best. Hi, I'm Stanley Kok, from the University of
Washington. My talk is on hitting the right paraphrases in good time. So this is
the overview of my presentation. I'll begin with some motivation then I'll cover
some background, then I will go into detail about our system, hitting time
paraphraser. Then I'll describe some experiments and finally I'll end with some
future work.
So the goal in this project is to build a paraphrase system, a system that takes as
input a query phrase such as is on good terms with, and outputs a list of
paraphrases such as is friendly with, is a friend of.
Several applications can benefit from a paraphrase system, such as query
expansion, document summarization, natural language generation and so on.
So in the application of query expansion, a search engine could receive a query
phrase from a user. The search engine provides the query phrase to a
paraphrase system and obtains a list of paraphrases, which are then returned to the
search engine. The search engine could then retrieve documents that are
relevant not only to the original phrase but also to the paraphrases,
thereby improving the quality of its results.
I'd like to point out that our system, as well as the systems that we compare
against, uses an additional resource in the form of bilingual parallel corpora.
So what's a bilingual parallel corpus? In such a corpus we have sentences from
two languages; over here the sentences are in English and German. These sentences
are aligned, and the phrases in the sentences are also aligned.
For example, under control is aligned with unter kontrolle. I'm not sure I'm
pronouncing the German right, so just bear with me. And over here in check is
also aligned with unter kontrolle. Now, from these sentences, from these
phrases, we can count the number of times that a phrase occurs as well as the
number of times an English phrase co-occurs with -- is aligned with -- a German phrase.
This allows us to obtain phrase tables. So for example, if under control appears
four times in this corpus and three of the four times under control is aligned with
unter kontrolle, then the probability of the German phrase given the English phrase
would be three quarters, or .75.
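As a worked version of that count (the notation here is mine, not from the slide), the phrase-table probability is just a relative frequency:

    P(g \mid e) \;=\; \frac{\mathrm{count}(e, g)}{\mathrm{count}(e)}
               \;=\; \frac{3}{4} \;=\; 0.75
    \qquad e = \text{under control},\ g = \text{unter kontrolle}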
Now, these phrase tables are used by our system as well as the systems that we
compare against. So in 2005, Bannard and Callison-Burch introduced the BCB
system -- I named the system after the authors. This is a paraphrase system and
it works as follows: it computes the probability that E2 is a paraphrase of E1 by
summing, over all foreign phrases G (German, say), the product of the probability
of E2 given G and the probability of G given E1. A straightforward
approach. And if you have multiple corpora, you simply sum over all the corpora.
So in 2008 Callison-Burch improved upon this system by cleverly introducing
syntax. If you look at the two equations used by the systems, you can see
that they are very similar. What SBP does is introduce syntax here: it
constrains the paraphrase E2 to have the same syntax as the original phrase
E1. And this syntax information is obtained from parse trees. The interesting
thing to note is that you not only use the leaves of the parse trees, you also
use the subtrees inside the parse tree.
Now, this is a very general, high-level description of SBP. For details I refer you
to the paper.
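For reference, here are the two equations roughly as described; the notation is mine, and the exact formulations are in the Bannard and Callison-Burch 2005 and Callison-Burch 2008 papers:

    % BCB: pivot through all foreign phrases g (and sum over corpora if there are several)
    P(e_2 \mid e_1) \;=\; \sum_{g} P(e_2 \mid g)\, P(g \mid e_1)

    % SBP: additionally condition both terms on the syntactic type s(e_1) of the original phrase
    P\bigl(e_2 \mid e_1, s(e_1)\bigr) \;=\; \sum_{g} P\bigl(e_2 \mid g, s(e_1)\bigr)\, P\bigl(g \mid e_1, s(e_1)\bigr)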
Now, we can take a graphical view of these two approaches by considering the
phrases as nodes. So here I have all the English phrases, German phrases,
French phrases. And an edge exists between two nodes if there's a
corresponding entry in the phrase table.
Now, once we take the graphical view, we can see that what the SBP approach is
actually doing is summing up the probabilities along paths of length two. So it's
summing up the probabilities on this path and this path to
get the probability of E2 being a paraphrase of E1.
Now, once we see things graphically, we can think of several ways to
improve upon this system. First, there's no reason why we should restrict
ourselves to paths of length two. If you consider paths of length four, E3
over here could very well be a good paraphrase of E1 by tracing this path of
length four, and E3 could be a good paraphrase of E1 if the probabilities
along this path are fairly high.
Second, note that this graph is a bipartite graph with English nodes on one side
and foreign language nodes on the other. There's no reason for us to restrict ourselves
to a bipartite graph; we can use general graphs.
Lastly, we need not restrict nodes to represent phrases. In a graph, nodes
can represent anything. Specifically, we could have nodes that represent
domain knowledge.
So our system leverages these three points to improve upon the state of the art.
And it does so using the notion of random walks and hitting times. So let me quickly
cover the background on random walks and hitting times. Now, the term random
walk is self-explanatory: it's just a walk that traverses a graph randomly. I'm
just going to illustrate this using this simple graph over here.
Say we start at node one. Next we randomly pick a neighbor of node one
according to the transition probabilities. So there's a probability of moving from
node one to two, a probability from one to three and one to four, and all these
probabilities sum to one.
So with these probabilities we randomly pick one; say we pick two, then we move
to two and we repeat the process. So in the next step we could have moved to node
three. So that's the simple idea of random walks.
>>: You said you move to a neighbor. How did you get from three to four?
>> Stanley Kok: So three to four -- three and four are not neighboring
nodes, because there is no edge between them, so we first move
from three to one and then from one to four. Yeah. Okay.
So what's the hitting time? The hitting time from node I to J is the expected number
of steps, starting from node I, before node J is reached or hit for the first time. The
intuition here is that the smaller the hitting time, the closer node J is to I, that is,
the more similar node J is to I.
In 2007 Sarkar and Moore introduced the notion of truncated hitting time where
random walks are limited to a fixed number of steps. Let's say T steps.
In 2008, they improved upon their results by showing that you can compute the
hitting time efficiently and with high probability by sampling random walks. So
each sample is a random walk.
Now, let me show you one such sample. Again we start at node one. The box
over here will show you the order in which we traverse the nodes. And we limit the
random walk to T steps; here T is five steps.
So let's say that next we move on to node four, then to five, then to four and
six and finally to five. So five steps. And these are the hitting times. The hitting
time from node one to one is zero, because we start at one. One to four is one,
because it took just one step to go from one to four. One to six is four,
because it took one, two, three, four steps to reach six. And the hitting times
for nodes two and three are set to the maximum of five because they are not
reached at all. So this is how Sarkar and Moore defined the truncated hitting time.
Now, how do we compute the hitting times from all these sampled walks? As is
the usual case, we define a random variable X_IJ. X_IJ is the first time node J
is hit starting from node I, and if a random walk never hits node J, then X_IJ
takes on the value T. Then the hitting time H_IJ is just the expected value of this
random variable, and we compute this expected value by taking the sample
mean, a straightforward approach.
Now, Sarkar and Moore showed that with high probability our sample mean
deviates from the true hitting time by only a very small value if we are able to draw
at least this number of samples, where N is the number of nodes in the graph.
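A minimal sketch of the sampled estimator just described; the graph encoding here is my own, and the defaults are illustrative (the talk uses T = 10 and a million walks on a much larger graph):

    import random

    def sample_step(graph, node):
        """Pick a neighbor of node according to the transition probabilities.
        graph[node] is a list of (neighbor, probability) pairs summing to one."""
        r, total = random.random(), 0.0
        for neighbor, p in graph[node]:
            total += p
            if r <= total:
                return neighbor
        return graph[node][-1][0]          # guard against floating-point rounding

    def truncated_hitting_times(graph, start, T=10, M=100_000):
        """Estimate the truncated hitting time h(start, j) for every node j as the
        mean of X_ij over M sampled walks, where X_ij is the first step at which j
        is hit, or T if j is never reached within T steps."""
        totals = {j: 0.0 for j in graph}
        for _ in range(M):
            first_hit = {start: 0}
            node = start
            for step in range(1, T + 1):
                node = sample_step(graph, node)
                first_hit.setdefault(node, step)   # record only the first visit
            for j in graph:
                totals[j] += first_hit.get(j, T)
        return {j: totals[j] / M for j in graph}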
So that's the background. Let's move on to our system, the hitting time
paraphraser. Just a quick recap: our system, HTP, takes as input a query
phrase and phrase tables; there could be more than one. These phrase tables could be
alignments of English to foreign languages as well as of one foreign language
to another. And it outputs a list of paraphrases, ranked
in increasing order of hitting time.
As mentioned earlier, we could create a graph from these phrase tables: we
have nodes representing phrases and edges existing between nodes if there's a
corresponding entry in the phrase table. Now, this is fine if we have small phrase
tables. However, these phrase tables are fairly huge. If you're going to do this, you're
going to end up with a very, very big graph, which is not tractable.
So what we do is to start from the query node and then perform breadth-first
search up to a depth of D, up to a maximum of some number of nodes. In our
experiments we used a depth of 6 and a maximum of 50,000 nodes.
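Here is a minimal sketch of that breadth-first expansion; phrase_table_neighbors is a hypothetical function returning the phrases aligned to a given phrase across the phrase tables:

    from collections import deque

    def extract_subgraph(query, phrase_table_neighbors, max_depth=6, max_nodes=50_000):
        """Breadth-first search from the query phrase, stopping at max_depth or max_nodes."""
        depth_of = {query: 0}
        frontier = deque([query])
        while frontier and len(depth_of) < max_nodes:
            node = frontier.popleft()
            if depth_of[node] == max_depth:
                continue
            for neighbor in phrase_table_neighbors(node):   # phrases aligned to this one
                if neighbor not in depth_of:
                    depth_of[neighbor] = depth_of[node] + 1
                    frontier.append(neighbor)
                    if len(depth_of) >= max_nodes:
                        break
        return depth_of   # node -> depth; phrase-table edges among these nodes form the graph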
Now I'd like to zoom in on the nodes at the periphery of this graph. So you
might be asking: how do we handle the transition
probabilities of edges that go outside of this graph, those edges that cross
the periphery? We use a straightforward approach: we introduce a placeholder
node that collects those edges together and sums up their probabilities. And this
placeholder node has a transition probability of .5 back to the blue node and a
self-transition probability of .5.
Now, this is a heuristic approach that works pretty well, as you'll see in our
experiments. So now we are at the step where we have created the graph from
breadth-first search. Now we can proceed to draw the samples, the random
walks: we run M truncated random walks to estimate the truncated hitting time of
every node in the graph. We limit the random walks to 10 steps and we drew
one million samples.
So with these values of T and M, we can be 95 percent sure
that our estimate of the hitting times is no more than .03 away from the
true hitting times. So it's a pretty good estimate.
Once we have drawn all the samples and estimated the hitting times, we can
rank the nodes, and we can find those nodes with hitting times of T.
Now, these are nodes which are pretty far away from the query node, and
therefore they're not very similar to the query node, so we can just throw
them away. This helps to prune the size of the graph even further.
Now, once we have this pruned graph, we can proceed to add more nodes to the
graph, to supplement the knowledge. We do so by adding what I call feature
nodes. There are three kinds of feature nodes.
The first kind is Ngram nodes. So over here I show a snippet of the pruned graph.
Again the brown nodes are the English phrase nodes, and you have the foreign
language nodes over here, blue and purple.
To avoid clutter, I'm just going to remove the foreign
language nodes, but be reminded that they are still there. So the Ngram nodes are
over here. We have all these Ngrams; we use one- to four-grams.
There's an edge between reach and reach objective because this phrase contains
this unigram. And there's an edge between achieve the and achieve the aim
because achieve the aim contains this bigram. Why do we introduce these
Ngram nodes? To capture the intuition that if two phrases have lots of
Ngrams in common, then they tend to be closer together,
tend to be more similar.
The next kind of nodes are syntax nodes. I put syntax in quotes because they don't
really correspond to syntax as you would obtain it from a parse tree. What we do is
classify certain words, things like articles, things like interrogatives,
into classes, and we mark whether each phrase begins
with a given class. For example, over here the objective is linked to the starts-with-article
node because it starts with the word the, which is an article. Likewise, the aim is has
this link as well, because it starts with an article. Whose goal is and what goal is are
linked to the starts-with-interrogative node because whose and what belong to the
class of interrogatives.
So again these nodes capture the idea that if two phrases start and end with the same
class of words, then they tend to be pretty similar, likely good candidates for
paraphrases.
Lastly, we have the not-substring-of nodes. Why do we have these? We noticed in
our experimental results that lots of bad paraphrases were
actually phrases which are substrings or superstrings of one another. We
found that something like reach the was frequently returned as a paraphrase of reach the
objective just because it's a substring of the phrase. Now, clearly this is a
bad paraphrase.
So what we do is include a node here called not-substring-or-superstring-of.
This node will have a link to the query phrase, assuming that this
is the query phrase, and all the other phrases will be linked to this node if they are not
a substring or superstring of the query phrase.
So such nodes will be closer to the query phrase. Note there are four kinds of
edges emanating from an English phrase node. I should mention that we could also
have similar feature nodes for the foreign language nodes, but in our experiments
we did not do so; we could easily add those nodes as well.
But anyway, for each English phrase node there are four kinds of edges: edges
that go to the regular foreign language phrase nodes and to the three types of
feature nodes. Now the question is how do we divide the transition
probability mass emanating from these English nodes, because P1, P2,
P3 and P4 have to sum to one. We do so by tuning on a small set of
data. We found that P1 gets 0.1, P2 gets 0.1, and P3 and P4 each
get 0.4; that tends to work pretty well.
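As a small illustration of how that split could look in code; the data layout and the assignment of the 0.1 and 0.4 weights to specific feature types are my assumptions, not stated in the talk:

    # Tuned on a small development set, per the talk: translation edges (P1) get 0.1
    # and the remaining mass is split over the three feature-edge types.
    # Which feature type gets 0.1 versus 0.4 is a guess here.
    EDGE_TYPE_WEIGHTS = {
        "translation": 0.1,
        "ngram": 0.1,
        "syntax": 0.4,
        "not_substring": 0.4,
    }

    def transition_distribution(edges_by_type):
        """edges_by_type maps an edge type to a list of (neighbor, weight) pairs.
        Returns one outgoing distribution over neighbors that sums to one."""
        dist = {}
        for etype, edges in edges_by_type.items():
            if not edges:
                continue
            total = sum(w for _, w in edges)
            for neighbor, w in edges:
                dist[neighbor] = dist.get(neighbor, 0.0) + EDGE_TYPE_WEIGHTS[etype] * w / total
        z = sum(dist.values())
        return {n: p / z for n, p in dist.items()}  # renormalize if an edge type was absent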
Some of you might be wondering why P1 is set so low. Note that this
information has already been used when we first construct the graph and prune it
down. So we found that we need not give it too much weight anymore
because it's been used to pre-prune the graph down to a good set of candidates.
So now we are at the stage where we have added the feature nodes. Once we have added
the feature nodes, we can proceed again to draw M truncated random
walks, estimate the hitting times of all the nodes, rank the nodes in increasing
order of hitting time and return them. The returned nodes will be the
paraphrases of the query phrase.
How well does our system perform? Here come the experiments. We used the
Europarl dataset. The Europarl dataset is a dataset of minutes of European
Union proceedings. These minutes are written in 11 languages, translations of
one another. We used six of these languages: English, Danish, German,
Spanish, Finnish and Dutch. There are about a million sentences per language,
and the English sentences are aligned with the foreign languages. The
English-foreign phrase alignments are done by GIZA++, and this was done by
Callison-Burch; we used the data available from the web. But they do not have
the foreign-foreign phrase alignments, so these were done using the MSR NLP
aligner by Chris Quirk.
The system that we compare against is the SBP system. Again, this system is
available on the web; it's a Java implementation.
And we also did an ablation study, investigating how HTP does without the feature
nodes and how HTP does if it just works on the bipartite graph rather than the
general graph.
We also used the NIST dataset, which was originally used for machine
translation. It has four English sentences per Chinese sentence and
altogether about 33,000 English translations. What we did was to randomly
sample 100 English phrases from the one- to four-grams that appear in both the
NIST and Europarl datasets.
We excluded things like stop words, numbers, and phrases containing periods and
commas. Now, for each of these randomly sampled phrases, we randomly select a
sentence from the NIST dataset that contains the phrase. Then we
substituted the top one to top 10 paraphrases for that phrase.
So this means that we evaluate the correctness of the paraphrase in the context
of the sentence. Now, we did a manual evaluation. We gave each paraphrase one
of three scores. A score of zero means that the paraphrase is
clearly wrong: it's grammatically incorrect or does not preserve meaning. A score
of two means it's totally correct: grammatically correct and the meaning is preserved.
A score of one means somewhere in between, with minor grammatical errors,
things like subject-verb disagreement, the wrong tense, et cetera; the meaning is
largely preserved but not completely.
So we deem a paraphrase to be correct if it is given a score of one or two, and
wrong if it is given a score of zero. There were two evaluators, and the inter-annotator
agreement as measured by kappa is 0.62.
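For reference, the kappa mentioned here (0.62) is the standard chance-corrected agreement statistic; this is the textbook definition, not something stated in the talk:

    \kappa \;=\; \frac{p_o - p_e}{1 - p_e}
    % p_o: observed agreement between the two evaluators
    % p_e: agreement expected by chance from each evaluator's label distribution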
Now, this corresponds to substantial agreement between the two evaluators. So
let's look at the comparison between HTP and SBP in detail. First I'd like to point
out that SBP only returned paraphrases for 49 of the query phrases. So let's focus
on these 49, and let's focus on the top one, the top paraphrase as returned by each
system.
HTP got .71 of these top-one paraphrases correct, and SBP .53. So suppose we
consider the top K, where K is the number of paraphrases returned by SBP. For
example, if SBP only returned three paraphrases, then over here we look at the top
three for HTP. The performance is .56 versus .39, so HTP is still doing better.
Now, how does HTP perform if we look at the top 10 paraphrases from the HTP
system? We got a score of .54. So let's change our focus to the 51
query phrases for which SBP gave us no results. How does HTP perform? For the top-one
paraphrase we got a score of .50, lower than what we got for the 49 top
ones. And overall for the top one we got a score of .61; the corresponding
number for SBP is .53.
Now, let's look at the top 10 paraphrases for these bottom 51 queries. We got a score
of .32. And overall, looking at the top 10 paraphrases for all
queries, we got a score of .43, and the number of correct paraphrases is 420. For
SBP the corresponding numbers are .39 and 145. So there are lots of numbers going
around here. What you should note is that not only are we returning
more correct paraphrases, 420 versus 145, we are doing so with higher
precision, .43 versus .39. Okay. I should hurry up. Okay.
So how does HTP compare to HTP-no-features, that is, without the feature nodes?
This is a simple comparison because both systems return
the same number of paraphrases. If we look at the top-one paraphrase, using
features is better. If we look at the top 10 paraphrases, again using
features is better. So it's pretty clear that using features we do better.
Now, how about HTP on a general graph versus HTP restricted to a bipartite
graph? The bipartite version does not return paraphrases
for five of the queries. So let's first look at the queries for which HTP-bipartite did return
paraphrases. Looking at the top one, HTP got .62, and HTP-bipartite .58.
Looking at the top K, where K is the number of paraphrases returned by
HTP-bipartite, we did better, .46 versus .41.
Now, let's zoom in on the five queries for which HTP-bipartite
returned no paraphrases. These are the harder ones, so for the top one we got an
accuracy of .4. And overall for the top one we got .61, and the corresponding
number is .58 for HTP-bipartite.
If we look at the top K, this figure is .2. And overall, over the top 10 paraphrases,
our system got .43 and, as before, 420 correct paraphrases. The corresponding numbers
for HTP-bipartite are .41 and 361 correct. Again, not only do we get more correct
paraphrases, we do so at higher precision.
How about timing? HTP took 48 seconds per phrase, and HTP-no-features and
HTP-bipartite were in the same ballpark -- a bit faster because they don't use as
many features. But SBP took about 468 seconds per phrase.
Now, when you look at these times it's good to remember that the HTP systems are
implemented in C# and SBP in Java, and they use different kinds of data structures.
So just bear this in mind.
So, future work. For future work, I'd like to apply HTP to foreign language
paraphrases rather than just English ones. I'd like to evaluate HTP's impact on
applications; for example, I want to see whether we can improve the performance
of resource-sparse machine translation systems by adding paraphrases to the
phrase tables used by such systems.
I'd like to add more features, even features used by SBP; I'd like to add the full
syntactic trees to see whether they boost the performance of our system.
So in conclusion, I presented to you a paraphrase system based on random
walks. It uses the intuition that good paraphrases have smaller hitting times. It's
able to handle a general graph, make use of paths of length larger than two, and it
makes it easy to incorporate domain knowledge. And from our experiments we
see that HTP outperforms the state of the art.
[applause].
>>: So if I understood right, you the [inaudible] of your work [inaudible] then you
added features then did another random walk.
>> Stanley Kok: Right.
>>: So why didn't you add the features right at the beginning?
>> Stanley Kok: Because [inaudible].
>>: Okay. So it's just an efficiency point?
>> Stanley Kok: Yes.
>>: [inaudible] your proposal will [inaudible] languages?
>> Stanley Kok: Indo European languages.
>>: These are Indo European languages.
>> Stanley Kok: But these aren't European languages.
>>: Finnish was in there somewhere.
>>: I'm curious how much the four [inaudible] of English. So when you translate
a language it will encourage you to try something like Finnish early on because
it's got much more of a [inaudible] complexity and [inaudible] that you're not
seeing in English?
>> Stanley Kok: The reason I put Finnish in there: when we looked at
languages, I chose them such that there's a good spread of languages rather than
languages that are all really similar. Finnish is quite different from [inaudible].
>>: English and German and Dutch are all very similar so it's good to have
Finnish, but when you said you wanted to do, try this on foreign language instead
of just English, Finnish would be a good place to start. But Finnish especially is
[inaudible].
>>: So your results were based on whether or not your paraphrases got a score
of one or two, so it's whether they are correct or they're acceptable. Do you --
>> Stanley Kok: Correct or totally wrong. Yeah, that's right. Zero points.
>>: But you count it as a positive paraphrase. If it was either totally correct, or if
it was, you know, there's some more [inaudible] mostly okay. Do you roughly
remember what the numbers were if you just looked at the ones that were
completely correct?
>> Stanley Kok: I'm afraid not. But I definitely could put that up.
>>: The criteria for really close were things like if we inserted -- we didn't do any
modification, so if it was an interesting time and the substitution was wonderful time
for interesting time, then we still have [inaudible], so the sort of thing that would be
corrected fairly easily by the Word grammar checker. So in order to say we're
getting very close, we might have to do some manipulation here.
>>: Specifically because of that, because you want to include that into a phrase
table, so if you're getting .6 or something, it still means that two-thirds of what
you'll be putting in a phrase table is wrong, one-third of what you're putting in is
noisy, right? So you would only want to put it in -- you want to add into your
phrase table paraphrases that are actually correct, right, or as correct as
possible. You want to try to like minimize how much work you have to do.
>>: Right. The issue is when you have a phrase table, where are you applying
it? The particular instance that we had at that particular point might be had an
interesting time with the -- we can't predict that we're going to have -- we might
have something like instead of had an we might have a
substitution there. Thank you for the, thank you. Thank you for the suggestion,
yes. So thank you for the interesting time. The wonderful time. You might have
others that may not be necessarily [inaudible]. We don't know what the
substitution direction is going to be. So [inaudible] put them into the phrase tables;
in the particular environment where we tested them they may be close but just a
little off.
>> Stanley Kok: So one way to address that issue of having noisy values would be
to use the hitting time metric. Just say that, well,
if your hitting time is smaller than this, then we keep it, otherwise we throw it
away. So that would allow us to set some threshold,
because then you could filter out all the so-called bad paraphrases.
>>: Well, the fact that [inaudible] two-thirds of what you decide is a paraphrase --
two-thirds are good using this straightforward method.
>> Stanley Kok: So this -- I did not have a threshold there, so I just returned, as I
said, the top one, looking at the top one. Yeah.
>>: So how do you decide when [inaudible] and what's the threshold?
>> Stanley Kok: If we're going to use the hitting time as a threshold to choose,
then we have some -- okay. One way is obviously to do some manual evaluation
to see whether it is a good threshold. The second, which would be a better one,
is where we have an application to tell us.
>>: [inaudible] your experiments.
>> Stanley Kok: Over here these are the thresholds. I said just look at the top one,
the top-one score, and up to the top-10 score.
>>: Okay. But some of your [inaudible] have more than 10 --
>> Stanley Kok: More than 10, but we only consider up to the top 10.
Well, thank you very much.
[applause].
>> Michael Gamon: Okay. So from large graphs we're going to go to really, really
small text now, to the world of, what is it, 140 characters or less. But many of
them. And we're going to be introduced to the Turing test, which is something
really new. So Alan is going to take it.
>> Alan Ritter: Okay. Thanks, Michael. Okay. So I'm going to talk about
modeling conversations on Twitter. And this is joint work that I did this summer
with Colin Cherry and Bill Dolan.
Okay. So first some motivation. So why would we want to model the way that
people have conversations? So I think that this sort of main motivation for this is
if we want to build conversational agents that are able to go sort of beyond just
scripted dialogues and kind of, you know, say things more impromptu and have
more of a conversation with the user.
And also if we want to build chat-bots that are able to do more than just parrot
back what you said to them in a slightly differently phrased way. And some other
possible applications for this include sort of like social network analysis, better
web search for conversational text, things like forums and blogs and whatnot,
and question-answer pair extraction.
Okay. So what does Twitter have to do with modeling conversation? So most of
the posts on Twitter look something like this. It's just someone
broadcasting information about their life to all their followers, and probably nobody
cares. So this is not conversational, but somewhere between 10
to 20 percent of the posts on Twitter are actually in reply to another post. And
these form sort of short conversations.
So I've just shown a short example here. So this user said, you know, I'm going
to the beach this weekend, I'll be there until Tuesday, life is good. Someone
else responded to them saying enjoy the beach, hope you have great weather.
And then they responded by saying thank you. And that ends the conversation.
And this is a very typical kind of conversation that we find on Twitter.
So we gathered a whole bunch of conversations from Twitter using their public
API. And so we did this basically by watching the public timeline which gives us
20 randomly selected Twitter posts per minute and we use that just to get a
random seed of Twitter users. And from there we can kind of crawl all the users
that they're following and crawl their followers and so on, and basically query to
get all the posts of each of the users that we have.
And so if any of those posts are in reply to another post, then we can kind of
follow that back to collect the conversation. And so we -- you know, we got a lot
of conversations. We got about a million conversations of length two, 200,000 of
length three, 100,000 of length four and so on. And you can see that the
frequency seems to drop off very quickly with the length of conversation.
And in fact, if you plot frequency versus length on a log log scale, it looks pretty
much linear, and this sort of indicates that it's likely to be a power law, which I
think is kind of interesting.
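A small sketch of that check: fit a least-squares line to log-frequency versus log-length; a roughly linear fit suggests a power law. The counts below are the rounded figures from the talk.

    import math

    # Approximate counts from the talk: conversation length -> number of conversations.
    counts = {2: 1_000_000, 3: 200_000, 4: 100_000}

    xs = [math.log(length) for length in counts]
    ys = [math.log(freq) for freq in counts.values()]

    # Ordinary least-squares slope and intercept in log-log space.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    print(f"power-law exponent ~ {slope:.2f}")   # frequency ~ length ** slope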
So our basic hypothesis is that there's some kind of latent structure behind how
people have conversations. So basically, you know, each post can be classified
into a class which describes its purpose in the conversation. So you might see
like an initial status post followed by a comment and then sort of followed by a
thank you. And this is sort of the kind of thing we want to model.
And so further we're saying that there's sort of a Markov property in that each
type of post depends only on the previous one and not on anything else. And
then the words in the -- in each sentence depend on the type of post or the
dialogue act I guess as I'm going to call it.
So in an initial status post the word I is going to have high probability, and then
we're going to see a lot of verbs like going because the person is sort of
describing what they're doing, whereas in a comment you might see you with
very high probability as opposed to I and then words like enjoy and hope and so
on.
But then there are also these other words that don't have anything to do with
the dialogue act but are just sort of about the topic of the conversation. And so in
this example we see words like beach, weekend, Tuesday, weather, and so on,
and so these don't indicate the dialogue act but they're just kind of sprinkled
throughout. And they're all sort of semantically related to each other.
Okay. So in order to model these dialogue acts and the transitions we looked at
this -- these content models from the summarization literature, and they are used
specifically for a multi-document summarization on news articles. And so
basically they are looking at very specific genres of news articles, for example
articles all about earthquakes, and they found that, you know, typically people will
sort of in the first sentence describe you know the location and the magnitude of
the quake, and then maybe in the second sentence they'll tend to describe, you
know, how many people were killed or something like this.
So they have sort of a very predictable structure. And they model that using
hidden Markov models basically in which the states emit whole sentences and,
you know, so they have a language model for each state that's sort of the
emission model for the HMM. And then they learn the parameters using
Viterbi-EM.
Okay. And so here I've just shown sort of a graphical model representation of
what they did. And so this is basically just a Bayesian network where each
dialogue act depends on the previous one and then it emits a bunch of words.
And so these boxes are the sort of plate notation which represents replication.
So each act is generating a bunch of words independently. So you can read this
box as like a sentence basically.
Okay. So the -- you know, this looks really nice for what we want to do in
terms of modeling dialogue acts and Twitter conversations. Because it models
this transition between latent structure, which corresponds I think very well to the
idea of what dialogue acts are.
And in addition it's totally unsupervised. And so we don't need any training data
or, you know, labeled data in order to be able to do this, which is really
attractive.
So there's a couple problems with this, however. And so if you just apply this
naively without doing any kind of abstraction, it tends to group together
sentences that belong to the same topic instead of the same dialogue act. And
so in order to deal with this problem, Barzilay and Lee masked out all the named
entities in the text. But this is a bit more of a problem for us because we can't
rely on capitalization to do a good job of identifying named entities in Twitter.
And so if you don't sort of deal with this, you'll get things like all the sentences
involving Seattle will sort of get clustered into the same sentence. Or same
cluster, excuse me.
So what we really want are clusters like, you know, status, comment, question,
response, thank you and not clusters about like iPod, sandwiches and vacations.
And so we ran the content modeling approach on our Twitter data and we found
-- this is sort of one example of one of the clusters we got. And you can see here
these sentences don't really have anything to do with each other in terms of
dialogue act, but they're really very much on the same topic. They're all sort of
about web browsers and operating systems.
So you see words like Safari, Windows, iPhone, Chrome, Firefox and this kind of
thing. So this is -- seems sort of problematic for what we want to do.
So our goal basically became to separate out the vocabulary into words which
indicate the dialogue act and then conversation specific topic words. And in
order to do this, we used an LDA-style topic model where basically each word is
going to be generated from one of three different sources. So it could be just
general English. And this is just sort of a way to flexibly model stop words. It
could be specific to the conversations vocabulary. So these are like topics. Or it
could come from the dialogue act. And hopefully, you know, we'll get sort of a
stronger signal here.
And this is similar to the model that Lucy and Aria Haghighi used in this year's
[inaudible], and this was also for summarization.
Okay. So here's this graphical model representation that I showed earlier from
the Barzilay and Lee paper. And basically what we're proposing to add to this is
the following: So we've added a hidden state here for each word which
determines which source it was drawn from. So it could either come from the
dialogue act, it could come from the topic of the conversation, or it could just
come from general English.
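A minimal sketch of the generative story just described: one act per post, a Markov chain over acts, and a per-word switch among the act, the conversation's topic, and general English. All distributions and dimensions here are illustrative placeholders, not the trained model.

    import random

    def sample_categorical(dist):
        """Draw a key from a dict mapping keys to probabilities."""
        r, total = random.random(), 0.0
        for key, p in dist.items():
            total += p
            if r <= total:
                return key
        return key   # fall through only on floating-point rounding

    def generate_conversation(act_transitions, act_words, topic_words, english_words,
                              source_probs, n_posts=3, words_per_post=8):
        """act_transitions[a] is a distribution over next acts (including from 'START');
        act_words[a], topic_words and english_words are word distributions;
        source_probs picks which of the three sources emits each word."""
        act = "START"
        conversation = []
        for _ in range(n_posts):
            act = sample_categorical(act_transitions[act])   # Markov chain over dialogue acts
            post = []
            for _ in range(words_per_post):
                source = sample_categorical(source_probs)    # act vs. topic vs. general English
                if source == "act":
                    post.append(sample_categorical(act_words[act]))
                elif source == "topic":
                    post.append(sample_categorical(topic_words))
                else:
                    post.append(sample_categorical(english_words))
            conversation.append((act, " ".join(post)))
        return conversation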
And so in order to do inference in this model, we used collapsed Gibbs
sampling, which is just the standard technique for this kind of thing, where we
sample each variable in turn conditioned on the assignment to all the other
variables, integrating out the parameters. But we found that there are a lot of
different hyperparameters that we have to set, and these things actually really
affect the results of the output.
So to solve this problem -- well, we tried doing a giant grid search at first, but this
was just kind of infeasible, so we used this idea of sampling the hyperparameters,
where we just treat them like other random variables and just sample
them in the Gibbs sampling procedure. And we used the slice sampling approach,
which we found to work pretty well.
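For reference, here is a generic univariate slice sampler (stepping out plus shrinkage), the kind of update that can be dropped into a Gibbs sweep to resample a hyperparameter. This is a standard textbook procedure, not the authors' code.

    import math, random

    def slice_sample(x0, log_prob, width=1.0, max_steps=50):
        """One slice-sampling update for a scalar x with unnormalized log density log_prob.
        (To keep a concentration hyperparameter positive, run this on log(alpha).)"""
        log_y = log_prob(x0) + math.log(random.random())     # slice height under the density
        left = x0 - width * random.random()                  # step out to bracket the slice
        right = left + width
        steps = max_steps
        while steps > 0 and log_prob(left) > log_y:
            left -= width
            steps -= 1
        steps = max_steps
        while steps > 0 and log_prob(right) > log_y:
            right += width
            steps -= 1
        while True:                                          # shrink until a point lands in the slice
            x1 = left + random.random() * (right - left)
            if log_prob(x1) > log_y:
                return x1
            if x1 < x0:
                left = x1
            else:
                right = x1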
So another big problem is probability estimation. And so this is basically the
problem of estimating the partition function, which is intractable in these kinds of
models in general. So we need to use some kind of approximation. And I'm not
going to go into a lot of detail on this, but we used this Chib-style estimator,
which has been proposed recently in the literature for this kind of thing. And we
found that it worked pretty well.
Okay. So we did a qualitative evaluation. We trained our model on 10,000
Twitter conversations and then sort of looked at the dialogue acts it generated to
try and see if they make sense.
And so what I've shown here is a visualization of the transition matrix between
dialogue acts. So each numbered box represents a dialogue act. And an arrow
is drawn between two acts if the probability of that transition is higher than
some threshold. And I believe I set it to like .15 here, and there's about 10 acts,
so it's sort of like higher than random chance.
Okay. And so now in order to visualize each of the acts, I've shown a word cloud
where the size of the word is proportional to its probability of being generated by
that act. And so you can see that this act in particular is sort of a starting state;
it sort of starts a conversation, so it's transitioned to from the start state, and it
has the word I with very high probability. And then you'll also notice that it has a
lot of words like today, going, getting, these kinds of verbs that describe what
someone's doing. And you'll see a lot of words like tonight, morning, night,
today, tomorrow, a lot of sort of time words.
So this is just sort of like this typical kind of post on Twitter where someone's just
describing what they're currently doing. And this sort of I think makes a lot of
sense.
So this one can be transitioned to from any of the initial starting posts, and it's
pretty clearly a question. The question mark has super high probability. And
then you also see that you has very high probability here, where in the previous
one I had high probability but you didn't see you. And then you see things like
what are your, et cetera.
And then this act is also a question, but this is sort of an initial question that's
starting a conversation. And so in this one you'll see a lot of the same things as in
the last question state, but there are also some words that we didn't see before, so
you'll see things like know, anyone, why, who. And from looking at some of the examples,
it seems like this is really someone just asking a question to their followers, and
it's typically just asking an opinion about something.
And then this one is a little bit different. So in just the preprocessing of the
corpus I replaced all the user names with this USER tag and all the URLs with the
URL tag. So this one contains a lot of user names and URLs. And it also has
this word RT, which has sort of special significance in Twitter. It stands for
retweet, which is like reposting someone else's post that you found interesting.
So this is sort of like broadcasting some interesting information you found.
And sort of in response to this, we see this sort of reaction state where the
exclamation point has super high probability, and you see things like haha, LOL,
thanks. You can just imagine sort of like reacting to some funny link someone
sent you.
Okay. So we also did a quantitative evaluation where we try to measure how
well our model will predict the order of the sentences in a conversation. And so
in order to do this we had a test set of 1,000 conversations. And for each one of
those, we generated all possible permutations, all N factorial permutations
of that conversation, and evaluated their probability under our model. And then we
picked the one with highest probability and compared that order to the original
sentence order.
And so in order to compare orderings, we used this Kendall Tau metric, which is
basically a correlation between rankings. So if the two rankings are exactly the
same, the value will be one. If they're exactly opposite, the value will be minus
one. And anything greater than zero is like a positive correlation.
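A small sketch of that comparison metric (assuming no tied positions):

    from itertools import combinations

    def kendall_tau(predicted, reference):
        """Kendall tau between two orderings of the same items: +1 for identical order,
        -1 for reversed order, around 0 for uncorrelated orders."""
        position = {item: i for i, item in enumerate(reference)}
        concordant = discordant = 0
        for a, b in combinations(predicted, 2):     # every pair, in predicted order
            if position[a] < position[b]:
                concordant += 1
            else:
                discordant += 1
        n_pairs = len(predicted) * (len(predicted) - 1) / 2
        return (concordant - discordant) / n_pairs

    # Example: a reversed three-sentence conversation scores -1.0.
    # kendall_tau(["s3", "s2", "s1"], ["s1", "s2", "s3"])  ->  -1.0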
Okay. And so here I'm just showing some experimental results. And so this is
from the content modeling approach from Barzilay and Lee. And these are all
trained using EM, using various levels of abstraction. So we tried using
[inaudible] tags and then word clusters, but it turned out that just using the raw
word unigrams worked best in terms of recovering sentence order.
And then our content model -- excuse me, our conversation model with the topics
clustered does a little bit better, it gives a sort of a nice boost. However, we
found that when we just leave the topics out and just use the fully Bayesian
version of the content model it tends to do a little bit better.
And I tried running sort of a whole lot of experiments to try and sort of get the
topics working, and I mean they beat the EM version, but we weren't able to get
them to beat the just fully Bayesian content model.
And so why is it that the fully Bayesian content model is winning? So it could be
outperforming the version with the topics for a couple of different reasons. So for
one thing, it has fewer parameters and hidden variables to set. So this could just
be a problem with getting stuck in a local maximum.
In addition, it could be that modeling the topic transitions is actually important in
predicting sentence order. So even though it doesn't give us really nice looking
dialogue acts, it could just be important for that task.
And so in terms of why it's outperforming EM, I don't think this is super
controversial. There have been a couple of other papers that show that this full
Bayesian inference is better. And there are a couple, you know, possible
explanations for that. So we're integrating out the parameters; we're not just
using a maximum likelihood estimate. And we're also estimating the hyperparameters
from data. And this gave us a little boost, but I don't think it explains all of
the difference. I think probably most of it is due to the
sparse prior.
And in standard EM, you can't really use a sparse prior,
whereas with the Gibbs sampling you can. And actually the hyperparameters
that we found were sparse, so it was actually using it.
Okay. So in conclusion, we've presented a corpus of Twitter conversations,
which we're actually planning, I believe, to make publicly available now. And we
showed that they have some sort of common structure which looks like it would
be interesting to exploit for applications. And we showed that Gibbs
sampling and full Bayesian inference seem to outperform EM. Thanks.
[applause].
>> Alan Ritter: Yes?
>>: I don't tweet much. Can you tell us again how you identify what constitutes a
conversation and how you bound it?
>> Alan Ritter: Right. So basically when you hit the reply button, it sort of stores
that information. So just in the API it has like sort of like if the tweet's in reply to
something, it will have the ID of the post that it's in reply to. And then you can
just query for that, you know. And then from that one, if it's in reply, query sort of
chain it back, you know. Does this make sense?
>>: It does. But I think you may have -- you may have identified conversations
that really aren't --
>> Alan Ritter: So like how so?
>>: Tonight I say hey, you want to get dinner, and I tweet back tomorrow, sorry I
missed your message. Those are related. That's a kind of conversation, even
though it spans a day.
>> Alan Ritter: Right.
>>: The following day I text you back something totally unrelated.
>> Alan Ritter: Yeah, you're probably right. That probably does happen in our
dataset I think most of them are actually real conversations. There are a lot of
cases where we sort of miss part of the conversation like it -- they hit -- forget to
hit the reply button or something and we get sort of these two parts of it that are
separate.
So there's issues like that with the data definitely. But for the most part I think it
looks pretty good.
>>: I don't tweet, I don't follow Twitter, but it seems to me you have one person
posting something and then several different people replying so that's not one
coherent conversation.
>> Alan Ritter: Right. So when it's sort of a conversation tree.
>>: Yeah.
>> Alan Ritter: Right. So we tried to -- like we didn't -- we tried to explicitly
filter out the multiparty conversations just to try and keep the data clean. But,
yeah, you could have sort of multiple conversations that began with the same
tweet. And, yeah, that is kind of -- I don't know, I'm not sure the best way to
handle that. But, yeah, that's a good point. Yeah. Uh-huh?
>>: So you found -- so you've had ways to go up the conversations, but is there
any way to go down the conversation?
>> Alan Ritter: No, not that I could figure out at least. So that is kind of a
problem. So we might not see the very end, you know. So we're -- I mean, even
if you think about it just going forward in time, if we pick something that happened right
now, we don't know whether the person is going to reply. So I think that's
always going to be just a problem. Yeah?
>>: You mentioned that you were looking for trying to pull out things like status
and common questions and stuff like that as the categories. Did you come up
with those ideas a priori, or were those based on examining what the model
found?
>> Alan Ritter: Yes. So those were just sort of my interpretations of what, you
know, the clusters looked like basically. You know what I mean? So there's no
labeled data. This is all totally unsupervised, and it just sort of found those
clusters just by itself, you know.
Yes? Or I think you were first.
>>: I'm sort of wondering here how you're doing your own -- what -- I guess two
questions. So tweets and Twitter seems like a very small domain actually, and
this particular problem you're looking at is very similar to like e-mail threads or
like Facebook conversations. And there's a whole genre there. Have you
thought about extending it or expanding it or trying to relate it some kind of way
to other similar kinds of electronic communication -- I don't even know what to
call it. Genre is as good a word.
>> Alan Ritter: Yes. So there's a lot of things like IRC chat and Facebook and
so on. I think Twitter is good for this because it's the default setting on people's
accounts is open, totally public so you can go crawl it. Like Facebook the default
is private.
>>: Okay.
>> Alan Ritter: It's sort of harder. And I think people have looked at e-mail and
stuff, too. Yeah.
Was there a question?
>>: Well, you kind of answered. It was leading up to my question. He kind of
went to it and you half answered it. I was going to ask why you chose Twitter
instead of Facebook. Was it because of -- and I'm also thinking because Facebook
would have images and video and --
>> Alan Ritter: Yeah. But Twitter actually --
>>: Twitter's all text?
>> Alan Ritter: Twitter actually does have images, too. There's like this tweet
pick or something like that. People will put links to pictures, you know, and be
like haha, look at this, you know. And it's -- I mean you need some vision
algorithms or something to handle that. I don't know. Yes, Peter?
>>: [inaudible] in your transition graph. So do we have to choose a threshold
[inaudible]? How do you choose that [inaudible].
>> Alan Ritter: You mean picking the number of clusters? Is this your question?
Yeah, so yes, we picked 10 for that, mostly just because it's easy to visualize,
you know what I mean? So I think the best performing ones I think were about
20 clusters, if I recall. So we did vary the number of clusters and look at that.
Yeah.
>>: So when you parsed those, was it all strictly straight up like the
[inaudible] in the news stream, or [inaudible] user click in directed at someone in
the middle of the message? Did you catch that too or --
>> Alan Ritter: So I think your question is about user names. I'm not sure I quite
got it.
got it.
>>: So when you caught your -- when you caught the [inaudible] to start a
conversation, was it simply because the start -- you catch the user name at the
beginning, or was it like through -- or what determined --
>> Alan Ritter: Right. I think I see what you're saying. So right. So there's a
convention that if you put the user name at the beginning of your post, you're
directing it at that person. And so that wasn't actually -- we didn't use that to
get this data. So what we did is, if you hit the reply button, Twitter actually
records that; it's like this post was in reply to this other post, you know what I
mean?
So trying to recover the conversations just using the user name I think might
be a bit difficult. I don't think it would be as -- because you don't know sort of like
which -- you don't really know which post it was in response to, you know what I
mean? Yeah?
>>: Well, I was just thinking that with the user names, if I remember correctly,
Twitter -- if you look at somebody's profile you could see all the people that follow
them?
>> Alan Ritter: Right.
>>: So it seems that if you're looking at -- if you're looking at words that happen
in a particular post and grouping them by that, you might be able to look at a
particular user and their followers and recreate conversations or reconnect
conversations that weren't using the reply.
>> Alan Ritter: Right. Yes. So I think it's a little bit more. Because you're never
going to know exactly which post it was in reply to. So I mean one user could
say a bunch of things, and then another user could say a bunch, you know, and
sort of -- I mean, I think it would be sort of an interesting problem to try and do
that. I think you'd need some sort of non deterministic thing to figure it out, you
know.
>>: I'm just thinking that, yeah, one user might say a bunch of things to which
a different user responds with only one thing, but that's how conversations tend to go,
one person going on and on and on, especially given the artificial length limit
on Twitter.
>> Alan Ritter: Right. That's a good point. Yeah, yeah. And yeah, I mean
there's work people have done on IRC chat logs in trying to disentangle
conversations, you know, like where there's multiple threads going on at the same time.
I think that would be interesting to address. Yes?
>>: Part of your presentation was an example of styles of conversations. Kind of
following what he had just [inaudible], is there a way to take part of a
conversation, which may have been built from followers, and find out if it's at a
particular stage in a set of conversations?
>> Alan Ritter: Yeah, that's a good question. Yeah, we didn't look at that at all,
but, yeah, I mean you could take sort of fragments of conversations from different
users and sort of like try and see, you know, if are they connected in any way.
We haven't looked at that, but I think that would be an interesting thing to do.
Uh-huh?
>>: Perhaps you could address this -- in your presentation you have
[inaudible] that in terms of the structures in conversations in Twitter, what --
would you be able to -- how could that help -- let me ask -- in terms of
technologies and stuff and the structures themselves, what can the -- what will
that be able to do for, I guess, future technologies?
>> Alan Ritter: You're just asking about like applications?
>>: Yes, application.
>> Alan Ritter: Yeah, right. I gave sort of a brief motivation for this at the
beginning. I think sort of like I kind of mentioned the idea of sort of like building a
conversational agent, you know, I think is sort of like a big -- was sort of the
driving thing for us at the beginning of this project. Yeah?
>>: Well, what about search? I mean, do you think that could play a role in
search and stuff, to build better search technologies?
>> Alan Ritter: Right. Right. So I think, yeah, if you're trying to like search facts
or like message boards or something like that, I think something like this might be
helpful. Like if you're looking for things that are an answer to a question. I mean
honestly I haven't spent like a whole lot of time thinking about this. But I think I
mean it seems like you should be able to help with this kind of information
somehow. Like how people are searching conversational text, you know. Yes?
>>: I think there's added value from considering the timestamps on the tweets
like certain types of conversations might be faster or slower?
>> Alan Ritter: Right. Right. Yes. I think I noticed most of them seem to take
place within the time span of like 20 minutes or something like that. And, yeah, I
think that's a great thing to look at. It could somehow indicate some important
information. I can't think of off the top of my head what, but, yeah, I mean
another thing too, a lot of these -- sort of like happening on mobile devices. So
maybe the really quick conversations are more likely to be on mobile than on the
web interface or something like that. But, yeah.
>>: Going to the time, did you notice that -- say there's a [inaudible] say at two
a.m. versus two p.m. [laughter] was there --
>> Alan Ritter: Right. Yeah. I mean, the frequency of tweets
definitely varies a lot by time. So at night there's probably -- I think there's fewer
than during, you know, normal hours in North America or something like that.
But, yeah, I don't know about the types of conversations. I didn't look at that.
That would be an interesting thing to look at. I don't know. It's a good idea.
>>: This is [inaudible]. What did you use to make the [inaudible]. Did you do it
by hand?
>> Alan Ritter: No, no, I used -- there was sort of a web interface with a
Java applet called Wordle, I think it was called.
>>: Thanks.
>> Alan Ritter: Okay.
[applause].
>> Michael Gamon: Okay. So now we're going to go from the short tweets to
the long protein names and long concepts in biomedical text. So this is Hoifung
Poon, and he's done some work with Lucy Vanderwende on joint inference for
knowledge extraction from biomedical literature.
>> Hoifung Poon: Okay. Thanks, Michael, for the introduction. So this summer I
had the great pleasure of working with Lucy on this exciting project on joint
inference for bio-event extraction. For those who have been to my intern exit
talk, it's pretty much the same talk with a compression ratio of three.
I will start with some motivation, then I will talk about the task of bio-event
extraction, and finally I will present our system and some preliminary results.
So before we dive into this bio-event extraction, let's step back and take a look at
the bigger picture. In the past decade or so, with the advent of the World Wide
Web, there emerged a great vision of knowledge extraction from the web. And the
idea is to go from the unstructured or semi-structured text available
online and extract structured knowledge from it.
Such a great vision doesn't really need much advertisement beyond itself.
If we can extract knowledge automatically and reliably, even to a
limited extent, we can construct a gigantic knowledge base, probably the largest in
the world, and this can facilitate all kinds of great applications like semantic
search, question answering and so forth. And finally, by breaking the knowledge
acquisition bottleneck, the days towards solving AI could be numbered.
So this is a great vision, and in the past decade or so there have been some
very substantial efforts, including notable local systems like TextRunner and
MindNet; however, the problem still remains largely unsolved. The natural
question to ask is: should we tackle the whole web from day one, or is there a
domain that we should prioritize? Presumably such a domain would either be
more urgent to tackle or easier to make progress on, and the lessons learned
and the infrastructure built for this domain should be general enough to serve as
a starting point for tackling the rest of the web.
I will argue that the biomedical literature is such a great starting point for
knowledge extraction from the web. Online in PubMed there are about 18
million abstracts right now, and the growth is exponential: a few years back, the
reported number I saw was about 14 million. So you can do the math.
And if we can have some success here, it will have a huge impact on biomedical
research. When you do biomedical research, you have to have access to a
broad spectrum of information. If you are investigating a disease like diabetes
or AIDS, the number of relevant genes might be in the hundreds, even
thousands, so you really want to know all of this relevant information.
On the other side, each gene under the sun might already have been
investigated by some lab somewhere, sometime. However, that information,
maybe written up in a nice paper, is buried under these millions of articles, and
it's very difficult for researchers to find.
One biomedical research student at UW actually told me that they couldn't
really keep up with all the literature in their own subfield, so what they do is just
follow a few top labs and pay attention to what those labs do. You can imagine
how much research effort has been wasted because of this, and how many
discovery opportunities have been missed as a result.
And needless to say, if we can make progress in this direction, there is also a
flip side in the commercial world: this will have a huge impact on drug design,
and the big pharmaceutical companies will really love it. Finally, this domain is
attractive, especially for this audience, because the articles are all written in
grammatical English. From the previous talk, you have already seen that some
of the language out there may not be so amenable to processing. In this
domain, however, you're supposed to write in grammatical English, and mostly
in English, so you can apply any of the advanced linguistic theories or NLP tools
to this domain.
So hopefully I have convinced you that this is a worthy endeavor. Now the
natural question to ask is: why is this problem hard?
To get a feeling for the domain, here is a typical snippet from just one sentence
in one abstract -- this is not even the full sentence. You can see that this small
number of characters actually [inaudible] a great amount of knowledge. It
describes a bunch of events, marked in red here, a bunch of proteins and
genes, marked in green, and some localization sites, marked in blue. And it
actually conveys a complex nested structure: you have an up-regulation event,
another regulation event signified by 'involvement', and yet another signified by
'activation', and they also have event argument structure -- IL-10 is the theme of
the regulation event, and there are also the site and the cause and so forth.
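Just to make the nesting concrete, the kind of structure being described might be
sketched as a Python dictionary like the one below. This is purely illustrative: IL-10
and the trigger types come from the description above, but the site and cause values
are hypothetical placeholders, not the actual annotation on the slide.

    # Purely illustrative sketch of a nested bio-event;
    # the "site" and "cause" values are hypothetical placeholders.
    event = {
        "type": "Positive_regulation",     # signified by a trigger like "activation"
        "theme": {
            "type": "Regulation",          # signified by a trigger like "involvement"
            "theme": "IL-10",              # protein mention from the example
            "site": "<site mention>",      # hypothetical placeholder
            "cause": "<protein mention>",  # hypothetical placeholder
        },
    }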
So again, you can imagine that techniques like keyword search or pattern
matching couldn't really even scratch the surface. This is the first major reason
why the information is so hard to get. Another major reason, which is really
common in NLP, is that you can say the same thing in many, many different
ways. So here is what's called a trigger word, which signifies a
negative-regulation event. Just within 1,000 abstracts from PubMed, the
numbers here are basically the number of times each of the following words is
used to signify that event, and you can see this clearly.
On the flip side, the same word can also signify very different events. For
example, in this corpus of 1,000 abstracts, the word 'appearance' is actually
used to signify five different event types, including some of the examples here.
And there are also some very subtle denotations. For example, in this sentence,
in the first line, these cells are deficient in some expression, although their
production is normal.
So 'normal' here actually signifies a regulation event, but if you just look at the
second part of the clause -- the subclause 'some production is normal' -- you
only really get a clue about the event when you consider the context: there is a
parallel event described earlier, and this is actually referring to the same event,
but saying that it doesn't happen in this context.
By now you have seen why this problem is hard. And in fact, this is actually a
great opportunity for linguists and NLP researchers, precisely because of all this
complication in human language. Bio-NLP is an emerging field; right now it has
a regular workshop at [inaudible] ACL, and I won't be surprised if it becomes a
full-blown conference very soon.
Initially the community started by just trying to recognize protein names. That
problem is largely solved; we can get into the high 80s. Then they ventured into
trying to detect whether two proteins interact with each other. This has been
going on for a decade or so, but the top systems are still around 60; it's largely
unsolved. And that's mostly because, when people work on this problem, they
recognize that instead of just treating it as binary classification using [inaudible],
it's actually necessary to go deeper into the language structure and tackle
extracting the detailed bio-events. That is the shared task of this year. And
obviously the story doesn't end here: there has already been effort going on to
construct a pathway corpus, and eventually you want to build entire gene
networks and understand how they interact and so forth.
In this talk we will focus on bio-event extraction. By definition, a bio-event refers
to a state change of bio-molecules. In this shared task they focus on these nine
event types. The first five event types are relatively simple because they can
only take one theme. Binding is a lot more complicated: it can take up to five
themes, so you can have A binds to B and C and D, and so forth. And then
there is regulation, which they distinguish into positive, negative, and ambiguous
-- where you don't know which way it goes. This is the most complex, as we
saw in our example.
The data given in the shared task is: you are given the text, obviously, and then
-- because protein recognition is a separate problem that they want to factor out
of the task -- they also label where the proteins are. Your system is then
supposed to predict this block of information, where each line refers either to an
event type declaration or to an actual theme or cause. The first two lines say
that there is a regulation event signified by 'involvement' and a positive
regulation signified by 'activation', and the last line says that event E1 takes E2
as its theme and T3 as its cause. So here you can see that a regulation event
can take another event as an argument.
You don't have to look at the numbers in detail. The whole point of this slide is
that this shared task attracted a great number of participants, and as a result the
performance numbers are pretty representative of the current state of the art.
The top system is from a group at the University of Turku in Finland. On the
simple events, the first five types, they get an F1 of about 70, which is pretty
decent. However, the binding and regulation events are the really difficult ones,
and overall the F1 is about 52 and the precision about 58. So both are still not
very satisfying.
The top system also has another problem at a high level: it adopts a pipeline
architecture. Actually, the top three systems all adopt a pipeline architecture.
They typically first determine a number of candidate events and their types, and
then, starting with this list of candidates, they classify, for each pair of a
candidate and a protein or between two candidates, whether the latter is a
theme or cause of the former.
The Turku system uses an SVM for classification. A major drawback of this
approach is that there is no way to feed information back once you have
committed to the list of candidates, and the Turku system actually has to do
some very ad hoc engineering to facilitate the second-stage learning. Another
major drawback is that the prediction of each theme or cause, and of each event
and its type, is made totally independently of the others, so you lose a lot of
opportunity for the decisions to inform each other.
Another interesting data point is the Tokyo team; they used a conditional
random field, they are actually from the organizers, and they get an F1 of 37.
So the bottom line is that this task is very challenging, indeed reflecting how
complicated human language is.
So when we came to design our system, our first decision was that even though
the top systems all use a pipeline architecture, we think that's not satisfying and
not good for pushing the performance to the next level, so our first criterion was
to jointly predict events and arguments. Also, there is certain prior knowledge
that we want to be able to incorporate straightforwardly. Moreover, we saw that
a conditional random field doesn't cut it, because on inspection the labels don't
really correlate in terms of linear context, but they probably do in syntactic
context.
So for example, if you know that B is the theme of A, and C is in conjunction
with B -- as in 'A regulates B and C' -- then you would conclude that C is
probably also a theme of A.
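A regularity like that is naturally written as a weighted first-order formula; as a
sketch (with predicate names of my own choosing, not necessarily those of the
actual MLN), it might look like:

    $w:\quad \mathit{Theme}(a, b) \land \mathit{Conjunct}(b, c) \Rightarrow \mathit{Theme}(a, c)$

A positive weight $w$ makes worlds in which a conjunct of a theme is itself a
theme more probable, without turning the rule into a hard requirement.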
So because of this, we picked Markov logic as our framework. Markov logic is a
[inaudible] developed by Matt Richardson and Pedro Domingos, and it can
compactly represent very complex relations and also handle uncertainty.
The bottom line is that each Markov logic network basically consists of a set of
first-order logic formulas, each with a weight. This defines a joint probability
distribution using a log-linear model, where n_i is the number of true groundings
of formula i. You don't really need to pay too much attention to the detail.
In the interest of time I'll skip the examples. Markov logic is a natural framework
for conducting joint inference, and moreover there are already a lot of efficient
algorithms available in an open-source implementation.
Our Markov logic network for bio-event extraction consists of a few groups of
formulas. Another advantage of Markov logic is that you can express these
intuitive regularities very compactly and straightforwardly. For example, we
have the prior knowledge that an event must have a theme, that only regulation
events can have a cause, and so forth, and we incorporate those as hard
constraints.
Then there is soft evidence: for example, the word 'activation' probably refers to
a positive-regulation event. This doesn't hold all the time, but it holds statistically
in quite a number of cases. Then there is the syntactic evidence we already
mentioned, for example that a conjunct of a theme is likely a theme as well.
And finally there are rules combining lexical and syntactic evidence, for example
that the subject of the word 'regulate' probably signifies -- actually this should be
a cause, I'm sorry.
For these last three groups, Alchemy can learn a weight for each specific
constant, so you can specify them very compactly with a few rules -- sorry about
that. The first three rule groups you can consider joint inference rules, because
they make decisions about events and themes, and about different themes and
events, together. Our base MLN basically consists of the first two groups, and
then we add the lexical evidence and the syntactic evidence.
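As a rough sketch of how these rule groups fit together -- written in Python rather
than actual Alchemy syntax, with invented weights and word lists -- you can think of
the soft rules as weighted features (one learned weight per trigger word and event
type, in the spirit of Alchemy's per-constant weights) and the hard rules as
constraints that any predicted event must satisfy:

    # Sketch only: a toy scorer with the flavor of the MLN rules described above.
    # The weights and vocabularies are invented; the real system learns its weights.

    # Soft lexical evidence: one weight per (trigger word, event type) pair.
    LEXICAL_WEIGHTS = {
        ("activation", "Positive_regulation"): 1.8,
        ("involvement", "Regulation"): 0.9,
    }

    # Soft syntactic evidence: a conjunct of a theme is likely a theme as well.
    CONJUNCT_THEME_WEIGHT = 1.2

    def violates_hard_constraints(event):
        """Hard rules: every event must have a theme; only regulation events take a cause."""
        if not event.get("themes"):
            return True
        if event.get("cause") and "regulation" not in event["type"].lower():
            return True
        return False

    def log_score(event, conjunct_pairs):
        """Unnormalized log-score of one candidate event under the toy soft rules."""
        if violates_hard_constraints(event):
            return float("-inf")          # hard constraints act like infinite-weight rules
        score = LEXICAL_WEIGHTS.get((event["trigger_word"], event["type"]), 0.0)
        for a, b in conjunct_pairs:       # reward theme pairs that are syntactic conjuncts
            if a in event["themes"] and b in event["themes"]:
                score += CONJUNCT_THEME_WEIGHT
        return score

In the actual MLN, of course, the weights are learned and inference is carried out
jointly over all candidate events and arguments, rather than scoring each event in
isolation as this toy does; that joint step is where the gain over the pipeline
systems is attributed.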
So here are the preliminary numbers on the development set. We're evaluating
on ground atoms, which is not entirely the same as evaluating on events, but it
gives us an idea.
We get an F1 of 64, and more importantly we get a pretty high precision of 74.
Let's take a look at the effect of the features. If we start with just the base MLN,
it gets an F1 of 34, but as soon as we add the further joint inference rules we get
a huge improvement. And then if we add some lexical evidence and also some
preprocessing we get to 64.
We also looked at the training size, represented here by the number of
abstracts. Going from 50 to 100 abstracts we get a huge improvement, and then
it starts to level off.
Here are the numbers just on the event types -- whether our system can predict
the event type correctly. For those who have seen my last talk, these numbers
are slightly better because I have some new results.
What you can see is that for the first few types of events, the F1 is pretty high.
So one idea would be to actually start using these extractions: maybe we cannot
use them for all the events, but we can start with the most accurate ones to start
building a community, attracting initial consumers, et cetera.
Another take-home message is that in general our precision is pretty high
compared to the pipeline systems, and this is arguably good for this domain,
because whatever you propose had better be correct. It's okay to miss some,
but what you do propose had better not contain a whole lot of noise.
In future work, we want to incorporate more joint inference opportunities. You
have already seen some candidates in earlier slides which we haven't been able
to incorporate in this MLN yet. We definitely want to incorporate discourse, like
coreference -- whether [inaudible] this coreference or [inaudible] ones -- and
those are another special kind of joint inference.
And then finally we may start looking into applications in some commercial
product.
In conclusion, I have presented a system for joint inference for bio-event
extraction, based on Markov logic. The preliminary results are pretty promising,
and because our system benefits from joint inference, it tends to attain higher
precision.
And finally, maybe we can start thinking about some commercial applications,
focusing on the most accurate event types. Thank you. I would be glad to take
questions.
[applause].
>> Hoifung Poon: Yes?
>>: So do you assume that the theme and cause occur in the same sentence as
the event [inaudible] -- are you assuming the theme and the cause of an event
will occur in the same sentence?
>> Hoifung Poon: That's an excellent question. Actually, no. In the shared task
annotation there were a number of cases where the theme was actually stated
in the previous sentence and then referenced later by coreference -- about five
percent of them. Currently our system doesn't address that, and I think none of
the existing systems address it. But that's definitely an opportunity.
You can see something like: they mention a bunch of genes, and then you have
a sentence like 'these genes are found to...', blah, blah, blah, and so forth.
Yes?
>>: I'm wondering if the Markov logic approach would allow you to incorporate
information from the ontologies they have in this domain to get more actual
training data?
>> Hoifung Poon: Yeah, yeah. That's another great point. In fact, we have
considered something along the same lines, but more like using an
unsupervised approach to construct something like clusters and then multiply
the training data.
I haven't quite thought about using the GENIA ontology. The ontology per se
might provide some information, but there is also a question, because none of
those textual expressions are directly related to the ontology. So you have the
ontology available and you also have the text, and then there is a mapping
problem between text and ontology. That's always a problem for formal
ontologies.
But if we can somehow construct a text-based ontology, maybe unsupervised
from the text, then it's much easier to relate them, and we can start multiplying
training examples that way. So that's an excellent point.
>>: So do you actually need to specify all those features before you do the
training?
>> Hoifung Poon: Yeah. Yeah. This is totally supervised learning, and we don't
do any structure learning, so you do need to specify the features.
>>: So how many features are there [inaudible].
>> Hoifung Poon: There are roughly, I would say, between 10 and 15 rules.
One of the advantages of Markov logic is that there is some syntactic [inaudible]
that allows you to specify one rule, and then the system will learn a weight for
each combination of constants.
So for example, you can specify a rule saying: try to learn a weight for each
individual word for which event type it signifies. And then it will learn a weight
for 'activation', for 'regulate', and so forth.
>>: So you have the hard constraints that each event must have at least one
theme. Did you find cases where the theme just wasn't mentioned in the text?
>> Hoifung Poon: That's a great question. This is more like an artificial effect of
how they annotated the shared task. There is another issue that I didn't get into
in the slides: how they annotated the shared-task data actually also creates
some difficulty for the learning and training. In particular, they will only annotate
an event if it has a theme that can be traced down to a protein and is explicitly
stated.
So you could have a sentence saying 'IL-2 regulates something', but if that
something is not a protein, the whole thing is not annotated. That actually
makes the training a lot trickier on our side. Okay. Thank you.
[applause]