>> Bill Dolan: So I'm really thrilled today to welcome Alan Ritter who's visiting from
all the way across the lake, University of Washington, where he's a student with
Oren Etzioni. He'll be graduating this June, I think, with his Ph.D.
Alan is well known to many of you from his two stints here as an intern, first in the
machine learning group back in 2008 working with Sumit Basu, and then again in
2009 in the NLP group here with us.
Alan's work with Sumit in 2008 won Best Student Paper Award. This was working
on interactive machine learning techniques at IUI.
Alan's work is focused -- his thesis work is really focused at the intersection of
Natural Language Processing, social media, machine learning, and he's got some
incredibly cool results to show you, I think, today.
He has a Twitter specific tool kit. He's worked a lot with big raw noisy data like
Twitter trying to extract signal from it, trying to extract structured information that
can be used to enable new applications, and his tool kit has been used by many
people. He's got it distributed on the Web.
I'll turn it over to Alan and let him talk about his exciting research agenda as he
finishes up. Thanks, Alan.
>> Alan Ritter: All right. Yeah. Thanks, Bill. Yeah, I'm really excited to be here
and please feel free to ask questions. Okay.
So the Internet's really changed the way that people communicate, and this has led
to a huge amount of informal text that's available in electronic format.
So this includes things like e-mail, SMS messages, Twitter, and people are also
writing informal text in professional environments, for example, physicians' notes
and military field reports.
Okay. So from my perspective, the really exciting thing about informal text as
compared to formal text, say books or newspaper articles, is that the amount of it
that's written each day is simply much larger.
So just to make this a little bit more concrete, if we're to look specifically at Twitter, if
we take all the tweets that are written in just a single day and print them into books,
we'd have a pile of books that's as high as some of the tallest buildings in the world,
and so clearly, no person can read through all this text, and this is really why we
need some sort of automatic text processing techniques to extract and aggregate
and organize this information.
And so there is actually a lot of important information that shows up first on Twitter.
So one famous example of this is this Twitter user who happened to be in
Abbottabad and live tweeted the raid that killed Osama bin Laden.
And so, of course, this user really didn't know what was going on, but it turns out
that also the first news that Osama bin Laden had been killed showed up on Twitter.
So there is this guy who is a high-ranking official at the Pentagon at the time who
leaked the information on Twitter before it showed up in the press. So people were
talking about it on Twitter first.
Okay. So, of course there's been a ton of previous work on Natural Language
Processing and information extraction which is focused on processing news articles,
and I think this actually makes a lot of sense because historically news has really
been the best source of information on current events.
And current events are a really good application area for information extraction, so if
our goal is just to extract some historical information or kind of encyclopedic
knowledge, it's really difficult to compete with the structured data sources like
Wikipedia and Freebase.
But news is also challenging for NLP applications for other reasons, in part because
it's just already pretty well organized to begin with. It's not that hard to just sit down
with the newspaper and get a good overall view of what's going on.
So in the meantime, social media has become a really important competing source
of information on current events, and so the status messages people are writing on
these social networking Web sites have very different characteristics from traditional
news articles, so they're short, they're easy for anyone to write, and they're instantly
and widely distributed.
So because of these reasons, they often contain fresher information on a wider
variety of topics than news articles.
But of course, this lowering of the barrier to publication is kind of a double-edged
sword, so because these messages are so easy to write, we get a lot of irrelevant
information, people talking about what they ate for breakfast, and there's also a lot
of redundancy, so we get many people all talking about the same thing.
And again, this leads to a situation of information overload and motivates, you
know, why we need automatic text processing techniques to extract and aggregate
information from this big, noisy text dataset.
Okay. So when we look at applying NLP and information extraction techniques to
Twitter, there's a number of challenges that come up. So for instance, there's a
huge amount of lexical variation, so Twitter users are really creative in their use of
spelling and abbreviations.
And just to give an example of this, I ran a distributional word clustering algorithm
on a large corpus of tweets, and so here you can see there's over 50 different ways
that Twitter users can refer to the word "tomorrow."
Okay. So secondly, tweets have really unreliable capitalization.
>>: How do you know to add tomorrow? Do you analyze the text syntax?
>> Alan Ritter: Yeah, you know, that's a good question. So we're basically just
looking at the context the words co-occur in, so they tend to occur in similar
contexts. That's what the distributional clustering algorithm tells us, but I think these look pretty
reasonable, yeah.
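The general idea can be sketched as follows. This is just a minimal illustration of distributional clustering, not the algorithm actually used for the Twitter clusters; the toy corpus, context window, and number of clusters are all made up for the example.

```python
from collections import defaultdict
from sklearn.cluster import KMeans
import numpy as np

# Toy corpus: words that appear in similar contexts tend to land in the same cluster.
corpus = [
    "see you tomorrow at the game",
    "see you tmrw at the game",
    "see you 2morrow at the party",
    "the game was great today",
    "the party was great today",
]

# Build a word-by-context count matrix (context = words within two positions).
vocab = sorted({w for s in corpus for w in s.split()})
index = {w: i for i, w in enumerate(vocab)}
counts = defaultdict(lambda: np.zeros(len(vocab)))
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - 2), min(len(words), i + 3)):
            if j != i:
                counts[w][index[words[j]]] += 1

matrix = np.vstack([counts[w] for w in vocab])
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(matrix)
for cluster in range(4):
    print(cluster, [w for w, l in zip(vocab, labels) if l == cluster])
```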
All right. So in terms of capitalization, you know, users will capitalize words just for
emphasis or they'll often leave the whole message lower cased, and this is pretty
challenging for this named entity recognition task, you know, which is a standard
NLP task and for information extraction, you know, traditionally for news articles, at
least, named entity recognition relies heavily on capitalization, which isn't as reliable
here.
And then finally, tweets tend to have a unique grammar, so users will often drop
personal pronouns, for example, assuming that the subject of the sentence refers to
the speaker, and you just don't see these same kind of sentences in news articles.
Okay. So you might be wondering at this point, you know, how do off-the-shelf NLP
tools do when we apply them to Twitter. So I'm just going to walk through kind of a
standard NLP pipeline here going through this example and show where some
errors come up.
So first of all, the part-of-speech tagger thinks that the word "yes" is a proper noun,
which is a pretty big mistake, and you can kind of see why: it's, you know, capitalized
and it's this funny spelling, so it's probably out of vocabulary.
Then the chunker missegments "Its official Nintendo" as a noun phrase, which is
another big mistake, and the named entity recognizer missegments "America" as
the location, whereas if you look carefully, you'll notice it should have really been
North America.
And so I'm not even highlighting all the mistakes here, but the point is that Twitter
has this noisy and unique style which these tools designed to work on grammatical
text were just never meant to handle.
>>: [inaudible] Do you use this?
>> Alan Ritter: Oh, so this is like a state of the art, like off-the-shelf tagger, so these
are from the UIUC group, actually. Yeah, so we also --
>>: It's not trained [inaudible] --
>> Alan Ritter: No. Yeah, it's trained on these data, yeah, yeah. Right. So of
course, yes. I mean, to deal with this we've rebuilt an NLP pipeline which is trained
on in-domain Twitter data.
And so, you know, the main approach we're taking here is a supervised learning
approach, so I basically just went and annotated a bunch of tweets with
part-of-speech tags, shallow partial chunk tags, named entities and events.
We're also using some semi-supervised techniques, for example, using these
unsupervised word clusters as features, and, you know, I think there's a lot of
interesting work on unsupervised learning for syntax, but we're really trying to take a
practical approach here and get something working on Twitter, so that's why we're
taking this supervised approach.
So I think there's actually a lot of room for interesting work in unsupervised learning
for these more semantic level tasks, and I'm going to talk about these later in the
talk. So we've done some work on, you know, named entity categorization,
classifying the events, and also unsupervised relation extraction.
Okay. So here I'm showing the performance of our NLP tools, the shallow
syntactic annotation tools, compared against off-the-shelf tools which are tuned to
work on newswire text, and you can see that in each case we're doing much better
than the off-the-shelf tools.
We've made these tools available on GitHub, so you can go download them and
use them if you're interested, and we've actually found that there's a relatively large
number of people that are finding them useful.
Okay. So given that we have access to these tools which are tuned to work on this
Twitter data, a natural question is, you know, what can we actually do with them.
So to give one answer to that, I've built a system which is automatically extracting a
calendar of popular events coming up in the near future. So to do that, we're
continuously processing a stream of about 2 million tweets per day, running our
NLP tools on them, and extracting, for example, named entities and events.
In addition, we're also extracting and resolving temporal expressions. So, for
instance, if we see a phrase like "next Friday," we can actually figure out the
calendar day that's referring to.
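As a rough illustration of that resolution step (a toy sketch, not the resolver the system actually uses; the reference date and the "soonest upcoming occurrence" reading of "next" are assumptions):

```python
from datetime import date, timedelta

WEEKDAYS = {"monday": 0, "tuesday": 1, "wednesday": 2, "thursday": 3,
            "friday": 4, "saturday": 5, "sunday": 6}

def resolve_next_weekday(reference: date, weekday_name: str) -> date:
    """Map a phrase like 'next Friday' to a concrete calendar date,
    interpreted as the soonest upcoming occurrence of that weekday."""
    target = WEEKDAYS[weekday_name.lower()]
    days_ahead = (target - reference.weekday()) % 7
    if days_ahead == 0:          # said on a Friday -> the following week
        days_ahead = 7
    return reference + timedelta(days=days_ahead)

# If the tweet was posted on April 2, the talk's "today" (year 2013 assumed):
print(resolve_next_weekday(date(2013, 4, 2), "Friday"))  # 2013-04-05
```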
So then we can just count the number of times that each named entity co-occurs
with the reference to each date and use the statistical tests to determine whether
there's a strong association there and plot the most strongly associated events on a
calendar.
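A minimal sketch of that association test on a 2x2 contingency table, with made-up counts; later in the talk the test is described as a G test, which is what the log-likelihood option below computes.

```python
from scipy.stats import chi2_contingency

# Hypothetical counts from one day of tweets:
# rows: tweet mentions the entity ("FAA") or not
# cols: tweet mentions the date (April 7) or not
table = [[40, 160],        # mentions FAA:      40 also mention the date, 160 don't
         [800, 999000]]    # doesn't mention FAA

# lambda_="log-likelihood" gives the G statistic instead of Pearson's chi-square.
g_stat, p_value, dof, expected = chi2_contingency(table, lambda_="log-likelihood")
print(g_stat, p_value)

# Entity/date pairs with the largest test statistic (the strongest associations)
# are the ones that get plotted on the calendar.
```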
Okay. And so I also want to highlight briefly here there's a number of systems
building issues that come up when we try to do this.
So, you know, because we're processing so much data here, we found that just
using a standard relational database just couldn't keep up with the number of
inserts. Instead, we had to move to using this distributed NoSQL database.
Okay. So I'll go ahead and show a demo of this, and this is available online. I'd
invite you to go check it out.
So today is, you know, April 2nd, so looking ahead to Sunday, you can see there's a
lot of people mentioning something about the FAA. We can click on that to drill
down and get some more detail.
And so here I'm just showing all the tweets that mention FAA in addition to April 7th,
and you can see that they're basically shutting down 173 air traffic control towers
this Sunday. And of course, this is due to the sequestration.
So looking a bit further ahead, you can see that next week on Saturday the 13th,
Mubarak, who is the former president of Egypt, has a retrial scheduled, and then on
the 14th Venezuela has new presidential elections.
And so there's a lot of other stuff on here, as well. I'd invite you to go and kind of
check it out.
>>: All categories or only news?
>> Alan Ritter: Yeah, it's from all the categories. So we do do some language
identification to filter out only the English tweets, but, yeah. So this is kind of open
domain. It's on kind of all different events and sources.
>>: What is the key words said --
>> Alan Ritter: Yeah, right. So this is -- these are these event words that we're
extracting, so we've annotated some data to train an event phrase extractor. So
this is kind of similar to the Timebank corpus, if you're familiar with it. So we're doing
sort of a similar extraction task on Twitter.
>>: How do you rank the topic?
>> Alan Ritter: Yes, that's the statistical test, right? So we basically look at how
frequent is the entity and how frequent is the date and then how often do they
co-occur together, right? So if we just go straight directly by frequency, then, like,
Justin Bieber would be the most frequent thing for every day basically, right?
>>: How did you filter out information like [inaudible], probably more things
like that.
>> Alan Ritter: Yeah, so this kind of comes out naturally from the statistical test, so
there has to be like something that's happening really strongly associated with this
day.
So if people mention what they're eating for breakfast every day, it will -- you'll have
to see it mentioned very frequently to kind of overcome, you know, the baseline
rate.
Okay. All right. So I think this just provides some motivation for why we want to do
information extraction and Natural Language Processing in Twitter, but -- so we
were actually pretty surprised to see that this worked so well. So if you try to do
something similar with newspaper articles, it's actually a pretty difficult task.
So just to kind of explain why that is, for example, in this instance to figure out when
the bomb attack mentioned in the first sentence takes place, we have to recognize
that the blasts mentioned in the second sentence is referring to the same event and
also that the blast takes place on Saturday.
And then to make things even more complicated, we have to further recognize that
this other bomb attack, which is mentioned later in the article, is referring to a totally
separate event which happened on a different day.
So in order to kind of link together all the information in news articles, we have to
solve these discourse level processing tasks which link information together across
sentences, and these are some of the more difficult and, I would say, unsolved
problems in NLP.
There's a lot of interesting research going on in this area, but we don't really know
how to solve these problems in the same way we know how to take an individual
sentence and process it.
So in contrast, tweets tend to have really simple discourse structure. So users on
Twitter sort of say things in really straightforward and compact ways, and to kind
of understand why this is, if you imagine a user writing a message on Twitter, they
typically assume that it's going to get mixed up in the feeds of all their followers and
so they don't assume any context that it's going to be understood in.
In contrast, the sentence in a news article is really meant to be understood within
the discourse context of the article.
Okay. And so the point to take away here is that by working with these short
informal messages on Twitter, we're able to sidestep some of these complicated
discourse issues.
Okay. So given that we can do a pretty good job of extracting open-domain events
from Twitter, a natural question for us was whether we can categorize them into
high-level types, for example, sports events, political events, product releases, and
so on.
And so this would have a number of benefits. Probably the most obvious thing here
is that it would allow users to browse more customized calendars which match their
interests.
So there's a number of challenges that come up when we look at categorizing
events on Twitter, so the main thing is that there's just a huge number of different
types of events that people can talk about, and in advance we're really not even
sure what the right set of event categories is.
So furthermore, the set of important types might actually change over time as new
topics become more or less important,
or if we want to focus in on a specific group of users, there might be a different set
of categories which best describes the data.
Okay. So to address these challenges, we're proposing an unsupervised approach
to event type induction which is based on topic models, and this is actually based
on some work we've done on modeling selectional preferences with topic models.
And so this approach has a number of advantages. It allows us to automatically
discover an appropriate set of event types which match the data. We don't need to
annotate any individual events in context, and we don't need to commit to a specific
set of event types in advance like we'd have to do before annotating data.
Okay. So the way these generative probabilistic models work is we first make up a
story about how our data was generated, which involves hidden variables and
probabilities, and then given a fixed dataset, we're going to apply Bayesian
inference techniques to infer values for the hidden variables, which will then tell us
the category of each event in context.
So I'm just going to walk through a really high-level explanation of the generative
story for our model here, just to kind of make it clear what's going on.
So we'll start out by grouping together all of the tweets which mention the same
event phrase, so in this case all the tweets mention the word "announce."
Okay. And so then we're going to have a set of event types, so each type will have
an associated probability distribution over named entities. So for example, for
product releases, we might see entities like Microsoft, Samsung and iPhone, and
then similarly we'll have a type for, you know, politics and sports.
So then each event phrase will have a distribution over these event types. For example,
you can see that announcement could be part of a product release or a political
event or a sporting event.
And then to generate the named entities in our data, we'll just repeatedly draw types
from this distribution and then generate the named entities in our data based on
the -- you know, from the associated types.
Okay. And so this just -- oh, yeah.
>>: I don't understand what you meant by the event, so you just take the verb, if the
verb matches?
>> Alan Ritter: Yeah, right. So one of the -- part of the NLP tools that I kind of
glossed over a little bit was extracting phrases which refer to events, right?
>>: Okay.
>> Alan Ritter: So we basically annotated data using the same guidelines as like
the Timebank corpus and then so, you know -- right. So it could be like verbs or
nouns can refer to events. You know, you could have sort of like attack or, you
know, attacked, right?
>>: So the event -- but are they prescribed event types or are you clustering to
create the event types or --
>> Alan Ritter: Yeah, we're clustering to create the event types, right.
>>: Okay.
>> Alan Ritter: Yeah. Okay. Anyway, so this just kind of describes the generative
story here. In practice what we'll actually do is to apply Bayesian inference
techniques which will then give us reasonable values for the hidden variables, which
will then tell us what the types are and also tell us the type of each event in context.
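Schematically, the generative story just described can be written as follows. The notation and the hyperparameters are my own shorthand, not taken from the slides.

```latex
\begin{align*}
&\text{for each event type } t: & \phi_t &\sim \mathrm{Dirichlet}(\beta) \;\;\text{(over named entities)}\\
& & \psi_t &\sim \mathrm{Dirichlet}(\gamma) \;\;\text{(over calendar dates)}\\
&\text{for each event phrase } e: & \theta_e &\sim \mathrm{Dirichlet}(\alpha) \;\;\text{(over event types)}\\
&\text{for each tuple associated with } e: & z &\sim \mathrm{Multinomial}(\theta_e)\\
& & \text{entity} &\sim \mathrm{Multinomial}(\phi_z), \quad \text{date} \sim \mathrm{Multinomial}(\psi_z)
\end{align*}
```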
Okay. So we gathered about 65 million of these entity event date tuples, which are
basically the same thing we're showing on the calendar, and for inference we're
using collapsed Gibbs Sampling, which is actually a pretty standard approach to
inference in these types of models.
In practice, we actually use this parallelized approach to Gibbs Sampling, so I
should mention that Gibbs Sampling is really an inherently sequential procedure,
but there is some theory that explains that the parallelization can be understood as
an approximation to the sequential sampling, and we actually found this to work
really well in practice and it lets us scale up to much larger datasets.
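For reference, the per-assignment update in collapsed Gibbs sampling has the usual LDA-like form; this is my notation, and the actual model also multiplies in the analogous factor for the observed date.

```latex
p(z_j = t \mid \mathbf{z}_{-j}) \;\propto\;
\bigl(n^{-j}_{e,t} + \alpha\bigr)\,
\frac{n^{-j}_{t,v} + \beta}{n^{-j}_{t,\cdot} + E\beta}
```

Here e is the event phrase of tuple j, v is its named entity, n_{e,t} counts how many of phrase e's tuples are currently assigned type t, n_{t,v} counts how often entity v is assigned type t, and E is the number of distinct entities. In the parallelized version, each machine applies this update on its own shard and the counts are merged between iterations.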
>>: So by parallelizing, you mean you have different chain of --
>> Alan Ritter: Yeah, yeah, you run a separate chain on each machine and then at
the end of each iteration basically, you know, synchronize the counts.
>>: [inaudible]
>> Alan Ritter: Yeah, that's a good question. So that's basically -- right. So the run
time here is sort of like order the size of the corpus times the number of types. So
this is kind of as many as we could do in a reasonable amount of time. Yeah. So of
course there's, you know, there's work on like nonparametric models that try to find
the right set of types.
I think for these, if we're just trying to find event types, the number of types kind of
doesn't matter that much. If you're looking at something like co-reference
resolution, then getting the right number of types is really important and maybe the
nonparametric approaches are more appropriate in that situation, but, yeah.
>>: Can you go back to this Bayesian network model? I want to see exactly which
state corresponds to what kind of --
>> Alan Ritter: Oh, this part, you mean? Yeah. Right, right. Totally. So this is -- I
kind of gave a little bit of a simplification there. So basically -- right. So here we
have the -- for each type it has a distribution over named entities and then also a
distribution over dates on which events of that type occur.
And then the theta up there is the distribution over types for each event phrase,
right? If that makes sense. So maybe I can use the laser pointer. It will be a little
easier to --
>>: And the clustering that you did, [inaudible] where does it go?
>> Alan Ritter: Yeah, so basically -- right. So these hidden variables here, if we
infer values for these, these will basically tell us the category of each event in
context, right?
And it will also tell us, you know, if we just -- basically in the inference, you know,
we do this collapsing, so we integrate out these parameters, and then also these
parameters, so all we're really doing is inferring values for these, which then you
can kind of read off, you know, what are the --
>>: The observations that you have are the words [indiscernible]?
>> Alan Ritter: Yeah, so the observations are basically the event phrases which
we've -- so we're using a linear chain CRF to extract event phrases and then also to
extract named entities, so this is kind of on top of that, right.
>>: [indiscernible] this one thing [indiscernible].
>> Alan Ritter: Oh, oh, oh. Right, right, right. Yeah, so these are the dates.
>>: [Inaudible]
>> Alan Ritter: Yeah, so this basically -- right. So we also extract and resolve these
temporal expressions, so I kind of -- I'm skipping this in this kind of higher level
story.
But so basically each type of event has a distribution over dates on which events of
that type happen, too, so this kind of helps to group together tweets which are
referring to the same event, which, of course, have the same type.
>>: So the [inaudible] observation really just not the word, it's the [inaudible].
>> Alan Ritter: Exactly, yeah, uh-huh.
>>: So what's the resolution of the date? So is it like day or month or week?
>> Alan Ritter: Yeah, so we just went with just days, right? So you could imagine
someone says like, oh, at 8 o'clock on Thursday I'm going to do this. But for the
Twitter data it seems like most events that people are talking about have just -- they
just give you the specific date on which it happened.
>>: So you mean for each event type you have a distribution or [indiscernible] 365
days?
>> Alan Ritter: Yeah, or just all -- more than 365, so it could be at any time, yeah.
Right. So this is a little bit counterintuitive and we tried doing it with and without
this, but basically the effect that ends up having is grouping together events that
happen on the same day. So maybe on this day there's like a really big, you know,
sports event that's happening and so --
>>: So you showed earlier on all this, you know, tagger, information extraction are
subject to error because of the noise --
>> Alan Ritter: Yes, yes.
>>: So I hear the solution will be that the parameters on those models --
>> Alan Ritter: Yeah.
>>: -- would be part of that model so you can have end-to-end learning to correct
those errors?
>> Alan Ritter: Yes, totally.
>>: You separate them.
>> Alan Ritter: Yeah, I'm doing it separately. I think it would be interesting for
future work to look at a joint model here.
So I think the advantage of this approach is that the inference remains pretty simple
and so because we're working with a lot of data, it lets us scale up to a lot of data. I
mean, I think that it's interesting for future work to try and do an end-to-end joint
approach, yeah.
>>: I just want to make sure I understand. So the underlying event types have been
clustered ahead of time before you've computed -- before you've done this
inference, right? So you do that and then you create the model and you infer the
distributions over the event types, or is the clustering happening
during the --
>> Alan Ritter: Yeah, the clustering is happening during the inference, yeah.
>>: The byproduct of the --
>> Alan Ritter: Yes, yes, that's correct, yeah. So right. So basically we just find for
each named entity sort of like what event type does it have and then at the end we
can kind of read off the clusters from that, right.
>>: But you are limiting it to, like, 100, so there's somewhere in the inference in
the [inaudible] process where it's re-clustering sort of on every -- it's doing like an
iterative re-clustering thing?
>> Alan Ritter: Yeah, yeah, yeah. Right, right. So basically the Gibbs Sampling
procedure, how it works is we basically go through all the data and then sample a
new -- for each hidden variable we just sample it, you know, as we're going through
the data, resample a new value for it and then -- yeah, and do that through the
whole data a couple times, basically, or you know, a thousand times.
Okay. All right. So anyway, so these are some of the event types that are
automatically inferred by our model and I think these look pretty good.
So for example, we have like a sports type here where we see event words like
"tailgate," "scrimmage," "tailgating," "homecoming," and "regular season," and then
entities like ESPN and NCAA, Tigers and Eagles.
Oh, and I should mention, by the way, that these labels are just kind of my
interpretation of what the events are. These aren't automatically generated.
Then we also get a nice TV event here where we see event words like, you know,
"new season," "season finale," "new episode," and then we're seeing some TV
shows like Jersey Shore, True Blood and Glee, and also TV networks like HBO.
Okay. And so we also did an evaluation where we looked at the ability of our model
to actually categorize events in context, and so I manually annotated some data
with the event types which were automatically discovered by our model, and so
here we're comparing against supervised classifier as a baseline, and basically the
point here is that by using large amounts of unlabeled data, we're able to do better
than the supervised baseline.
>>: How did you [inaudible] supervised, because these were the inferred
categories.
>> Alan Ritter: Yes.
>>: So you're trying to make your own interpretation and --
>> Alan Ritter: Yeah, yeah, so basically we run inference in the model. It will give
us -- it will automatically infer some categories, right? And then basically I'll use the
same categories that the model kind of automatically inferred to annotate some
separate data, if that makes sense.
>>: To some degree then this is a test of how well you understood the underlying
categories that were inferred?
>> Alan Ritter: Yeah, that's true. It's a little bit -- I mean, I'll agree that it's a little
bit -- right, it's a little bit weird. I mean, I see your point, but, I mean, I think there's
some advantage to like just automatically finding the right set of categories for the
data.
>>: I agree. I was just wondering [inaudible].
>> Alan Ritter: And I mean, right. Right. So I mean, I think it's kind of -- it is a little
bit odd, I'll say, to say like, oh, we have this unsupervised model that's doing better,
and I think part of the reason why people have, you know, found that
unsupervised -- that supervised models tend to do better is because what the
unsupervised model is finding doesn't really match up with your idea.
So this is kind of like I'm going to, like, let the unsupervised model find something
and then I'll, like, use that to annotate the data with. Yeah. That's basically what's
going on here.
Okay. Right. So this unsupervised approach to information extraction has a
number of advantages. So it lets us scale up to large unlabeled datasets. We don't
need to specify the right event categories in advance.
But I think there's an interesting question here which is, you know, what to do in the
situation where we have access to large amounts of structured data, for example,
from, you know, Freebase or Wikipedia.
And this is actually the case in this next task that I'm going to talk about, which is
named entity categorization. So here it's pretty easy to get large lists of named
entities and their types from these structured data sources.
Okay. So there's a number of challenges that come up in named entity
categorization in Twitter, so there's a huge number of different types of named
entities that people are talking about. They're talking about, you know, bands,
movies, products, and so on.
And many of these are going to be relatively infrequent in the data. So even in a
really large manually annotated dataset, there's going to be few examples of these
infrequent categories. So because of this, I think we can't simply rely on
unsupervised learning alone here.
Okay. So the second thing that's challenging is that tweets are often very terse, so
for example, in this instance it's really hard to know what type of entity KKTNY is
referring to without some additional background information.
Okay. So to address these challenges, we're proposing a weakly supervised
approach to named entity categorization which uses large lists of named entities
and their types gathered from Freebase as a distant source of supervision.
So of course, we can't simply just look up a named entity to figure out its type and
context, so for instance, if we look up "China" in Freebase, we see that it could refer
to either a country, there's also a band called China, there's a number of different
people whose name happens to be China, so we need some way to disambiguate
between these possibilities.
Okay. So to do that, we're proposing a new approach to distant supervision which is
based on constrained topic models, and so like I mentioned or like I kind of alluded
to in the previous slide, just applying distant supervision directly to this task doesn't
work because the training data is just too ambiguous.
So instead, we're proposing a latent variable model for named entity categorization
which uses the Freebase dictionary as constraints in the model.
Okay. And I'll try and make this a little bit more clear on the next slide.
So again here I'm kind of showing the high-level version of the generative story for
our data, so in this case we're grouping together all the tweets which mention the
same named entity, and then each entity type has a distribution over context words
which co-occur with mentions of that entity in context.
Okay. So, right. So the key difference here is that these type distributions for each
entity are constrained based on the Freebase dictionaries. So for instance, if we
look up "JFK" in Freebase, we might see that it could refer to either a person or an
airport, and then we'll constrain its possible distribution over types based on these
possibilities.
Okay. And then like before, we'll repeatedly draw types from this distribution, and
then to generate the context words in our data, we'll pick them from the associated
entity type, type distributions.
Okay. And so again, this is just a description of the generative story. In practice,
we apply Bayesian inference techniques to infer values for the hidden variables
which then tell us the category of each named entity in context.
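A minimal sketch of how a dictionary can act as a constraint during sampling; the dictionary entries, type set, and weights below are all hypothetical, and this only shows the flavor of the constraint, not the actual inference code.

```python
import random

# Hypothetical Freebase-style dictionary: entity string -> set of allowed types.
FREEBASE_TYPES = {
    "JFK": {"PERSON", "FACILITY"},
    "China": {"LOCATION", "BAND", "PERSON"},
}
ALL_TYPES = ["PERSON", "LOCATION", "FACILITY", "BAND", "PRODUCT"]

def sample_type(entity, weights):
    """Sample a type for one entity, but only from the types its dictionary
    entry allows; entities not listed in the dictionary are unconstrained."""
    allowed = FREEBASE_TYPES.get(entity, set(ALL_TYPES))
    masked = [w if t in allowed else 0.0 for t, w in zip(ALL_TYPES, weights)]
    return random.choices(ALL_TYPES, weights=masked, k=1)[0]

# The weights would normally come from the model's counts for this entity's
# contexts; here they are made up.
print(sample_type("JFK", [0.4, 0.1, 0.3, 0.1, 0.1]))   # always PERSON or FACILITY
print(sample_type("KKTNY", [0.1, 0.1, 0.1, 0.1, 0.6])) # unconstrained
```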
>>: So what's your context?
>> Alan Ritter: So basically, right. So -- right. So we just use the previous and
following three words and then we also use the words in the entity as well, right?
Okay. So here I'm showing some example type lists which are automatically
generated by this model, and this is just for three out of the 10 types that we're
working with. And these -- I should mention these are also some of the more
difficult types. So the easy things are like person and location, but I think these
actually look pretty good.
So for example -- well, right. So this is -- basically these are the top 20 entities
which weren't found in any of the Freebase dictionaries. So these are like words
that our model was able to categorize automatically.
So for products we're seeing things like Nintendo DS Lite, Apple iPod. There's
some segmentation errors in here as well, but, you know, iPod, Nano and so on, but
I think what's really cool here is that we're able to correctly categorize some of these
Twitter specific abbreviations which you just wouldn't expect to find in Freebase.
>>: [inaudible]
>> Alan Ritter: Yes, so these are TV shows. I don't know. I hear about all this
stuff. I don't -- I don't actually watch these shows, though. I don't know. That's part
of the fun of working with Twitter data.
Okay. Anyway, so we also looked at how well our model can actually categorize
named entities in context, so I annotated a large corpus of tweets with named
entities and their types.
And here I'm showing our performance compared against a bunch of baselines,
including a supervised baseline, which actually does really well on the more
frequent types like person and location, but does poorly on these infrequent types
where there's few examples in the training data.
We also compared against the co-training approach to weakly supervised named
entity categorization proposed by Collins and Singer, and you can see that we're
actually doing quite a bit better here.
Okay. So why is it that LDA is winning in this case? So I think there's a couple of
reasons for this. So the first is that it's able to share information about an entity's
type across mentions in a really nice way.
So these -- so basically we can, you know, figure out the right type of the entity in
these highly ambiguous cases by looking at the same entity in other contexts.
So the other thing is that, you know, because we're using these Freebase
dictionaries as constraints in the model, we're just better able to take advantage of
this highly ambiguous training data, so we don't have to just rely on these
unambiguous cases to learn how to categorize the entities.
Okay. Oh, yeah?
>>: On the supervised baseline, what did you do for that?
>> Alan Ritter: Yeah, so I annotated about 2400 tweets with named entities and
their types and we basically just used like a maximum entropy classifier.
>>: Is it using the same context, the three words before, three words after?
>> Alan Ritter: Yeah, the same features.
>>: Did you use Freebase then?
>> Alan Ritter: Did it use Freebase?
>>: Yeah, did you take advantage of the --
>> Alan Ritter: Oh.
>>: -- context of anything that appeared on Freebase?
>> Alan Ritter: Yeah. No, I think it didn't look at Freebase at all. So we did also
have a Freebase baseline where we basically look up the entity in Freebase and if
it's unambiguous, you know, make that prediction. And this actually has really high
precision but the recall is pretty low.
Okay. All right. So, yeah. So I just talked about this new approach to distant
supervision based on topic models, constrained topic models which is appropriate
for the situation where you have highly ambiguous training data like this named
entity categorization task.
And so there was a natural question that came up while we were working on this
which is what happens when there's missing information in either the text or the
database.
And so the answer is this leads to errors in the training data, and this is a really
general problem that affects distant supervision both for this weakly supervised
named entity recognition task I talked about and also for the more common
application of distant supervision which is extracting binary relations.
Okay. So for the sake of comparison to previous work, so here we're looking at the
case of binary relations, and so it turns out that most of the work in relation
extraction uses a huge number of features which are highly correlated and
overlapping, and so these generative models that I've been talking about are kind of
not a very good fit for this data because they make really strong independence
assumptions.
So instead, at this point in the talk I'm going to move on and talk about conditionally
trained models.
Okay. So this is kind of like what the setup looks like for the -- for extracting binary
relations using distant supervision. So we'll start out by having a relation, in this
case the born-in relation, and then Freebase is going to give us a large list of people
and the locations where they're born.
Okay. So basically for each pair of entities here we can go and search through a
large text corpus and find all the sentences which mention the entity pair, so
"Barack Obama" and "Honolulu" in this case, and then we can basically just treat
these as positive examples of the born-in relation and extract features from these
sentences and train a supervised classifier.
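As a toy sketch of that training-data construction (the seed facts and sentences below are made up, and a real system would match pre-tagged entity mentions rather than raw substrings):

```python
# Hypothetical seed facts from the knowledge base for the born-in relation.
born_in = {("Barack Obama", "Honolulu"), ("Albert Einstein", "Ulm")}

sentences = [
    "Barack Obama was born in Honolulu , Hawaii .",
    "Barack Obama visited Honolulu last week .",
    "Albert Einstein moved to Princeton .",
]

training_examples = []
for e1, e2 in born_in:
    for s in sentences:
        if e1 in s and e2 in s:
            # Vanilla distant supervision treats every co-occurring sentence as a
            # positive example, including "visited Honolulu last week" -- that is
            # exactly the kind of noise discussed next.
            training_examples.append((s, (e1, e2), "born_in"))

print(len(training_examples))  # 2
```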
Okay. So this is great, but the problem is that, you know, what happens if there's
some information missing from Freebase.
So in this case, all of these sentences now function as negative training examples
for the born-in relation, which I think you can see is a problem, and this is actually a
pretty common scenario.
So there's actually a lot of information missing from these databases, and that's kind
of the whole reason why we want to extract information from text in the first place.
Okay. So before I get into the solution of how we're going to deal with this, I'd like
to just walk through a model for distant supervision in the context of extracting
binary relations. So we'll start out with a pair of entities, "Barack Obama" and
"Honolulu" in this case, and then we're going to get to observe all of the sentences
which mention this entity pair.
Okay. So now we're going to have a classifier which is going to predict for each
sentence what relation it mentions between this pair of entities. So unlike the
standard supervised learning setup, we're not going to actually get to observe these
variables during training. Instead, we only get to see these aggregate level
variables which tell us which relations hold between Barack Obama and Honolulu
in aggregate.
So the question here is how do we relate these aggregate level variables that we
get to observe with the hidden sentence level relation mention variables.
And so one answer to this question is a simple deterministic OR function, and so
basically what this is saying is if any of the sentences mention that Barack Obama
was born in Honolulu, then this fact is true. If none of the sentences mentions it,
then it's false.
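In symbols (my notation): if z_i is the relation expressed by sentence i and y_r says whether relation r holds for the entity pair, the deterministic OR is just

```latex
y_r \;=\;
\begin{cases}
1 & \text{if } \exists\, i \text{ such that } z_i = r\\
0 & \text{otherwise.}
\end{cases}
```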
Okay. And so we can then tune the parameters of this model by just maximizing
the conditional likelihood of the observed facts in Freebase conditioned on the
observed sentences in the data.
Okay. So for learning here, we're taking an approach based on the structured
perceptron, which is an iterative gradient based update to the weights.
In addition, we're taking an online approach to learning, which just means we
update the parameters after seeing each pair of entities rather than going through
all the data and doing batch updates.
Okay. So this is what the gradient looks like. It's just the difference between these
two expectations over the features. And so in practice, these expectations are too
difficult to compute, so instead, we approximate them with maximizations.
So basically what's going on here is we have two inference problems, one where
you want to find the best assignment to the sentence level hidden variables
conditioned on the observed sentences and facts in Freebase, and the other case
we just want to find the best assignment to the sentence level hidden variables
given the current parameters but ignoring Freebase.
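Written out, again in my own notation, the perceptron-style approximation replaces each expectation with an argmax, so a single update uses

```latex
\theta \;\leftarrow\; \theta + \phi(x, z^{*}) - \phi(x, \hat{z}),
\qquad
z^{*} = \arg\max_{z \,\text{consistent with Freebase}} \theta^{\top} \phi(x, z),
\qquad
\hat{z} = \arg\max_{z} \theta^{\top} \phi(x, z)
```

where phi(x, z) collects the aggregate feature counts for a sentence-level assignment z.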
Okay. And so the unconstrained inference problem is totally trivial, but the
constrained problem is a little bit more complicated, but it turns out that it reduces to
this weighted edge cover problem, which we can solve exactly in polynomial time,
so this works out pretty nicely.
Okay. So there's two assumptions that are being made here. So if a fact isn't in
Freebase, we can't extract it from any of the sentences; whereas if a fact is in
Freebase, we have to extract it from at least one sentence.
And so these assumptions are good because they help to drive the learning, but in
the case of missing information in either the text or the database, they lead to errors
in the training data.
So how might we modify this model to more gracefully handle the situation of
missing data? So what we're proposing to do here is to take these aggregate level
hidden variables and split them into two parts, one which represents whether a fact
is mentioned in the text and the other which represents whether it's mentioned in
the database.
And so then these factors between the two variables are going to encourage but not
require that they agree with each other. So now you can see that the facts in
Freebase are acting like soft constraints, whereas before they're like hard
constraints.
So for example, now it's possible to extract a fact that's not in Freebase if the local
classifier is highly confident, but of course, we're going to have to pay some penalty
for doing that.
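One way to write that kind of factor down; the exact parameterization in the actual model may differ, and the two penalty weights here are assumptions.

```latex
\psi(t_r, d_r) \;=\;
\begin{cases}
-\alpha_{\text{text-only}} & \text{if } t_r = 1,\; d_r = 0 \quad\text{(extracted but not in Freebase)}\\
-\alpha_{\text{db-only}}   & \text{if } t_r = 0,\; d_r = 1 \quad\text{(in Freebase but never extracted)}\\
0 & \text{otherwise}
\end{cases}
```

Here t_r is the "mentioned in the text" variable and d_r the "in the database" variable for relation r.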
Okay. So the learning is pretty similar to before. So the only difference here is that
we're now maximizing over these additional aggregate level hidden variables that
we've introduced.
And it turns out this doesn't make any difference for the unconstrained inference
problem, but the constrained inference problem gets a little bit more difficult, so it no
longer reduces to this weighted edge cover problem in a nice way like we had
before.
Okay. So of course, the question here is how are we going to solve this inference
problem. So again, the goal is to find the best assignment to the sentence level
hidden variables conditioned on the observed sentences and facts in Freebase, and
like I mentioned, this is just kind of an optimization problem with soft constraints.
Okay. So basically what we found here -- so right. So we looked at a couple
different approaches. So we looked at some exact inference approaches like A*
which are, you know, time and memory intensive and so don't really scale up to
these really large datasets we're working with.
But we found that a local search almost always finds an optimal solution, so long as
we use a carefully chosen set of search operators that are designed so that it
doesn't get stuck in a local maximum.
And so to verify that we're finding optimal solutions, we can compare it against the
solution found using A*, and we found that in over 100,000 problems from our
actual data, we only missed an optimal solution three times using this approach.
And so this is nice because it's fast and memory efficient and it almost always finds
an optimal solution.
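The shape of that search procedure, stripped of the problem-specific parts: this is a generic hill-climbing skeleton on a toy objective, not the actual operators or scoring function from the system.

```python
def local_search(initial, score, neighbors):
    """Greedy hill climbing: move to any neighbor that improves the score,
    stop when no neighbor does."""
    current = initial
    current_score = score(current)
    improved = True
    while improved:
        improved = False
        for candidate in neighbors(current):
            s = score(candidate)
            if s > current_score:
                current, current_score = candidate, s
                improved = True
                break
    return current, current_score

# Toy example: maximize the number of 1s in a bit vector (a stand-in objective);
# the real search operators would flip the sentence-level hidden variables.
def flip_one(state):
    for i in range(len(state)):
        yield state[:i] + (1 - state[i],) + state[i + 1:]

best, best_score = local_search((0, 0, 1, 0), sum, flip_one)
print(best, best_score)  # (1, 1, 1, 1) 4
```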
Okay. So of course, the real question is how does this affect the performance in
terms of precision and recall, and so the answer is it actually makes a big
difference.
So here I'm showing precision and recall curves on the sentence level extraction
task comparing against human annotations from the data.
And so the red curve here is the system which is using hard constraints. By simply
relaxing those to soft constraints and tuning a couple of parameters in the model,
we're able to get the black curve, which is actually a huge improvement, and then
by incorporating some additional information in the form of a missing data model,
we're able to do even better, which is the green curve here.
And so I think people realize that these distant supervision models are making
some bad assumptions about the data. I mean, all models have to make
assumptions, right? But I don't think they realize there's this much room for
improvement by better modeling the data in this distant supervision problem.
Okay. So I'd like to just pause for a minute here and summarize what I've talked
about so far. So I presented an analysis of the challenges in applying information
extraction to noisy text, I talked about our NLP tools we've adapted to Twitter, and
these are available online. You're welcome to go and use them.
I showed this demonstration of a system I've built which is automatically extracting a
calendar of popular events coming up in the near future.
And then I've presented three different probabilistic models for unsupervised
information extraction, one which is doing unsupervised event categorization.
I also talked about this new model for distant supervision using topic models which
is appropriate for the case of highly ambiguous training data; and then I also
showed this recent work we've been doing on modeling missing data in distant
supervision.
Right. So I'd like to spend just a little bit of time mentioning some other work that
I've been doing during my Ph.D., so I've been collaborating with some folks here at
Microsoft Research, Bill and Colin, who's now at NRC, on modeling conversations
in social media.
So in addition to talking about popular events, users of these social networking Web
sites are having conversations on a really large scale, and I think this opens up all
kinds of new opportunities for data-driven conversation modeling.
So for example, we've done some work on unsupervised modeling of dialogue acts
and also automatically generating responses to Twitter status messages, and so I'll
elaborate just a little bit on this second point.
So the approach we're taking here is based on statistical machine translation. So in
machine translation the task is, given some foreign text, we want to translate it into
English, and in order to learn a model to do this, we have access to large parallel
corpora of paired foreign and English sentences.
So I think in some sense this conversational task is actually kind of similar, so, you
know, given an arbitrary user utterance, we want to generate an appropriate
response to this, and to learn a model to do this, we have access to millions of
naturally occurring conversations from Twitter.
Okay. And so at a high level how phrase-based translation works is, you know,
given an input sentence, we first segment it into phrases, and then we translate
each phrase in the input into a phrase in the response, and so this is a little bit
different than the machine translation case, but, you know, so there's potentially
some reordering here.
And to find a good translation, we want both good translations at the phrase level
and also a high score according to a language model.
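Schematically, in my notation, the decoder is looking for a response r that scores well under both components:

```latex
\mathrm{score}(r \mid s) \;=\; \sum_{i} \log p\bigl(\bar{r}_i \mid \bar{s}_i\bigr) \;+\; \lambda \, \log p_{\mathrm{LM}}(r)
```

where the s-bar phrases are the segments of the status message s, the r-bar phrases are the response phrases they translate to, and p_LM is the language model. Real phrase-based decoders combine more feature functions than this; these are just the two mentioned here.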
>>: Somehow you require certain kind of understanding in order to generate a
response, right, not just like -- [inaudible].
>> Alan Ritter: Yeah. No, that's a good point. Yeah. So conversation and
translation are two very different problems, so --
>>: [inaudible] approach to solve that problem?
>> Alan Ritter: Yeah, right, right. So we can't have kind of very deep intellectual
conversations. Basically we're learning these kind of high frequency response
patterns like, you know, if I see "airport" in the status message, maybe say "safe
flight," you know. Or like "I" translates as "you," or like, you know, "dinner"
translates as "yum."
So these aren't very like deep, you know, kinds of conversations, but, yeah. But,
right, so we have a demo of this available online you can go play around with.
So for example, "I'm feeling sick" translates as "feel better soon," and this is like end
best output, so you can see the other things. [laughter]
And, right. So this is kind of cute and all, but I think there are actually some
interesting applications here. So one might be, you know, conversationally-aware
predictive text input or speech recognition.
So assuming that a user just, you know -- or assuming that your friend just sent you
a text message and you're typing a response to it using some noisy input
mechanism, I think we can actually do a better job of predicting what you're trying to
type by taking the message that you just received into account.
So for instance, if your friend, you know, texted you saying "I'm feeling sick," we
should be able to do a pretty good job of predicting how you might respond to that
without even seeing any input from the user.
>>: Is that just [indiscernible] mentioned some things in the background?
>> Alan Ritter: Yeah, that's a good question, right. So, I mean, the hope here is
that by, you know, combining information, you know, actually generating a
customized response, we can do better than -- or handle sort of a wider range of
different things.
I mean, it's -- you know, I haven't done this experiment, right? So it's hard to say. I
mean, template matching is probably a pretty strong baseline for this for sure.
Yeah.
>>: So here the idea is that you use -- you learn this model and then that can serve
as a language model to interpret those noisy like speech?
>> Alan Ritter: Exactly, yeah, yeah. So you could probably get like the sort of
translation lattice out of this or something and combine that with, you know, lattice
from, you know, speech recognition or something, yeah.
Okay. So then I've also been doing some work recently in collaboration with folks at
New York University and actually also with Bill, so -- but on paraphrasing between
different language styles.
So for instance, you know, we've been looking at paraphrasing Shakespeare's plays
into modern English and also, you know, modern English into kind of a
Shakespearian style.
So the approach we're taking here is to -- we've basically found -- we've scraped a
bunch of modern translations of Shakespeare's plays off the Web which we can
then use as parallel text to build translation models.
And so like, for example, one thing we can do is paraphrase lines from modern
movies into Shakespearian style. So for example, "If you'll not be turned, you'll be
destroyed" gets translated as "If you'll not be turned, you'll be undone." And,
"Father, please help me" is translated as "Father, I pray you, help me."
And so there's a demo of this online as well, and again, these are kind of some fun
examples, but I think there are actual applications here, as well. So one thing would
be educational applications.
So it turns out there's only modern translations for 17 out of the 38 plays that
Shakespeare wrote, so by translating these other plays into modern English, maybe
we can actually make them more available to students, and, of course, there's a ton
of other authors from the same time period as well.
I think looking at paraphrasing between language style and other domains is also
interesting. So for instance, paraphrasing, you know, technical documents into
more easily understandable English or paraphrasing between, you know, noisy
informal text on Twitter and more formal text.
Okay. So for future work, one thing I'm interested in looking at is extracting richer
semantic representations of events from microblogs, so, you know, I think people
have spent a lot of time working on information extraction in news articles but, you
know, there's still a lot -- I think there's still a lot of opportunity to sort of, you know,
do a better job of extracting events from Twitter.
So one problem here is sort of solving this event reference resolution problem,
which is to group together all of the tweets which are referring to the same event.
So we haven't really solved this yet.
So I think the representation that we're currently extracting is nice because it's open
domain, but I think there's sort of more opportunity here to, you know, extract a
richer representation.
And so also related to that is this problem of schema discovery which is, you know,
given these automatically discovered event categories, can we automatically, you
know, extract schemas for them.
So for example, for a concert, you know, we might expect to see like a music artist
that's performing at the concert, also the venue and the city where it's taking place,
and then ideally we'd be able to -- like to automatically extract and fill out these
templates.
And if we can do this unsupervised, then I think we can do it in an open domain way
which isn't restricted to a specific type of an event.
Okay. So a little bit more of a longer term agenda for future work here, I'd like to
look at scaling up grounded language acquisition to more realistic and open-world
domains. So I think we're in a really exciting position right now.
So we have access to all kinds of realtime text in different languages, and in
addition, we have all kinds of realtime sensor data, so for example, realtime data
about the weather, you know, traffic, you know, seismographic data.
And so I think there's an interesting question here of whether we can link events
that people are talking about in text with, you know, events that are showing up in
sensor data, since they're kind of temporally correlated with each other, to give us
some kind of signal there.
Okay. So one possible approach to this would be to extend some of the latent
variable models that I've been talking about to incorporate both realtime text and
sensor data, and maybe by doing this we can ground the meaning of these
distributional semantic representations that we've been learning in real-world sensor
data at scale.
Okay. So I'd like to wrap up and just thank my collaborators, and thanks for coming
to my talk. [applause]
>>: [indiscernible] So I want to know, so this is a pipeline, right? So [indiscernible]
but the problem [indiscernible] they have a lot of duplication.
>> Alan Ritter: Right.
>>: So basically how important is it to have -- improve your [indiscernible].
>> Alan Ritter: Yeah, I know, that's some good points you're making there. Right.
So I think -- right. So in the calendar at least we are kind of exploiting a little bit this
effect of redundancy, so if we see lots of people saying the same thing in different
ways, that helps to improve the precision rate.
Yeah, so right. So definitely I think there's room for improving the performance of
part-of-speech tagging and named entity recognition, so, you know, one thing is just
annotating more data, like you're saying, some kind of joint model that looks across
different tweets, right, is also really interesting.
I think, yeah, you guys have been doing some work in this direction, too, which I
think is great. But, yeah, right.
I think -- I guess my feeling is that probably the performance on these shallow
syntactic annotation tasks like part-of-speech tagging and named entity recognition
is going to always be lower on Twitter than what we see in news articles just
because it's so diverse and noisy, right?
So it's kind of more challenging from that aspect, but I think once we get past these
noisy text issues, there's actually other things that become easier, right? Because it
has this really simple discourse structure like I mentioned.
So it's kind of just this interesting domain with different characteristics than what
people have mostly focused on, you know.
>>: In the work about conversations, it looked like you were focusing on sort of the
stimulus response pair. Conversations can go on sort of longer. Did you look at --
do a sort of modeling sequence [inaudible]?
>> Alan Ritter: Yeah, so we did in a little bit of a different context, so we also had
some work on unsupervised induction of dialogue acts, so these would be things
like, you know, trying to classify each post, like is it a question or an answer or a
status post and things like that. So, yeah, so there we're looking at the sequence in
kind of longer conversations.
Yeah, for the response generation task we just picked like the first message in a
conversation and then the response to that, just because that kind of constrains the
problem a little bit, right? But, yeah, doing that in the context of longer
conversations I think is interesting, but we haven't looked at that yet, yeah.
>>: I'd like to hear a bit about establishing a baseline using the [indiscernible]
practice example. How do you set a baseline across these high frequency, low
frequency, you know, entities?
>> Alan Ritter: Yeah, so we basically just count in our corpus, like for each entity
how many times has it been mentioned, you know, as long as, you know, we have
data for it basically, right?
So we basically count the number of times each entity is mentioned and the number
of times each date is mentioned, and then we can look at the number of tweets that
mention both of them and then just apply a standard statistical test, like, you know,
chi square.
We use like a G test. I mean, ideally what you do is Fisher's exact test, but with the
amount of data we're working with, it sort of -- it's -- you know, there's some like
floating point overflow I guess basically that happens, but --
>>: Twitter, how frequently do you find you have to really set that baseline since
there's an emerging trend [inaudible]?
>> Alan Ritter: That's a good question, yeah. So we haven't really -- I've just kept
the same -- I haven't reset it, but you probably could. It might actually be worth
doing. I haven't really looked into that, though. Good question.
>>: So have you looked at the -- what kind of errors that you have seen in the
[inaudible] processing, how do these errors affect the final output?
>> Alan Ritter: Oh, for the calendar, you mean?
>>: Yeah, for the calendar.
>> Alan Ritter: Yeah, yeah. So there's -- I mean, there's a lot of segmentation
errors, to be totally honest. So like movie names specifically are really hard
because they're often sort of short phrases, right?
So often you'll see -- I mean, gosh. I'm having a hard time. Like, you know how
you have sort of "Dumb and Dumber" or something as a movie title, right? You
might just get like "dumb" as an entity, right? Instead of, you know -- I mean,
they're really hard to distinguish.
>>: To what extent do these errors affect the final calendar output?
>> Alan Ritter: Yeah, I mean, so basically in that case, for example, you would see
just like, I mean, just the name that you're displaying would be incorrect, but still,
you can kind of click on it and drill down and kind of see what happened.
Yeah, I mean, I should -- you know, the calendar application, I think there's a lot of
-- you know, I could have spent a lot more time engineering this and making it
better. It's just kind of a -- there certainly are some errors there for sure. Yeah.
>>: I'm curious. So Twitter's a domain where people do some amount of manual
tagging with hash tags and things like that?
>> Alan Ritter: Right.
>>: And on the one hand, you could treat those as just another word which
happens to, you know, cross a lot of things, but maybe you want to treat them
specially because there's some, you know, tech, user tech going into picking the
same thing, and I'm wondering what your thoughts are.
>> Alan Ritter: Yeah, no. That's a great question. So I haven't -- I mean, I've just
treated them just like another word so far, but I think there is some -- definitely
something to get out of them for sure. So I mean, you know, some of them are -- right.
So in some cases they're really useful and they sort of, you know, really give you an
anchor on the event, and in some cases you see things like, hash, "I like bacon" or
something like that, right? But so they're -- I don't know.
I mean, I think there's definitely something interesting to be done with the hash tags.
I haven't really figured out what it is yet, though. I mean, I've seen some interesting
work on trying to segment them into words and things like that, right?
But, yeah. I mean, definitely like for conferences and things, you know, it will sort of
like give you a nice focus group of all the tweets on the specific event. I think -- in
some sense I think they're almost kind of more useful for people just to --
>>: Maybe [inaudible] because sometimes you're trying to extract like a human
readable label for a group, for instance, so distribution amongst hash tags, even if
they're not perfect coverage, might be helpful to people because they're kind of
engineered by people.
>> Alan Ritter: Right.
>>: There are other ways in which --
>> Alan Ritter: Or even if you just take all the hash tags, you know, if I tell you I
want to know about this particular hash tag and then just get all the tweets that
mention that and then, you know, one interesting question is sort of how can you
summarize all the information that people are talking about out there in a short,
easily readable way kind of, you know?
>>: Or if they could be used as supervised labels to kind of like what you see, you
get.
>> Alan Ritter: Right. Find other tweets talking about the same event that aren't
with the hash tag, yeah. It's a good point. [applause]