>> Bill Dolan: So I'm really thrilled today to welcome Alan Ritter who's visiting from all the way across the lake, University of Washington, where he's a student with Oren Etzioni. He'll be graduating this June, I think, with his Ph.D. Alan is well known to many of you from his two stints here as an intern, first in the machine learning group back in 2008 working with Sumit Basu, and then again in 2009 in the NLP group here with us. Alan's work with Sumit in 2008 won the Best Student Paper Award at IUI; this was work on interactive machine learning techniques. Alan's thesis work is really focused at the intersection of Natural Language Processing, social media, and machine learning, and he's got some incredibly cool results to show you, I think, today. He has a Twitter-specific toolkit. He's worked a lot with big raw noisy data like Twitter, trying to extract signal from it, trying to extract structured information that can be used to enable new applications, and his toolkit has been used by many people. He's got it distributed on the Web. I'll turn it over to Alan and let him talk about his exciting research agenda as he finishes up. Thanks, Alan. >> Alan Ritter: All right. Yeah. Thanks, Bill. Yeah, I'm really excited to be here, and please feel free to ask questions. Okay. So the Internet's really changed the way that people communicate, and this has led to a huge amount of informal text that's available in electronic format. So this includes things like e-mail, SMS messages, Twitter, and people are also writing informal text in professional environments, for example, physicians' notes and military field reports. Okay. So from my perspective, the really exciting thing about informal text as compared to formal text, say books or newspaper articles, is that the amount of it that's written each day is simply much larger. So just to make this a little bit more concrete, if we look specifically at Twitter, if we take all the tweets that are written in just a single day and print them into books, we'd have a pile of books that's as high as some of the tallest buildings in the world, and so clearly, no person can read through all this text, and this is really why we need some sort of automatic text processing techniques to extract and aggregate and organize this information. And so there is actually a lot of important information that shows up first on Twitter. So one famous example of this is the Twitter user who happened to be in Abbottabad and live tweeted the raid that killed Osama bin Laden. And so, of course, this user really didn't know what was going on, but it turns out that the first news that Osama bin Laden had been killed also showed up on Twitter. So there is this guy who was a high-ranking official at the Pentagon at the time who leaked the information on Twitter before it showed up in the press. So people were talking about it on Twitter first. Okay. So, of course there's been a ton of previous work on Natural Language Processing and information extraction which is focused on processing news articles, and I think this actually makes a lot of sense because historically news has really been the best source of information on current events. And current events are a really good application area for information extraction, so if our goal is just to extract some historical information or kind of encyclopedic knowledge, it's really difficult to compete with structured data sources like Wikipedia and Freebase. 
But news is also challenging for NLP applications for other reasons, in part because it's just already pretty well organized to begin with. It's not that hard to just sit down with the newspaper and get a good overall view of what's going on. So in the meantime, social media has become a really important competing source of information on current events, and the status messages people are writing on these social networking Web sites have very different characteristics from traditional news articles: they're short, they're easy for anyone to write, and they're instantly and widely distributed. So because of these reasons, they often contain fresher information on a wider variety of topics than news articles. But of course, this lowering of the barrier to publication is kind of a double-edged sword, so because these messages are so easy to write, we get a lot of irrelevant information, people talking about what they ate for breakfast, and there's also a lot of redundancy, so we get many people all talking about the same thing. And again, this leads to a situation of information overload and motivates, you know, why we need automatic text processing techniques to extract and aggregate information from this big, noisy text dataset. Okay. So when we look at applying NLP and information extraction techniques to Twitter, there's a number of challenges that come up. So for instance, there's a huge amount of lexical variation, so Twitter users are really creative in their use of spelling and abbreviations. And just to give an example of this, I ran a distributional word clustering algorithm on a large corpus of tweets, and here you can see there's over 50 different ways that Twitter users refer to the word "tomorrow." Okay. So secondly, tweets have really unreliable capitalization. >>: How do you know to add tomorrow? Do you analyze the text syntax? >> Alan Ritter: Yeah, you know, that's a good question. So we're basically just looking at the context the words co-occur in, so they tend to occur in similar contexts. That's what the distributional clustering tells us, but I think these look pretty reasonable, yeah. All right. So in terms of capitalization, you know, users will capitalize words just for emphasis or they'll often leave the whole message lowercased, and this is pretty challenging for the named entity recognition task, which is a standard NLP task and important for information extraction; traditionally, for news articles at least, named entity recognition relies heavily on capitalization, which isn't as reliable here. And then finally, tweets tend to have a unique grammar, so users will often drop personal pronouns, for example, assuming that the subject of the sentence refers to the speaker, and you just don't see these same kinds of sentences in news articles. Okay. So you might be wondering at this point, you know, how do off-the-shelf NLP tools do when we apply them to Twitter. So I'm just going to walk through kind of a standard NLP pipeline here, going through this example and showing where some errors come up. So first of all, the part-of-speech tagger thinks that the word "yes" is a proper noun, which is a pretty big mistake, and you can kind of see why: it's, you know, capitalized and it's this funny spelling, so it's probably out of vocabulary. 
Then the chunker missegments "Its official Nintendo" as a noun phrase, which is another big mistake, and the named entity recognizer missegments "America" as a location, whereas if you look carefully, you'll notice it should really have been North America. And so I'm not even highlighting all the mistakes here, but the point is that Twitter has this noisy and unique style which these tools, designed to work on grammatical text, were just never meant to handle. >>: [inaudible] Do you use this? >> Alan Ritter: Oh, so this is like a state-of-the-art, off-the-shelf tagger, so these are from the UIUC group, actually. Yeah, so we also -- >>: It's not trained [inaudible] -- >> Alan Ritter: No. Yeah, it's trained on these data, yeah, yeah. Right. So of course, yes. I mean, to deal with this we've rebuilt an NLP pipeline which is trained on in-domain Twitter data. And so, you know, the main approach we're taking here is a supervised learning approach, so I basically just went and annotated a bunch of tweets with part-of-speech tags, shallow parse (chunk) tags, named entities and events. We're also using some semi-supervised techniques, for example, using these unsupervised word clusters as features, and, you know, I think there's a lot of interesting work on unsupervised learning for syntax, but we're really trying to take a practical approach here and get something working on Twitter, so that's why we're taking this supervised approach. I think there's actually a lot of room for interesting work in unsupervised learning for the more semantic level tasks, and I'm going to talk about these later in the talk. So we've done some work on, you know, named entity categorization, classifying the events, and also unsupervised relation extraction. Okay. So here I'm showing the performance of our shallow syntactic annotation tools compared against off-the-shelf tools which are tuned to work on newswire text, and you can see that in each case we're doing much better than the off-the-shelf tools. We've made these tools available on GitHub, so you can go download them and use them if you're interested, and we've actually found that there's a relatively large number of people that are finding them useful. Okay. So given that we have access to these tools which are tuned to work on this Twitter data, a natural question is, you know, what can we actually do with them. So to give one answer to that, I've built a system which is automatically extracting a calendar of popular events coming up in the near future. So to do that, we're continuously processing a stream of about 2 million tweets per day, running our NLP tools on them, and extracting, for example, named entities and events. In addition, we're also extracting and resolving temporal expressions. So, for instance, if we see a phrase like "next Friday," we can actually figure out the calendar day that it's referring to. So then we can just count the number of times that each named entity co-occurs with a reference to each date, use a statistical test to determine whether there's a strong association there, and plot the most strongly associated events on a calendar. Okay. And so I also want to highlight briefly here that there are a number of systems-building issues that come up when we try to do this. So, you know, because we're processing so much data here, we found that a standard relational database just couldn't keep up with the number of inserts. Instead, we had to move to using a distributed NoSQL database. Okay. 
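To make the counting and scoring step concrete, here is a minimal sketch of it, assuming the NLP pipeline has already produced (named entity, resolved calendar date) pairs. The G statistic below stands in for whatever statistical test the deployed system actually uses, the "next Friday" resolution is just one simple reading of that phrase, and all names and numbers are illustrative rather than taken from the real system:

```python
from collections import Counter
from datetime import date, timedelta
from math import log

def resolve_next_weekday(tweet_date, weekday):
    """One simple reading of 'next <weekday>': the next future occurrence of that weekday."""
    days_ahead = (weekday - tweet_date.weekday()) % 7 or 7
    return tweet_date + timedelta(days=days_ahead)

def g_statistic(both, entity_total, date_total, n):
    """2x2 log-likelihood-ratio (G) statistic for association between an entity and a date."""
    cells = [
        (both, entity_total * date_total / n),
        (entity_total - both, entity_total * (n - date_total) / n),
        (date_total - both, (n - entity_total) * date_total / n),
        (n - entity_total - date_total + both, (n - entity_total) * (n - date_total) / n),
    ]
    return 2.0 * sum(o * log(o / e) for o, e in cells if o > 0 and e > 0)

# Toy extractions: (named entity, resolved calendar date) pairs produced by the pipeline.
extractions = [
    ("FAA", date(2013, 4, 7)), ("FAA", date(2013, 4, 7)),
    ("Mubarak", date(2013, 4, 13)), ("Justin Bieber", date(2013, 4, 7)),
]
pair_counts = Counter(extractions)
entity_counts = Counter(e for e, _ in extractions)
date_counts = Counter(d for _, d in extractions)
n = len(extractions)

# Rank candidate calendar entries for one day by association strength, not raw frequency.
day = date(2013, 4, 7)
ranked = sorted(
    ((g_statistic(pair_counts[(e, day)], entity_counts[e], date_counts[day], n), e)
     for e in entity_counts if pair_counts[(e, day)] > 0),
    reverse=True,
)
print(resolve_next_weekday(date(2013, 4, 2), 4))  # "next Friday" relative to April 2nd
print(ranked)
```

Ranking by the association statistic rather than raw counts is what keeps a perennially frequent entity from dominating every single day on the calendar.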
So I'll go ahead and show a demo of this, and this is available online. I'd invite you to go check it out. So today is, you know, April 2nd, so looking ahead to Sunday, you can see there's a lot of people mentioning something about the FAA. We can click on that to drill down and get some more detail. And so here I'm just showing all the tweets that mention FAA in addition to April 7th, and you can see that they're basically shutting down 173 air traffic control towers this Sunday. And of course, this is due to the sequestration. So looking a bit further ahead, you can see that next week on Saturday the 13th, Mubarak, who is the former president of Egypt, has a retrial scheduled, and then on the 14th Venezuela has new presidential elections. And so there's a lot of other stuff on here as well that I'd invite you to go and check out. >>: All categories, or only news? >> Alan Ritter: Yeah, it's from all the categories. So we do do some language identification to filter out only the English tweets, but, yeah. So this is kind of open domain. It's on kind of all different events and sources. >>: What are the key words said -- >> Alan Ritter: Yeah, right. So these are the event words that we're extracting, so we've annotated some data to train an event phrase extractor. This is kind of similar to the TimeBank corpus, if you're familiar with it. So we're doing a similar extraction task on Twitter. >>: How do you rank the topics? >> Alan Ritter: Yes, that's the statistical test, right? So we basically look at how frequent the entity is and how frequent the date is and then how often they co-occur together, right? So if we just went directly by frequency, then, like, Justin Bieber would be the most frequent thing for every day, basically, right? >>: How did you filter out information like [inaudible], probably more things like that? >> Alan Ritter: Yeah, so this kind of comes out naturally from the statistical test, so there has to be something that's happening that's really strongly associated with this day. So if people mention what they're eating for breakfast every day, you'd have to see it mentioned very frequently to kind of overcome, you know, the baseline rate. Okay. All right. So I think this just provides some motivation for why we want to do information extraction and Natural Language Processing on Twitter, but -- so we were actually pretty surprised to see that this worked so well. If you try to do something similar with newspaper articles, it's actually a pretty difficult task. So just to kind of explain why that is, for example, in this instance, to figure out when the bomb attack mentioned in the first sentence takes place, we have to recognize that the blasts mentioned in the second sentence are referring to the same event and also that the blast takes place on Saturday. And then to make things even more complicated, we have to further recognize that this other bomb attack, which is mentioned later in the article, is referring to a totally separate event which happened on a different day. So in order to kind of link together all the information in news articles, we have to solve these discourse level processing tasks which link information together across sentences, and these are some of the more difficult and, I would say, unsolved problems in NLP. 
There's a lot of interesting research going on in this area, but we don't really know how to solve these problems in the same way we know how to take an individual sentence and process it. In contrast, tweets tend to have really simple discourse structure. So users on Twitter sort of say things in really straightforward and compact ways, and to kind of understand why this is, if you imagine a user writing a message on Twitter, they typically assume that it's going to get mixed up in the feeds of all their followers, and so they don't assume any context that it's going to be understood in. In contrast, a sentence in a news article is really meant to be understood within the discourse context of the article. Okay. And so the point to take away here is that by working with these short informal messages on Twitter, we're able to sidestep some of these complicated discourse issues. Okay. So given that we can do a pretty good job of extracting open-domain events from Twitter, a natural question for us was whether we can categorize them into high-level types, for example, sports events, political events, product releases, and so on. And this would have a number of benefits. Probably the most obvious thing here is that it would allow users to browse more customized calendars which match their interests. So there's a number of challenges that come up when we look at categorizing events on Twitter. The main thing is that there's just a huge number of different types of events that people can talk about, and in advance we're really not even sure what the right set of event categories is. Furthermore, the set of important types might actually change over time as new topics become more or less important, or if we want to focus in on a specific group of users, there might be a different set of categories which best describes the data. Okay. So to address these challenges, we're proposing an unsupervised approach to event type induction which is based on topic models, and this is actually based on some work we've done on modeling selectional preferences with topic models. And so this approach has a number of advantages. It allows us to automatically discover an appropriate set of event types which match the data. We don't need to annotate any individual events in context, and we don't need to commit to a specific set of event types in advance like we'd have to do before annotating data. Okay. So the way these generative probabilistic models work is we first make up a story about how our data was generated, which involves hidden variables and probabilities, and then, given a fixed dataset, we apply Bayesian inference techniques to infer values for the hidden variables, which will then tell us the category of each event in context. So I'm just going to walk through a really high-level explanation of the generative story for our model here, just to kind of make it clear what's going on. So we'll start out by grouping together all of the tweets which mention the same event phrase, so in this case all the tweets mention the word "announce." Okay. And so then we're going to have a set of event types, so each type will have an associated probability distribution over named entities. So for example, for product releases, we might see entities like Microsoft, Samsung and iPhone, and then similarly we'll have a type for, you know, politics and sports. So then each event phrase will have a distribution over these event types. 
For example, you can see that "announce" could be part of a product release or a political event or a sporting event. And then to generate the named entities in our data, we'll just repeatedly draw types from this distribution and then generate the named entities in our data from the associated types. Okay. And so this just -- oh, yeah. >>: I don't understand what you meant by the event, so you just take the verb, if the verb matches? >> Alan Ritter: Yeah, right. So one of the -- part of the NLP tools that I kind of glossed over a little bit was extracting phrases which refer to events, right? >>: Okay. >> Alan Ritter: So we basically annotated data using the same guidelines as the TimeBank corpus, and so, you know -- right. So either verbs or nouns can refer to events. You know, you could have sort of like "attack" or, you know, "attacked," right? >>: So the event -- but are they prescribed event types or are you clustering to create the event types or -- >> Alan Ritter: Yeah, we're clustering to create the event types, right. >>: Okay. >> Alan Ritter: Yeah. Okay. Anyway, so this just kind of describes the generative story here. In practice what we'll actually do is apply Bayesian inference techniques which will give us reasonable values for the hidden variables, which will then tell us what the types are and also tell us the type of each event in context. Okay. So we gathered about 65 million of these entity/event/date tuples, which are basically the same thing we're showing on the calendar, and for inference we're using collapsed Gibbs sampling, which is actually a pretty standard approach to inference in these types of models. In practice, we actually use a parallelized approach to Gibbs sampling, so I should mention that Gibbs sampling is really an inherently sequential procedure, but there is some theory that explains that the parallelization can be understood as an approximation to the sequential sampling, and we actually found this to work really well in practice, and it lets us scale up to much larger datasets. >>: So by parallelizing, you mean you have a different chain of -- >> Alan Ritter: Yeah, yeah, you run a separate chain on each machine and then at the end of each iteration you basically, you know, synchronize the counts. >>: [inaudible] >> Alan Ritter: Yeah, that's a good question. So that's basically -- right. So the run time here is sort of like order the size of the corpus times the number of types. So this is kind of as many as we could do in a reasonable amount of time. Yeah. So of course there's, you know, there's work on like nonparametric models that try to find the right set of types. I think if we're just trying to find event types, the number of types kind of doesn't matter that much. If you're looking at something like co-reference resolution, then getting the right number of types is really important and maybe the nonparametric approaches are more appropriate in that situation, but, yeah. >>: Can you go back to this Bayesian network model? I want to see exactly which state corresponds to what kind of -- >> Alan Ritter: Oh, this part, you mean? Yeah. Right, right. Totally. So this is -- I kind of gave a little bit of a simplification there. So basically -- right. So here we have the -- for each type it has a distribution over named entities and then also a distribution over dates on which events of that type occur. And then the theta up there is the distribution over types for each event phrase, right? 
If that makes sense. So maybe I can use the laser pointer. It will be a little easier to -- >>: And the clustering that you did, [inaudible] where does it go? >> Alan Ritter: Yeah, so basically -- right. So these hidden variables here, if we infer values for these, they will basically tell us the category of each event in context, right? And it will also tell us, you know, if we just -- basically in the inference, you know, we do this collapsing, so we integrate out these parameters, and then also these parameters, so all we're really doing is inferring values for these, which then you can kind of read off, you know, what are the -- >>: The observations that you have are the words [indiscernible]. >> Alan Ritter: Yeah, so the observations are basically the event phrases which we've -- so we're using a linear chain CRF to extract event phrases and then also to extract named entities, so this is kind of on top of that, right. >>: [indiscernible] this one thing [indiscernible]. >> Alan Ritter: Oh, oh, oh. Right, right, right. Yeah, so these are the dates. >>: [inaudible] >> Alan Ritter: Yeah, so this basically -- right. So we also extract and resolve these temporal expressions, so I kind of -- I'm skipping this in this kind of higher level story. But basically each type of event has a distribution over dates on which events of that type happen, too, so this kind of helps to group together tweets which are referring to the same event, which, of course, have the same type. >>: So the [inaudible] observation really is not just the word, it's the [inaudible]. >> Alan Ritter: Exactly, yeah, uh-huh. >>: So what's the resolution of the date? Is it like day or month or week? >> Alan Ritter: Yeah, so we just went with days, right? So you could imagine someone says, like, oh, at 8 o'clock on Thursday I'm going to do this. But for the Twitter data it seems like for most events that people are talking about, they just give you the specific date on which it happened. >>: So you mean for each event type you have a distribution over [indiscernible] 365 days? >> Alan Ritter: Yeah, or just all -- more than 365, so it could be at any time, yeah. Right. So this is a little bit counterintuitive, and we tried doing it with and without this, but basically the effect it ends up having is grouping together events that happen on the same day. So maybe on this day there's like a really big, you know, sports event that's happening and so -- >>: So you showed earlier on that all these, you know, taggers and information extraction are subject to error because of the noise -- >> Alan Ritter: Yes, yes. >>: So I hear the solution will be that the parameters of those models -- >> Alan Ritter: Yeah. >>: -- would be part of that model so you can have end-to-end learning to correct those errors? >> Alan Ritter: Yes, totally. >>: You separate them. >> Alan Ritter: Yeah, I'm doing it separately. I think it would be interesting for future work to look at a joint model here. So I think the advantage of this approach is that the inference remains pretty simple, and so it lets us scale up to the large amounts of data we're working with. I mean, I think it's interesting for future work to try and do an end-to-end joint approach, yeah. >>: I just want to make sure I understand. So the underlying event types have been clustered ahead of time before you've computed -- before you've done this inference, right? 
So you do that and then you create the model and you infer the distributions over the event types, or is the clustering happening during the -- >> Alan Ritter: Yeah, the clustering is happening during the inference, yeah. >>: As a byproduct of the -- >> Alan Ritter: Yes, yes, that's correct, yeah. So right. So basically we just find for each named entity sort of what event type it has, and then at the end we can kind of read off the clusters from that, right. >>: But you are limiting it to, like, 100, so somewhere in the inference, in the [inaudible] process, it's re-clustering sort of on every -- it's doing like an iterative re-clustering thing? >> Alan Ritter: Yeah, yeah, yeah. Right, right. So basically how the Gibbs sampling procedure works is we go through all the data and, for each hidden variable, resample a new value for it as we're going through the data, and then do that through the whole dataset a couple of times, basically, or, you know, a thousand times. Okay. All right. So anyway, these are some of the event types that are automatically inferred by our model, and I think they look pretty good. So for example, we have a sports type here where we see event words like "tailgate," "scrimmage," "tailgating," "homecoming," and "regular season," and then entities like ESPN and NCAA, Tigers and Eagles. Oh, and I should mention, by the way, that these labels are just kind of my interpretation of what the events are. These aren't automatically generated. Then we also get a nice TV event here where we see event words like, you know, "new season," "season finale," "new episode," and then we're seeing some TV shows like Jersey Shore, True Blood and Glee, and also TV networks like HBO. Okay. And so we also did an evaluation where we looked at the ability of our model to actually categorize events in context, so I manually annotated some data with the event types which were automatically discovered by our model, and here we're comparing against a supervised classifier as a baseline, and basically the point is that by using large amounts of unlabeled data, we're able to do better than the supervised baseline. >>: How did you [inaudible] supervised, because these were the inferred categories. >> Alan Ritter: Yes. >>: So you're trying to make your own interpretation and -- >> Alan Ritter: Yeah, yeah, so basically we run inference in the model. It will automatically infer some categories, right? And then basically I'll use the same categories that the model automatically inferred to annotate some separate data, if that makes sense. >>: To some degree then this is a test of how well you understood the underlying categories that were inferred? >> Alan Ritter: Yeah, that's true. It's a little bit -- I mean, I'll agree that it's a little bit -- right, it's a little bit weird. I mean, I see your point, but, I mean, I think there's some advantage to just automatically finding the right set of categories for the data. >>: I agree. I was just wondering [inaudible]. >> Alan Ritter: And I mean, right. Right. 
So I mean, I think it's kind of -- it is a little bit odd, I'll say, to say, oh, we have this unsupervised model that's doing better, and I think part of the reason why people have found that supervised models tend to do better is because what the unsupervised model finds doesn't really match up with your idea of the categories. So this is kind of like I'm going to let the unsupervised model find something and then I'll use that to annotate the data with. Yeah. That's basically what's going on here. Okay. Right. So this unsupervised approach to information extraction has a number of advantages. It lets us scale up to large unlabeled datasets. We don't need to specify the right event categories in advance. But I think there's an interesting question here, which is, you know, what to do in the situation where we have access to large amounts of structured data, for example, from, you know, Freebase or Wikipedia. And this is actually the case in the next task that I'm going to talk about, which is named entity categorization. So here it's pretty easy to get large lists of named entities and their types from these structured data sources. Okay. So there's a number of challenges that come up in named entity categorization in Twitter, so there's a huge number of different types of named entities that people are talking about. They're talking about, you know, bands, movies, products, and so on. And many of these are going to be relatively infrequent in the data. So even in a really large manually annotated dataset, there's going to be few examples of these infrequent categories. Because of this, I think we can't simply rely on unsupervised learning alone here. Okay. So the second thing that's challenging is that tweets are often very terse, so for example, in this instance it's really hard to know what type of entity KKTNY is referring to without some additional background information. Okay. So to address these challenges, we're proposing a weakly supervised approach to named entity categorization which uses large lists of named entities and their types gathered from Freebase as a distant source of supervision. So of course, we can't simply look up a named entity to figure out its type in context; for instance, if we look up "China" in Freebase, we see that it could refer to a country, there's also a band called China, and there's a number of different people whose name happens to be China, so we need some way to disambiguate between these possibilities. Okay. So to do that, we're proposing a new approach to distant supervision which is based on constrained topic models, and like I mentioned, or kind of alluded to, in the previous slide, just applying distant supervision directly to this task doesn't work because the training data is just too ambiguous. So instead, we're proposing a latent variable model for named entity categorization which uses the Freebase dictionary as constraints in the model. Okay. And I'll try and make this a little bit more clear on the next slide. So again, here I'm kind of showing the high-level version of the generative story for our data, so in this case we're grouping together all the tweets which mention the same named entity, and then each entity type has a distribution over context words which co-occur with mentions of that entity in context. Okay. So, right. So the key difference here is that the type distributions for each entity are constrained based on the Freebase dictionaries. 
So for instance, if we look up "JFK" in Freebase, we might see that it could refer to either a person or an airport, and then we'll constrain its possible distribution over types based on these possibilities. Okay. And then, like before, we'll repeatedly draw types from this distribution, and then to generate the context words in our data, we'll pick them from the associated entity type's word distribution. Okay. And so again, this is just a description of the generative story. In practice, we apply Bayesian inference techniques to infer values for the hidden variables, which then tell us the category of each named entity in context. >>: So what's your context? >> Alan Ritter: So basically, right. So we just use the previous and following three words, and then we also use the words in the entity as well, right? Okay. So here I'm showing some example type lists which are automatically generated by this model, and this is just for three out of the 10 types that we're working with. And these -- I should mention these are also some of the more difficult types. So the easy things are like person and location, but I think these actually look pretty good. So for example -- well, right. So basically these are the top 20 entities which weren't found in any of the Freebase dictionaries. So these are entities that our model was able to categorize automatically. So for products we're seeing things like Nintendo DS Lite, Apple iPod. There's some segmentation errors in here as well, like iPod, Nano and so on, but I think what's really cool here is that we're able to correctly categorize some of these Twitter-specific abbreviations which you just wouldn't expect to find in Freebase. >>: [inaudible] >> Alan Ritter: Yes, so these are TV shows. I don't know. I hear about all this stuff. I don't actually watch these shows, though. I don't know. That's part of the fun of working with Twitter data. Okay. Anyway, so we also looked at how well our model can actually categorize named entities in context, so I annotated a large corpus of tweets with named entities and their types. And here I'm showing our performance compared against a bunch of baselines, including a supervised baseline, which actually does really well on the more frequent types like person and location, but does poorly on these infrequent types where there's few examples in the training data. We also compared against the co-training approach to weakly supervised named entity categorization proposed by Collins and Singer, and you can see that we're actually doing quite a bit better here. Okay. So why is it that LDA is winning in this case? I think there's a couple of reasons for this. The first is that it's able to share information about an entity's type across mentions in a really nice way. So basically we can, you know, figure out the right type of the entity in these highly ambiguous cases by looking at the same entity in other contexts. The other thing is that, you know, because we're using these Freebase dictionaries as constraints in the model, we're just better able to take advantage of this highly ambiguous training data, so we don't have to rely on only the unambiguous cases to learn how to categorize the entities. Okay. Oh, yeah? >>: On the supervised baseline, what did you do for that? >> Alan Ritter: Yeah, so I annotated about 2400 tweets with named entities and their types, and we basically just used a maximum entropy classifier. 
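As a rough sketch of how a Freebase dictionary can act as a constraint inside this kind of model, here is one collapsed-Gibbs-style resampling step for a simplified version of it, assuming mentions have already been grouped by entity string and reduced to bags of context words. The type inventory, smoothing constants, and data structures are illustrative placeholders, not the actual implementation:

```python
import random
from collections import defaultdict

TYPES = ["PERSON", "LOCATION", "PRODUCT", "TV-SHOW", "TEAM", "COMPANY",
         "BAND", "MOVIE", "FACILITY", "OTHER"]
ALPHA, BETA, VOCAB = 0.1, 0.01, 50000  # illustrative smoothing constants

# Freebase dictionaries as constraints: entity -> allowed types (all types if unseen).
freebase = {"JFK": ["PERSON", "FACILITY"], "China": ["LOCATION", "BAND", "PERSON"]}

def resample(mentions, assignments, entity_type_counts, type_word_counts, type_totals):
    """One sweep of collapsed Gibbs sampling over mention-level type assignments."""
    for i, (entity, words) in enumerate(mentions):
        old = assignments[i]
        # Remove this mention's counts before resampling it.
        entity_type_counts[entity][old] -= 1
        type_totals[old] -= len(words)
        for w in words:
            type_word_counts[old][w] -= 1

        allowed = freebase.get(entity, TYPES)  # the Freebase constraint
        weights = []
        for t in allowed:
            # P(type | entity), restricted to the types Freebase allows for this entity...
            p = entity_type_counts[entity][t] + ALPHA
            # ...times P(context words | type) under that type's word distribution.
            for w in words:
                p *= (type_word_counts[t][w] + BETA) / (type_totals[t] + BETA * VOCAB)
            weights.append(p)
        new = random.choices(allowed, weights=weights)[0]

        assignments[i] = new
        entity_type_counts[entity][new] += 1
        type_totals[new] += len(words)
        for w in words:
            type_word_counts[new][w] += 1

# Tiny toy wiring: two mentions of "JFK" with different context words.
mentions = [("JFK", ["flight", "landed", "at"]), ("JFK", ["president", "was", "shot"])]
assignments = [random.choice(freebase.get(e, TYPES)) for e, _ in mentions]
entity_type_counts = defaultdict(lambda: defaultdict(int))
type_word_counts = defaultdict(lambda: defaultdict(int))
type_totals = defaultdict(int)
for (e, ws), t in zip(mentions, assignments):
    entity_type_counts[e][t] += 1
    type_totals[t] += len(ws)
    for w in ws:
        type_word_counts[t][w] += 1
for _ in range(100):  # a few sampling sweeps
    resample(mentions, assignments, entity_type_counts, type_word_counts, type_totals)
print(assignments)
```

The key point the sketch is meant to show is that the candidate types for each mention are restricted to what the dictionary allows for that entity, while the sharing of counts across all mentions of the same entity string is what lets unambiguous contexts help disambiguate the ambiguous ones.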
>>: Is it using the same context, the three words before, three words after? >> Alan Ritter: Yeah, the same features. >>: Did you use Freebase then? >> Alan Ritter: Did it use Freebase? >>: Yeah, did you take advantage of the -- >> Alan Ritter: Oh. >>: -- context of anything that appeared in Freebase? >> Alan Ritter: Yeah. No, I think it didn't look at Freebase at all. So we did also have a Freebase baseline where we basically look up the entity in Freebase and, if it's unambiguous, you know, make that prediction. And this actually has really high precision, but the recall is pretty low. Okay. All right. So, yeah. So I just talked about this new approach to distant supervision based on constrained topic models, which is appropriate for the situation where you have highly ambiguous training data like this named entity categorization task. And so there was a natural question that came up while we were working on this, which is what happens when there's missing information in either the text or the database. And the answer is this leads to errors in the training data, and this is a really general problem that affects distant supervision, both for this weakly supervised named entity recognition task I talked about and also for the more common application of distant supervision, which is extracting binary relations. Okay. So for the sake of comparison to previous work, here we're looking at the case of binary relations, and it turns out that most of the work in relation extraction uses a huge number of features which are highly correlated and overlapping, so these generative models that I've been talking about are kind of not a very good fit for this data because they make really strong independence assumptions. So instead, at this point in the talk I'm going to move on and talk about conditionally trained models. Okay. So this is kind of what the setup looks like for extracting binary relations using distant supervision. So we'll start out by having a relation, in this case the born-in relation, and then Freebase is going to give us a large list of people and the locations where they were born. Okay. So basically for each pair of entities here we can go and search through a large text corpus and find all the sentences which mention the entity pair, so "Barack Obama" and "Honolulu" in this case, and then we can basically just treat these as positive examples of the born-in relation, extract features from these sentences, and train a supervised classifier. Okay. So this is great, but the problem is, you know, what happens if there's some information missing from Freebase. In this case, all of these sentences now function as negative training examples for the born-in relation, which I think you can see is a problem, and this is actually a pretty common scenario. So there's actually a lot of information missing from these databases, and that's kind of the whole reason why we want to extract information from text in the first place. Okay. So before I get into the solution of how we're going to deal with this, I'd like to just walk through a model for distant supervision in the context of extracting binary relations. So we'll start out with a pair of entities, "Barack Obama" and "Honolulu" in this case, and then we're going to get to observe all of the sentences which mention this entity pair. Okay. So now we're going to have a classifier which is going to predict for each sentence what relation it mentions between this pair of entities. 
So unlike the standard supervised learning setup, we're not going to actually get to observe these variables during training. Instead, we only get to see these aggregate level variables, which tell us which relations hold between Barack Obama and Honolulu in aggregate. So the question here is how do we relate these aggregate level variables that we get to observe with the hidden sentence level relation mention variables. And one answer to this question is a simple deterministic OR function, and basically what this is saying is if any of the sentences mentions that Barack Obama was born in Honolulu, then this fact is true. If none of the sentences mentions it, then it's false. Okay. And so we can then tune the parameters of this model by just maximizing the conditional likelihood of the observed facts in Freebase conditioned on the observed sentences in the data. Okay. So for learning here, we're taking an approach based on the structured perceptron, which is an iterative gradient-based update to the weights. In addition, we're taking an online approach to learning, which just means we update the parameters after seeing each pair of entities rather than going through all the data and doing batch updates. Okay. So this is what the gradient looks like. It's just the difference between these two expectations over the features. In practice, these expectations are too difficult to compute, so instead we approximate them with maximizations. So basically what's going on here is we have two inference problems: one where we want to find the best assignment to the sentence level hidden variables conditioned on the observed sentences and facts in Freebase, and the other where we just want to find the best assignment to the sentence level hidden variables given the current parameters but ignoring Freebase. Okay. And so the unconstrained inference problem is totally trivial, but the constrained problem is a little bit more complicated; it turns out that it reduces to a weighted edge cover problem, which we can solve exactly in polynomial time, so this works out pretty nicely. Okay. So there's two assumptions being made here. If a fact isn't in Freebase, we can't extract it from any of the sentences; whereas if a fact is in Freebase, we have to extract it from at least one sentence. And so these assumptions are good because they help to drive the learning, but in the case of missing information, in either the text or the database, they lead to errors in the training data. So how might we modify this model to more gracefully handle the situation of missing data? What we're proposing to do here is to take these aggregate level hidden variables and split them into two parts, one which represents whether a fact is mentioned in the text and the other which represents whether it's mentioned in the database. And so then these factors between the two variables are going to encourage, but not require, that they agree with each other. So now you can see that the facts in Freebase are acting like soft constraints, whereas before they were hard constraints. So for example, now it's possible to extract a fact that's not in Freebase if the local classifier is highly confident, but of course, we're going to have to pay some penalty for doing that. Okay. So the learning is pretty similar to before. The only difference here is that we're now maximizing over these additional aggregate level hidden variables that we've introduced. 
And it turns out this doesn't make any difference for the unconstrained inference problem, but the constrained inference problem gets a little bit more difficult, so it no longer reduces to a weighted edge cover problem in a nice way like we had before. Okay. So of course, the question here is how are we going to solve this inference problem. So again, the goal is to find the best assignment to the sentence level hidden variables conditioned on the observed sentences and facts in Freebase, and like I mentioned, this is just kind of an optimization problem with soft constraints. Okay. So basically what we found here -- right. So we looked at a couple of different approaches. We looked at some exact inference approaches like A* search, which are, you know, time and memory intensive and so don't really scale up to the really large datasets we're working with. But we found that a local search almost always finds an optimal solution, so long as we use a carefully chosen set of search operators that are designed so that it doesn't get stuck in a local maximum. And to verify that we're finding optimal solutions, we can compare against the solution found using A*, and we found that on over 100,000 problems from our actual data, we only missed an optimal solution three times using this approach. And so this is nice because it's fast and memory efficient and it almost always finds an optimal solution. Okay. So of course, the real question is how does this affect the performance in terms of precision and recall, and the answer is it actually makes a big difference. So here I'm showing precision and recall curves on the sentence level extraction task, comparing against human annotations of the data. And so the red curve here is the system which is using hard constraints. By simply relaxing those to soft constraints and setting two additional parameters in the model, we're able to get the black curve, which is actually a huge improvement, and then by incorporating some additional information in the form of a missing data model, we're able to do even better, which is the green curve here. And so I think people realize that these distant supervision models are making some bad assumptions about the data. I mean, all models have to make assumptions, right? But I don't think they realized there's this much room for improvement by better modeling the data in this distant supervision problem. Okay. So I'd like to just pause for a minute here and summarize what I've talked about so far. So I presented an analysis of the challenges in applying information extraction to noisy text, and I talked about the NLP tools we've adapted to Twitter, which are available online; you're welcome to go and use them. I showed a demonstration of a system I've built which is automatically extracting a calendar of popular events coming up in the near future. And then I presented three different probabilistic models for unsupervised information extraction: one which is doing unsupervised event categorization; this new model for distant supervision using topic models, which is appropriate for the case of highly ambiguous training data; and then this recent work we've been doing on modeling missing data in distant supervision. Right. 
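As a rough illustration of the soft-constraint inference just described (and only that), here is a greedy local-search sketch over the sentence-level relation assignments for one entity pair. The per-sentence scores, the penalty weights, and the single relabel-one-sentence operator are simplified placeholders rather than the carefully chosen operators used in the actual system:

```python
NIL = "NA"  # "no relation expressed" label

def objective(z, scores, freebase_facts, miss_text_pen, miss_db_pen):
    """Classifier score of the assignment plus soft penalties for disagreeing with Freebase."""
    total = sum(scores[i][r] for i, r in enumerate(z))
    extracted = {r for r in z if r != NIL}
    # Soft penalty for extracting a fact Freebase lacks (it may simply be missing there).
    total -= miss_db_pen * len(extracted - freebase_facts)
    # Soft penalty for a Freebase fact that no sentence is assigned to express.
    total -= miss_text_pen * len(freebase_facts - extracted)
    return total

def constrained_map(scores, relations, freebase_facts, miss_text_pen=2.0, miss_db_pen=2.0):
    """Greedy local search for a good sentence-level assignment under soft Freebase constraints."""
    # Start from the unconstrained per-sentence argmax.
    z = [max(relations, key=lambda r: s[r]) for s in scores]
    improved = True
    while improved:
        improved = False
        best = objective(z, scores, freebase_facts, miss_text_pen, miss_db_pen)
        for i in range(len(z)):
            for r in relations:  # operator: relabel one sentence at a time
                if r == z[i]:
                    continue
                cand = z[:i] + [r] + z[i + 1:]
                val = objective(cand, scores, freebase_facts, miss_text_pen, miss_db_pen)
                if val > best:
                    z, best, improved = cand, val, True
    return z

# Toy example: two sentences about (Barack Obama, Honolulu); the classifier slightly favors NA,
# but Freebase lists born-in, so the search pulls one sentence toward born-in.
relations = [NIL, "born-in", "lived-in"]
scores = [{NIL: 1.0, "born-in": 0.4, "lived-in": 0.1},
          {NIL: 0.8, "born-in": 0.7, "lived-in": 0.2}]
print(constrained_map(scores, relations, {"born-in"}))
```

Because the database only contributes bounded penalties rather than hard requirements, a confident local classifier can still extract a fact that Freebase is missing, which is exactly the behavior that the hard-constraint model rules out.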
So I'd like to spend just a little bit of time mentioning some other work that I've been doing during my Ph.D., so I've been collaborating with some folks here at Microsoft Research, Bill and Colin, who's now at NRC, on modeling conversations in social media. So in addition to talking about popular events, users of these social networking Web sites are having conversations on a really large scale, and I think this opens up all kinds of new opportunities for data-driven conversation modeling. So for example, we've done some work on unsupervised modeling of dialogue acts and also on automatically generating responses to Twitter status messages, and I'll elaborate just a little bit on this second point. So the approach we're taking here is based on statistical machine translation. So in machine translation the task is, given some foreign text, we want to translate it into English, and in order to learn a model to do this, we have access to large parallel corpora of paired foreign and English sentences. So I think in some sense this conversational task is actually kind of similar: given an arbitrary user utterance, we want to generate an appropriate response to it, and to learn a model to do this, we have access to millions of naturally occurring conversations from Twitter. Okay. And so at a high level, how phrase-based translation works is, you know, given an input sentence, we first segment it into phrases, and then we translate each phrase in the input into a phrase in the response, and so this is a little bit different than the machine translation case, but, you know, there's potentially some reordering here. And to find a good translation, we want both good translations at the phrase level and also a high score according to a language model. >>: Somehow you require a certain kind of understanding in order to generate a response, right, not just like -- [inaudible]. >> Alan Ritter: Yeah. No, that's a good point. Yeah. So conversation and translation are two very different problems, so -- >>: [inaudible] approach to solve that problem? >> Alan Ritter: Yeah, right, right. So we can't have kind of very deep intellectual conversations. Basically we're learning these kind of high frequency response patterns, like, you know, if I see "airport" in the status message, maybe say "safe flight," you know. Or like "I" translates as "you," or, you know, "dinner" translates as "yum." So these aren't very deep, you know, kinds of conversations, but, yeah. But, right, so we have a demo of this available online you can go play around with. So for example, "I'm feeling sick" translates as "feel better soon," and this is the n-best output, so you can see the other things. [laughter] And, right. So this is kind of cute and all, but I think there are actually some interesting applications here. So one might be, you know, conversationally-aware predictive text input or speech recognition. So assuming that a user just, you know -- or assuming that your friend just sent you a text message and you're typing a response to it using some noisy input mechanism, I think we can actually do a better job of predicting what you're trying to type by taking the message that you just received into account. So for instance, if your friend, you know, texted you saying "I'm feeling sick," we should be able to do a pretty good job of predicting how you might respond to that without even seeing any input from the user. >>: Is that just [indiscernible] mentioned some things in the background? 
>> Alan Ritter: Yeah, that's a good question, right. So, I mean, the hope here is that by, you know, combining information, you know, actually generating a customized response, we can do better than -- or handle sort of a wider range of different things. I mean, it's -- you know, I haven't done this experiment, right? So it's hard to say. I mean, template matching is probably a pretty strong baseline for this, for sure. Yeah. >>: So here the idea is that you learn this model and then that can serve as a language model to interpret noisy input like speech? >> Alan Ritter: Exactly, yeah, yeah. So you could probably get sort of a translation lattice out of this or something and combine that with, you know, a lattice from, you know, speech recognition or something, yeah. Okay. So then I've also been doing some work recently in collaboration with folks at New York University and actually also with Bill on paraphrasing between different language styles. So for instance, you know, we've been looking at paraphrasing Shakespeare's plays into modern English and also, you know, modern English into kind of a Shakespearean style. So the approach we're taking here is -- we've basically scraped a bunch of modern translations of Shakespeare's plays off the Web, which we can then use as parallel text to build translation models. And so, for example, one thing we can do is paraphrase lines from modern movies into Shakespearean style. So for example, "If you'll not be turned, you'll be destroyed" gets translated as "If you'll not be turned, you'll be undone." And, "Father, please help me" is translated as "Father, I pray you, help me." And so there's a demo of this online as well, and again, these are kind of some fun examples, but I think there are actual applications here as well. So one thing would be educational applications. It turns out there's only modern translations for 17 out of the 38 plays that Shakespeare wrote, so by translating these other plays into modern English, maybe we can actually make them more accessible to students, and, of course, there's a ton of other authors from the same time period as well. I think looking at paraphrasing between language styles in other domains is also interesting. So for instance, paraphrasing, you know, technical documents into more easily understandable English, or paraphrasing between, you know, noisy informal text on Twitter and more formal text. Okay. So for future work, one thing I'm interested in looking at is extracting richer semantic representations of events from microblogs, so, you know, I think people have spent a lot of time working on information extraction in news articles, but I think there's still a lot of opportunity to sort of, you know, do a better job of extracting events from Twitter. So one problem here is solving this event reference resolution problem, which is to group together all of the tweets which are referring to the same event. So we haven't really solved this yet. I think the representation that we're currently extracting is nice because it's open domain, but I think there's more opportunity here to, you know, extract a richer representation. And also related to that is this problem of schema discovery, which is, you know, given these automatically discovered event categories, can we automatically, you know, extract schemas for them. 
So for example, for a concert, you know, we might expect to see a music artist that's performing at the concert, also the venue and the city where it's taking place, and then ideally we'd be able to automatically extract and fill out these templates. And if we can do this unsupervised, then I think we can do it in an open domain way which isn't restricted to a specific type of event. Okay. So as a little bit more of a longer term agenda for future work here, I'd like to look at scaling up grounded language acquisition to more realistic and open-world domains. So I think we're in a really exciting position right now. We have access to all kinds of realtime text in different languages, and in addition, we have all kinds of realtime sensor data, so for example, realtime data about the weather, you know, traffic, you know, seismographic data. And so I think there's an interesting question here of whether we can link events that people are talking about in text with, you know, events that are showing up in sensor data, which are kind of temporally correlated with each other, to give us some kind of signal there. Okay. So one possible approach to this would be to extend some of the latent variable models that I've been talking about to incorporate both realtime text and sensor data, and maybe by doing this we can ground the meaning of these distributional semantic representations that we've been learning in real-world sensor data at scale. Okay. So I'd like to wrap up and just thank my collaborators, and thanks for coming to my talk. [applause] >>: [indiscernible] So I want to know, so this is a pipeline, right? So [indiscernible] but the problem [indiscernible] they have a lot of duplication. >> Alan Ritter: Right. >>: So basically how important is it to have -- improve your [indiscernible]. >> Alan Ritter: Yeah, I know, those are some good points you're making there. Right. So I think -- right. So in the calendar, at least, we are kind of exploiting a little bit this effect of redundancy, so if we see lots of people saying the same thing in different ways, that helps to improve the precision. Yeah, so right. So definitely I think there's room for improving the performance of part-of-speech tagging and named entity recognition, so, you know, one thing is just annotating more data, like you're saying, and some kind of joint model that looks across different tweets, right, is also really interesting. I think, yeah, you guys have been doing some work in this direction, too, which I think is great. But, yeah, right. I guess my feeling is that the performance on these shallow syntactic annotation tasks like part-of-speech tagging and named entity recognition is probably always going to be lower on Twitter than what we see in news articles, just because it's so diverse and noisy, right? So it's kind of more challenging from that aspect, but I think once we get past these noisy text issues, there are actually other things that become easier, right? Because it has this really simple discourse structure, like I mentioned. So it's kind of just this interesting domain with different characteristics than what people have mostly focused on, you know. >>: In the work about conversations, it looked like you were focusing on sort of the stimulus response pair. Conversations can go on sort of longer. Did you look at -- do a sort of modeling of sequences [inaudible]? 
>> Alan Ritter: Yeah, so we did, in a little bit of a different context, so we also had some work on unsupervised induction of dialogue acts, so these would be things like, you know, trying to classify each post: is it a question or an answer or a status post and things like that. So, yeah, so there we're looking at the sequence in kind of longer conversations. Yeah, for the response generation task we just picked the first message in a conversation and then the response to that, just because that kind of constrains the problem a little bit, right? But, yeah, doing that in the context of longer conversations I think is interesting, but we haven't looked at that yet, yeah. >>: I'd like to hear a bit about establishing a baseline using the [indiscernible] practice example. How do you set a baseline across these high frequency, low frequency, you know, entities? >> Alan Ritter: Yeah, so we basically just count in our corpus, for each entity, how many times it has been mentioned, you know, as long as, you know, we have data for it, basically, right? So we basically count the number of times each entity is mentioned and the number of times each date is mentioned, and then we can look at the number of tweets that mention both of them and then just apply a standard statistical test, like, you know, chi square. We use a G test. I mean, ideally what you'd do is Fisher's exact test, but with the amount of data we're working with, there's some floating point overflow, I guess, basically, that happens, but -- >>: On Twitter, how frequently do you find you have to reset that baseline, since there are emerging trends [inaudible]? >> Alan Ritter: That's a good question, yeah. So we haven't really -- I've just kept the same -- I haven't reset it, but you probably could. It might actually be worth doing. I haven't really looked into that, though. Good question. >>: So have you looked at what kind of errors you have seen in the [inaudible] processing, and how do these errors affect the final output? >> Alan Ritter: Oh, for the calendar, you mean? >>: Yeah, for the calendar. >> Alan Ritter: Yeah, yeah. So there's -- I mean, there's a lot of segmentation errors, to be totally honest. So movie names specifically are really hard because they're often sort of short phrases, right? So often you'll see -- I mean, gosh. I'm having a hard time. Like, you know how you have sort of "Dumb and Dumber" or something as a movie title, right? You might just get "dumb" as an entity, right? Instead of, you know -- I mean, they're really hard to distinguish. >>: To what extent do these errors affect the final calendar output? >> Alan Ritter: Yeah, I mean, so basically in that case, for example, you would see just, I mean, just the name that you're displaying would be incorrect, but still, you can kind of click on it and drill down and kind of see what happened. Yeah, I mean, I should say, you know, for the calendar application, I think there's a lot of -- you know, I could have spent a lot more time engineering this and making it better. It's just kind of a -- there certainly are some errors there, for sure. Yeah. >>: I'm curious. So Twitter's a domain where people do some amount of manual tagging with hash tags and things like that? >> Alan Ritter: Right. 
>>: And on the one hand, you could treat those as just another word which happens to, you know, cross a lot of things, but maybe you want to treat them specially because there's some, you know, user effort going into picking the same thing, and I'm wondering what your thoughts are. >> Alan Ritter: Yeah, no. That's a great question. So I haven't -- I mean, I've just treated them just like another word so far, but I think there is definitely something to get out of them, for sure. So I mean, you know, some of them are -- right. So in some cases they're really useful and they sort of, you know, really give you an anchor on the event, and in some cases you see things like hashtag "I like bacon" or something like that, right? So they're -- I don't know. I mean, I think there's definitely something interesting to be done with the hash tags. I haven't really figured out what it is yet, though. I mean, I've seen some interesting work on trying to segment them into words and things like that, right? But, yeah. I mean, definitely for conferences and things, you know, it will sort of give you a nice focus group of all the tweets on the specific event. I think -- in some sense I think they're almost kind of more useful for people just to -- >>: Maybe [inaudible] because sometimes you're trying to extract like a human readable label for a group, for instance, so a distribution amongst hash tags, even if they're not perfect coverage, might be helpful to people because they're kind of engineered by people. >> Alan Ritter: Right. >>: There are other ways in which -- >> Alan Ritter: Or even if you just take all the hash tags, you know, if I tell you I want to know about this particular hash tag and then just get all the tweets that mention it, then, you know, one interesting question is sort of how can you summarize all the information that people are talking about out there in a short, easily readable way, kind of, you know? >>: Or they could be used as supervised labels, so kind of what you see is what you get. >> Alan Ritter: Right. Find other tweets talking about the same event that don't have the hashtag, yeah. It's a good point. [applause]