>> Michael Gamon: Okay. Hello, everybody. Welcome to what's now the 37th of these symposiums. So it's been a while that we've been doing this. We have the usual format, so we have two talks: one from the U-dub side, one from the Microsoft side. And just a little bit of logistics. Drinks and coffee are right out there. Restrooms are this way. If you parked in a visitor parking spot, you should register your car with the receptionist. And I also want to point out there's a whole bunch of lunchboxes in the corner there. The caveat there is they're not ours. They are free for the taking, but they have been out since lunch. So at your own risk, but the cheese and other stuff is up to date. And, yeah, we're going to start with Ryan, and Fei is going to introduce Ryan. And then we're going to move on to the second talk, and I'll introduce Emre at that point. >> Fei Xia: So it's my pleasure to introduce our first speaker, Ryan Georgi. He got his bachelor's degree from UC Berkeley, and then he joined U-dub, first as a CLMS master's student, and then he continued into the Ph.D. program. He has worked with me, and today he's going to present part of his dissertation work. >> Ryan Georgi: All right. Thanks. So I'm going to be talking about automatically -- is it on? So I'm going to be talking about automatically enriching interlinear glossed text. And before I get to all of what that means, this work is in cooperation with two other projects at the university, the RiPLES and Aggregation projects. So here are the team members for those two projects, and the links if you're interested. And we're supported by an NSF grant. So before I get into what IGT is and what it's used for, just some background information. In our field, we've been developing a lot of new technologies and expanding language coverage beyond just English and Arabic and French. But building these new tools often comes with data requirements. So that's annotated data, unannotated data in large amounts for unsupervised methods, or both for some supervised hybrid systems. And acquiring this data is neither cheap nor quick. There are some particularly interesting high-coverage projects out there. So the Crubadan Project is a project focusing on low-resource languages; it's got 2,124 languages. And the Indigenous Tweets database by the same author has 179 indigenous languages from around the world on Twitter. So that's a very impressive amount of data, even for unannotated sources, because these languages are ones that aren't typically seen very often. That being said, there are 7,000-plus languages known to be currently spoken. So even with those impressive coverage results, that leaves quite a lot uncovered. Most of these languages are low resource. They don't always have large speaker populations, and they're not always strategically important to the Western countries that are funding a lot of this. Some of them have as few as 80 speakers. So spending a lot of time and effort developing electronic resources might not be very viable. But still, having these resources could help answer some large-scale typological questions about humans and our languages. So I'm going to take a look at how we can approach some of these 7,000 languages programmatically with some pretty interesting resources. So one common way of leveraging a resource-rich language to create resources for resource-poor ones is by projection, where you take annotation on one language, align it with another, and project the results, mapping them one to one.
The first problem that presents is how do you achieve high-quality word alignment? Then how do you deal with unaligned words? And finally, how do you deal with it if the two languages diverge from one another in how they represent meaning, even if your alignment says that these words are roughly equivalent? So I'm going to take a look at ways to address all of these. So what is interlinear glossed text? For a lot of linguists, it's going to be a pretty common sight. It's typically used to give examples in linguistic papers. So this is an excerpt from a dissertation on non-macrorole core arguments. I don't know what that is. But the German example here is nice and clean. So just to take a look at that a little bit more closely, IGT instances typically have a few interesting properties. They have three lines: the language, gloss, and translation. The language line is going to be whatever the native language is. The gloss is this interesting hybrid of English and morphosyntactic annotation. And then the translation is typically a natural sentence, typically in English. So one thing we noticed is that the words on the gloss line and the translation line are often mirrored. So we have Peter, we have children. They're not always in the same place, because the language we're annotating might actually present them in a different order. But if we look at the way the words occur on each line, we can infer word alignment. Then we can subsequently use this alignment to project annotation. So if we part of speech tag the translation line, we can use that inferred alignment. Then, since the gloss happens to match up one to one with the language that it's annotating, we can just follow the alignment and assign those part of speech tags to the language line, without having any annotation directly on the language to begin with. Additionally, I talked about the morphosyntactic information there. Here, it's presented in the form of grams. In this case, they're case marking or, well, case annotation, since the morphology is a little embedded and complex there. But that's an interesting little bit of annotation that IGT provides as well. So using IGT, we have a really fascinating resource that Will and Fei have been working on called the Online Database of Interlinear Text, or ODIN. It consists currently of 3,000 PDF documents that have been automatically crawled from the web, 200,000 IGT instances, and that covers 1,400 languages. And right now, we're actually in the process of expanding -- oh, and we have 1.3 million new documents. Those were retrieved with what we call a high-recall, low-precision method. So not all of them necessarily have IGT in them, but even if we're talking about five percent of those being valid IGT instances, we're talking at least one order of magnitude more instances, and who knows how many languages that will add to the database. So to actually use IGT, we start with word alignment. I talked a little bit about how we can do that by following the repeated words. And so ODIN has high coverage by virtue of having all those languages, but each language doesn't necessarily have that many instances. So most languages in ODIN have fewer than a hundred instances in the given language. That's not really enough for typical statistical alignment methods. If you just try doing alignment on the hundred sentences or so that we have there, we found that in our results, they give us an F-score of about .55, .56. So instead, we look at leveraging the gloss line like I described earlier.
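To make that match-and-project idea concrete, here is a minimal sketch in Python -- not the actual INTENT code. The German-style example sentence, the function names, and the tag set are all made up for illustration.

```python
# Minimal sketch of inferring gloss-to-translation alignment by string
# matching, then carrying POS tags through the gloss to the language line.

def heuristic_align(gloss_tokens, trans_tokens):
    """Link gloss and translation words that share a (sub)word string."""
    links = []
    for i, g in enumerate(gloss_tokens):
        # Glosses bundle English words and grams, e.g. "children.DAT",
        # so split on the usual separators before matching.
        parts = {p.lower() for p in g.replace(".", "-").split("-")}
        for j, t in enumerate(trans_tokens):
            if t.lower() in parts:
                links.append((i, j))
    return links

def project_pos(lang_tokens, gloss_tokens, trans_tokens, trans_pos):
    """Assign each language word the POS tag of its aligned translation word;
    the gloss and language lines line up one to one."""
    tags = [None] * len(lang_tokens)          # unaligned words stay untagged
    for i, j in heuristic_align(gloss_tokens, trans_tokens):
        if i < len(lang_tokens):
            tags[i] = trans_pos[j]
    return tags

lang  = ["Peter", "hat", "den", "Kindern", "geholfen"]
gloss = ["Peter", "has", "the", "children.DAT", "helped"]
trans = ["Peter", "has", "helped", "the", "children"]
pos   = ["NOUN", "VERB", "VERB", "DET", "NOUN"]   # tags for the translation line
print(project_pos(lang, gloss, trans, pos))
# -> ['NOUN', 'VERB', 'DET', 'NOUN', 'VERB']
```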
So I have three IGT instances here. First is Oriya, a language spoken in southern and eastern India. We've got Turkish, which is of course spoken in Turkey and other parts of Europe and central Asia. And Yukaghir, which is an endangered language in eastern Russia; this only has about 70 to 80 living speakers. So you'll notice that we have the gram here for third person singular. In Oriya, it's unclear whether that's some kind of clitic or maybe just a pronoun, some kind of agreement marker. In the other languages, it looks a little bit more clear that it's some kind of inflection and agreement on the verb. In any case, we're seeing that same token throughout these very disparate languages. And in addition to the grams, we have all these English words scattered throughout the ODIN database, on all these different languages. So while a hundred parallel sentences might not seem like much for a particular language, 200,000 instances that share this kind of pseudo-language of English words and grams that we see in repetition throughout the different languages, that's something you can get some traction on. So if we use the data for all the languages in ODIN, that can actually benefit the alignment for each language, since we don't actually need to use the ODIN instances for the language we're looking at in order to have parallel sentences between this gloss-line pseudo-language and the translation line. So just to compare some of our results from different alignment methods, the first method is just statistical alignment. If you just run Giza++ on the English and the foreign language for each one of these languages, using just its IGT instances in the ODIN database, the results are, as you might expect, pretty bad, between .5 and .6. If you take all of the gloss lines and all the translation lines from the ODIN database and throw those in, you get quite a big improvement, as you might expect from having so much more data. The third method, the statistical plus heuristic there, that is a case where you take the heuristic method, which is where I was talking about the translation line and gloss line having those string matches, and if you take all the words that match and throw them into the aligner as their own parallel sentences, you get a little bit of a boost, but not much. And then the heuristic method actually works really well. The reason is that the recall is kind of a wash for most of the methods, but the heuristic is so much more precise, given that it really zeros in on those shared words between the languages, or between the gloss and translation. So the word alignment there, that's really handy, but we'd like to get some more information, like part of speech tags, for these languages. Even when we have that high-quality word alignment, the projection might still be problematic. So this is actually a case where even with our heuristic method, we don't get a whole lot of traction. This is from Chintang, which is an endangered language spoken in Nepal. And it just happens that [indiscernible] the folks there have a wonderful team that has gathered thousands of instances of IGT for Chintang, so it's a really great resource we took a look at. The problem with this instance here, you might see, is that despite the gloss being a very helpful and felicitous transliteration of the Chintang data, it really has no overlap whatsoever with the translation. So trying to do our heuristic approach simply won't work.
And that's a problem if we're trying to project annotation. So if we use projection alone, we found that our part of speech accuracy was only 12 percent. And that's because the vast, vast majority of the tokens were simply unaligned, and we had no ability to recover a part of speech tag. So it's not that bad for all IGT instances, but there's still room for improvement. Just to see the results in a couple other languages, it ranges from just one percent in French to 25 percent in Bulgarian. These were just some small instances we used for evaluation. But if we work directly on the gloss line, using some of the information provided there, and bypass the alignment problems entirely, that can boost our accuracy quite a bit. So to give some information on how we can kind of ignore the alignment altogether, we've got two more instances here. This is Gwi. I have no idea how to pronounce that either. But it's an endangered language of Botswana that has about 2,000 to 4,000 speakers. And then an instance from Yaqui. That's a Uto-Aztecan language, with about 16,000 speakers in the Sonoran Desert ranging from Mexico to Arizona. And again, we see that the 1SG gram, the first person singular, is present in all these cases, but it's not always the same meaning. So in the first instance, it's inflection on a pronoun. In Yaqui, it seems to be inflection on a verb. And if you're really [indiscernible], you'll actually notice that in the Gwi example, we actually see the nominative and the imperative there, so there's actually inflection on the pronoun that determines the mood of the clause, which is just a fascinating little aside there. But the takeaway is that the gram is a really helpful indicator of what the part of speech tag might be, but it doesn't map one to one. If you see first person singular, you don't know for certain what the word class is. So instead, if we use those as features on a classifier, we can train it by starting off with labeling a small set of instances with their part of speech tags. We're going to use that as the target label for a classifier. So we start with a couple of features. First is the most common tag for English words in the gloss. So we have "place," "precipice," "do with," and we just add those as features for the label. For the non-English tokens, the grams, you can just use those as is, as binary features, their presence or absence. And then you take all those together and use those to train the classifier. Now, when we run the classifier directly on the gloss line versus projecting from the translation to the gloss line, we get a huge boost in accuracy for part of speech tagging. Largely, that's due to the fact that we don't have to worry about unaligned tokens anymore. We can actually just take a guess at the word with our classifier and the features that it fires. So the next step is, if we have a classifier that we trained with some manually labeled instances, that is great; that's a couple hours' work, and we got some good results. But what if we use more of the ODIN data? So projection isn't particularly great. It's got lots of gaps, as we discussed. But it is really precise. So if we project part of speech tags over the 200,000 ODIN instances and compare that to just the manual labeling on 143 instances, what would that look like? Furthermore, if we transfer the gloss-line part of speech tags to the language line and use that to train a monolingual classifier or monolingual tagger, how does that compare?
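A rough sketch of how such a gloss-line classifier could be wired up, with made-up training data and scikit-learn's logistic regression standing in for whatever learner the real system uses; the COMMON_TAG lookup and the tiny training set are invented for illustration.

```python
# Features per gloss token: the most common English tag of any English word
# in it, plus binary presence features for each gram (3SG, NOM, PERF, ...).

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical lookup of most-common tags for English words (e.g. from WSJ).
COMMON_TAG = {"place": "NOUN", "do": "VERB", "walk": "VERB",
              "he": "PRON", "the": "DET"}

def gloss_features(token):
    feats = {}
    for part in token.replace(".", "-").split("-"):
        p = part.lower()
        if p in COMMON_TAG:                      # English word: use its usual tag
            feats["common_tag=" + COMMON_TAG[p]] = 1.0
        else:                                    # gram: binary presence feature
            feats["gram=" + part.upper()] = 1.0
    return feats

# Tiny hand-labeled training set (gloss token -> POS of the language word).
train = [("place", "NOUN"), ("do-3SG", "VERB"), ("he.NOM", "PRON"),
         ("the", "DET"), ("walk-PERF", "VERB")]

vec = DictVectorizer()
X = vec.fit_transform([gloss_features(tok) for tok, _ in train])
y = [label for _, label in train]
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Tag an unseen gloss token without any alignment to the translation line.
print(clf.predict(vec.transform([gloss_features("sing-3SG")])))  # likely VERB
```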
So some results for this, and just to step through them: the first thing of note here is that this system here, the manually trained classifier, was based on gloss lines that had been annotated for some languages. For Indonesian and Swedish, there were actually no Indonesian or Swedish instances used at all. So this is a result that uses no training data from these languages whatsoever. The ODIN-based classifier here, these are all results using the projected part of speech tags to train the classifier. And it actually performs better than the manual classifier, and there's no language-specific information from human intervention whatsoever involved in this system. And then finally, a supervised system is always going to outperform these. These aren't the most stellar results; a supervised system is going to get somewhere around 90 percent. But even if you use a supervised system trained on about a thousand sentences, which is more roughly in keeping with the amount of IGT data that's going to be available, the difference isn't quite so striking between the systems here and the supervised system. And most importantly, all of these systems here, the ODIN-based systems, can be trained for any language in the database and require no manual intervention, whereas the supervised systems actually need to have someone sit down and create that annotated data. So finally, we also did some experiments in parsing. We did some dependency parsing and phrase structure parsing. We can project those structures just as we did with the part of speech tags, with some clever tree manipulation. And the part of speech tagging that we performed earlier can actually help with those parsers when we have some part of speech tag data. A lot of the dependency parsing that's done in this area is typically unsupervised, and if you assume that you don't have part of speech tags, it makes the whole process much, much harder. So one other thing that we did with the parsing: in Georgi et al. 2014, we actually used dependency trees. We used languages that we had some dependency trees for, and just a few. Going over those dependency trees and projections, we actually learned how those trees diverged. So the projections would come up with one answer, based strongly on the English interpretation and the English parse, and then, comparing those to the trees that we had for the other language, we could actually see some common patterns. Then if we applied corrections to account for these divergence patterns, we could improve the accuracy from 55 percent to 75 percent. So just an example of some patterns. Sometimes we'd see multiple alignments in a language, and knowing the general attachment headedness for the language would be helpful. If we knew whether it was left-attaching or right-attaching from the trees, we could figure out an approximate solution for which word in the multiple alignment would take precedence. The other case was with swapped words. We found, in particular in the Hindi trees -- Hindi is a postpositional language -- that we would often see the English preposition being made the head of the dependency parse, and we found in the Hindi trees that we actually want those switched. So just a few instances were enough to get us to learn the divergence pattern.
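A simplified illustration of the swapped-words correction just mentioned -- not the Georgi et al. 2014 code. The toy Hindi-style fragment, tags, and head indices are invented; the point is just the swap of a projected adposition head with its noun dependent.

```python
def swap_adposition_heads(heads, tags):
    """heads[i] is the index of token i's head (-1 for the root); tags are POS."""
    fixed = list(heads)
    for child, head in enumerate(heads):
        if head >= 0 and tags[head] == "ADP" and tags[child] == "NOUN":
            # The projection put the noun under the adposition (English-style);
            # the learned pattern for a postpositional language reverses that.
            fixed[child] = heads[head]   # noun takes the adposition's old head
            fixed[head] = child          # adposition attaches under the noun
    return fixed

# Toy projected parse for "... ghar me" ('house in', i.e. 'in the house'),
# attached to a verb at index 0.
tags  = ["VERB", "NOUN", "ADP"]
heads = [-1, 2, 0]      # projected: noun -> adposition -> verb
print(swap_adposition_heads(heads, tags))   # -> [-1, 0, 1]
```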
So, finally, what can we do with all this? Projecting the part of speech tags that we arrive at can be used for some interesting typological questions: basically word order, morpheme order, and whether or not we have definite or indefinite determiners. And Bender et al., Emily's group, looked at determining case type using the gloss line and the part of speech tags on there. So whether a language was ergative-absolutive or nominative-accusative, that's a question that you can answer by trying to figure out the case markings on the gloss line. Finally, projection is a common solution. Just to wrap up, we looked at how we addressed getting the word alignments using the gloss line and the various heuristics and methods we could use to improve alignment there. Then as for the unaligned words, just skipping over those entirely and focusing on the gloss line to train a classifier can kind of obviate that problem entirely. And then for language divergence, both using the classifier to look at the gloss line itself and not assuming that it's going to have the same part of speech tag as the English translation, but also learning those divergence patterns from the comparison between the trees. So finally, the actual INTENT package itself is at the core of all this. That stands for the interlinear text enrichment toolkit. So that's the software that's been built to automatically do this enrichment, and it projects the dependency and phrase structures. And currently, we're about to release the 2.1 version of the ODIN database -- not yet with the expanded coverage, but with the current 1,400 languages -- and in addition, INTENT is going to be used to do the part of speech tagging, dependency structure and phrase structure projection, and word alignment, and provide those enriched files in a 2.1 release. And it's going to be using the XIGT formalism, which is an XML serialization of IGT data that makes it very nice and easy to add annotation tiers to IGT instances, as well as alternative tiers. And so there is a basic web interface too for INTENT that you can play with to create your IGT instances and see how they work. Just to show a quick example of that, this is just a quick little example of Chinese -- I'm told that this is a nice ambiguous sentence in Chinese between Zhangsan does not like anyone and no one likes Zhangsan. So if you go ahead and run this through INTENT, it will just create the XIGT file for you. It's a little intimidating looking, but we're working on a tool to process that and make it easier for human annotators. And then this just gives you the XIGT output. Then, running it through the enrichment, you can also see we get the part of speech tags back. And I think in this case it was actually Zhangsan who/whom all not like. So noun; who/whom, pronoun; all, debatably a determiner; not, adverbial; and then like is -- we realize that like is probably seen more often as a metaphorical adposition than it is with warm and fuzzies, given that our tagger is largely going to be trained on Wall Street Journal type stuff. So yeah. You can play around with this. The results aren't always going to be the highest quality. But you know, we can use this for any language in ODIN. So we're going to be working on that in the coming months. So -- I think that's it? Any questions? [Applause] >> Ryan Georgi: Yeah? >> So, have you considered typological constraints in these projections? Like the thing that jumps out to me about the example you just gave is that there's no verb in that sentence. So there should be a verb. >> Ryan Georgi: In the Chinese or -- >> In the Chinese.
>> Ryan Georgi: Okay. >> So could you use that typological constraint as a way to do better part of speech projections, [indiscernible] between like and like? >> Ryan Georgi: So using the typological constraint, you mean feeding it in the information that -- for Chinese, that that's not expressed as a verb, you mean? Or -- >> [Indiscernible] that it's a sentence, we hope to see a verb. >> Ryan Georgi: Right. >> [Indiscernible]. >> Ryan Georgi: Oh, okay. Yeah, yeah, I gotcha. That is a verb. Yeah. No. And actually, that's a good question. One of the problems with IGT is that it's not always full sentences. So part of the reason that this is as kind of messy-seeming as it is, and I didn't get into it too much, is that the ODIN database is from PDF documents, and this nice, neat XML format that we have is something that we developed to store it in, but the PDF documents themselves, we actually have to extract it from using PDF-to-text conversion. So there are all sorts of interesting corruptions from the PDF-to-text conversion, but then on top of that, IGT is also occasionally sentence fragments or sometimes just single words. So it's not always easy to treat it as a standalone sentence. Although that's a good idea; maybe defaulting to that would be a good thing to try, because probably most of the instances are going to be whole sentences. Yeah. >> Thinking further about this error, because it's interesting, in this case, you've done the part of speech [indiscernible]? Is that true? >> Ryan Georgi: Yeah. >> But you have a really nice alignment. Presumably, if you look at the source language [indiscernible] the part of speech [indiscernible] -- sorry, the tags for the [indiscernible] translation line did not treat that like it was in that position. >> Ryan Georgi: Yeah. Actually, I think we looked at this the other day, and it was -- if I can see it here. Because I think it was the case that the English language tagger -- okay. So this is -- this is it here. >> [Indiscernible]. >> We're not seeing it. >> Ryan Georgi: Oh. [Indiscernible]. Maybe it actually did -- >> Yeah? >> Ryan Georgi: Oh, it does. Does not like -- yeah. Because -- and this was -- this is where we realized, oh, wait, where was our part of speech tagger for English trained? Penn Treebank, which is Wall Street Journal. So again, they're probably not talking about, like, you know, McDonald's warm fuzzies for Burger King. They're probably talking about some, you know, metaphorical comparison. But -- >> So in general, could you combine the gloss-based part of speech tags with projected ones in the case that you have alignments from the [indiscernible] the heuristic alignment? >> Ryan Georgi: So like a back-off, like start -- >> Back-off or some way to combine those [indiscernible]. >> Ryan Georgi: Yeah. So there are actually two things that I didn't talk about directly in here that I tried, and I didn't put them in because the results ended up not being very promising. But the first is using projection and then filling in the unaligned tokens with various methods. One being just let's call everything a noun and see how that does. The other was actually running the classifier on those unprojected tokens. That was one way of combining them. The other way was actually using the classifier and then feeding into it as a feature the part of speech tag that was assigned to the word aligned with it, if one exists.
And so that actually -- that tended to hurt more than it helped, interestingly. I think the feature of just, if you have an English word, what's its most common tag, is so strongly predictive that it actually didn't help above that. That being said, the supervision that we -- or the evaluation corpus that we have for this is still pretty small. So a lot of the results and the conclusions we're drawing are still kind of preliminary. So -- yeah. >> Given that there's a large number of pairs of languages, could you explain how the heuristics work for the [indiscernible]? >> Ryan Georgi: So the heuristics that we worked with -- and I think [indiscernible] you said there's a huge number of pairs of languages, right? Yeah. So I think in our database, there are some instances where the gloss and translation are German or non-English, but it's so overwhelmingly English that we just assumed that one of the pair is English. But that being said, what the heuristics entail is typically we'll break apart the morphemes and compare those individually. We'll do some stemming, just to see if there's running and ran or some, you know, alternate version of the word. We'll also compare the gram: if we have, like, a 3SG standing all by itself, we'll see if there's a he or she or it floating around in the English. Pretty simple things, but putting them all together, we get a pretty good alignment. >> I was wondering if you could also get a lot of insights from the types of errors that are being made, or cases where, you know, the classifier, for example, has low confidence. I think that would indicate that, you know, you're dealing with a phenomenon that's either like [indiscernible] or that might be worth more human investigation. >> Ryan Georgi: That's a good point, yeah. I hadn't looked at that. And actually, in finding instances to kind of talk about, that case with the imperative inflection on a pronoun actually brought to mind, like, I really wonder what the classifier thinks of this, because that's a really bizarre thing to see on here. I have a suspicion that things like imperative or perfect or anything like that are going to be so strongly weighted for a verb that it might [indiscernible]. Yeah, it would be interesting to see the confidence scores on that, to see, like, okay, what's happening here? [Applause] >> Michael Gamon: So the second talk is by Emre Kiciman and Matt Richardson. Emre is going to present this. This is work that actually was presented at [indiscernible] this year, right? >> Emre Kiciman: It's in submission at dub, dub, dub. >> Michael Gamon: Oh, it's in submission. Oh. >> Emre Kiciman: And it was presented at KDD. >> Michael Gamon: KDD, okay. >> Emre Kiciman: [Indiscernible]. >> Michael Gamon: So there's also -- there's a part here that has been presented, but there's also a part that is entirely new. And to introduce Emre, I mean, I have the good fortune to be able to work with Emre on a number of projects involving social media, and that's an area that he's very interested in. And the talk today is about how to actually discover outcomes from social media, from the language that we find there. >> Emre Kiciman: Thanks very much, Michael. Hi, my name is Emre, Emre Kiciman. And I want to talk to you today about some work that we've been doing here. Like Michael said, with Matt Richardson, we started this project a while ago, and we had a paper out at KDD over the summer about some of the techniques that I'll mention today.
And the rest of the work is in submission at dub dub dub right now. And we did that work over the summer with Alexandra Olteanu from EPFL and Onur Varol from Indiana University while they were interning here at MSR over the summer. So what is this project about? Why do we care about what happens? Every day, we all decide something. We all decide to take some action. We're in some situation and we want to get out of it. We just want to know what's going to happen. And so we start off by saying, okay, I'm going to pick the thing to do next that is going to be best for me, whatever my goal is, whatever I personally want. And this works for some people. Some people say, okay, should I do this? They think it through. They have a good kind of knowledge about the world around them and about the future, and then, you know, they pick the right thing to do, and it turns out great. Not all of us, though, do so well. Some of us, you know, occasionally can't actually predict what's going to happen. We don't really understand the situation very well, and we take an action that goes off in the wrong direction and doesn't work out for us. So what this project is about is saying, well, you know, we actually have a lot of opportunity to learn about what happens in people's lives -- what happens after people take actions, or what happens after people find themselves in certain situations. There are hundreds of millions of people who are regularly and publicly posting about their personal experiences, often to the chagrin of the people who follow them and listen to them on Facebook and Twitter. But there is an incredible amount of detail there. And so the question is, can we aggregate this, can we learn about what happens in different situations, and can we bring that back into how people make better decisions in their own lives? So our broad goal is to build a system that can analyze really the humongous amount of information here and let us answer open-domain questions about the outcomes of actions and situations. Now, like I mentioned, we did some of this work before with Matt Richardson, and a lot of that work focused on how to go from the social media messages to timelines of events that we would then analyze. This presentation is going to be more about how to apply propensity score matching analysis to analyze those timelines that are based on social media posts, and then evaluate the performance of that analysis across a wide variety of domains. So let me go into kind of the process that we're doing. First we start off with a number of social media messages, and then we're going to extract out some representation of what's happening in people's lives based on these texts. And given some action that we care about, we are going to separate these timelines into two groups: what we'll call a treatment group, where people experienced the action or the situation that's of interest to us, and a control group, where people did not experience the situation. Once we have these two groups of users and their timelines, we're going to learn a propensity score estimator. So basically we're going to learn a function that estimates the probability of someone having this treatment given everything that happened before in their lives. And then we're going to stratify the users based on this likelihood. Now, I'll go into a little bit more detail about what happens when we stratify these users and why we do it.
But then once we have this, we'll go ahead and calculate, for every outcome that we see happening after this event -- we're going to iterate, and we're going to calculate the difference, basically, between the likelihood of someone experiencing this event given that they have the treatment versus the likelihood that people experienced this event given that they didn't have the treatment. We're going to calculate that for each stratum that we see and then sum that up for the population. So how do we build out these individual timelines for the experiments I'm going to talk to you about here? We used English-language Twitter data from the firehose, aggregated by user ID. We cleaned these tweets to remove URLs, @-mentions, and stop words, and we applied basically stemming and normalization of common slang -- so, for example, we read the letters. Then we extracted all unigrams and bigrams that people were mentioning as events in someone's life, and we placed them on a timeline according to the metadata of the tweet. We identified treatment basically by finding users who mention a target token at any point in their timeline, and the control is the users in our data set who don't mention that target token. And in the work at KDD, we actually describe a set of much more sophisticated techniques. We wanted to focus here on the propensity score analysis, so we basically tried to remove as much computational overhead as possible for our experiments. But in our KDD paper, we found that there were a number of things that were important. We applied experiential tweet classification, so we only considered tweets where people seemed to be talking about something they experienced personally. Rather than just taking unigrams and bigrams, we applied phrase segmentation to find kind of cohesive phrases that people were using. And then we clustered these phrases, and then we also applied temporal expression resolution. So if someone is talking about last year or last week I did something, we would shift the events along the timeline appropriately. Talking about the propensity score analysis, I want to do, in one slide, just a quick introduction to propensity score analysis. How many people here are familiar with counterfactual analysis and propensity score analysis? Okay. So I was hoping I could ask someone more questions after the talk, but that's fine. Okay. So we want to measure the outcomes of some treatment versus no treatment. So this is like a classical social sciences experiment thing, right? We have a randomized trial or something, and you want to figure out what happens. We don't have a randomized trial. So what we can do is a thought experiment. Ideally, we have an individual who had some treatment, so we're going to say this individual i got some treatment, one, and then we have the outcome, which is this function Y_i(1). And we'd like to see, for this individual, what would have happened if they hadn't taken the treatment. So ideally, if Michael, you know, took some action, we'd like to take Michael and observe him in some parallel universe where he's exactly the same, except he didn't take this action. And then we'd look at basically the differences in outcomes, and that would be the actual effect of this action. But we can't measure Michael in two parallel universes. He either takes the action or he doesn't take the action. We only get to observe one of these cases.
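In standard counterfactual notation, what's being described verbally is roughly the following (a sketch of the textbook setup, not a formula from the slides):

```latex
% Y_i(1): individual i's outcome with the treatment; Y_i(0): without it.
% Only one of the two is ever observed for any given individual.
\tau_i = Y_i(1) - Y_i(0)                      % the (unobservable) individual effect
\tau   = \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)]  % the population-level effect
```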
So instead, what we can do is we would find, say, Michael taking the action; we would take someone who is very similar to him who doesn't take the action, and we would basically compare the outcomes there. And if Michael has a twin who is exactly the same as him, then we can estimate what would have been the effect of this action by comparing Michael to his twin who didn't take the action. Now, of course, finding someone else who is a twin in every important way is really difficult because, you know, technically, I guess you would say we live in a world described by very high-dimensional vectors, and the curse of dimensionality means that there is no one who is really very close to any given individual. So instead, we back off a little bit further and we take a look at population-level effects. So we're going to estimate what the outcomes are after a whole population of people take this action, and we're going to compare that to the estimate of outcomes, the expected value of outcomes, for people who don't take the action. And as long as these populations are comparable, statistically identical, then this gives you a good expectation on the actual outcomes of this action being, you know, one or zero. Now, how do you get these comparable populations? Randomized trials are one way. When you're working with observational studies, a different way is to apply something like a stratified propensity score analysis. What this does is essentially split the original population that you're looking at into multiple comparable subpopulations, and each of these subpopulations, because you're stratifying on a function that's taking into account all of their features, basically ends up balancing the features that were relevant for this action. So in our case, the features of a user that we're going to be balancing on are going to be all of their past events, which is the unigrams and bigrams of all the messages that they've mentioned in the past. And it's worth noting that our control users don't actually have a past or future. They never had a treatment event, so we can't tell what we're going to align them on. So we just pick a random time at which to align them. And then their past events we define as everything that happened before this random time. So that is, you know, of all the possible times when they didn't take this action, we just pick one. Now, our task in learning a propensity score estimator is to learn the likelihood that the user mentions this treatment token. So we basically learn a function: what is the probability that the next word is going to be this target token that we care about -- sorry, that they're going to be in the treatment versus control class, given the past events. And in our case, we're training our estimator with an averaged perceptron learning algorithm, and we're training this algorithm based on all the timelines that we've extracted. Now, we use this then to bin people, and we can then just quickly take a look at whether the propensity score estimator function is doing a good job. There are two things that we care about. One is that the populations are actually matched: the control and treatment populations within a stratum should have similar distributions of features. The other thing that we'd like to see is that the propensity score is actually estimating things correctly, and this function shows that.
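A small sketch of that estimation-and-binning step, with invented timelines and scikit-learn's logistic regression standing in for the averaged perceptron used in the actual work; the feature set here is just a bag of words over each user's past events.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Each string is one user's past events (tokens seen before the alignment point).
past_events = ["headache stress work", "gym run protein", "stress anxiety sleep",
               "coffee meeting deadline", "run marathon training", "work deadline stress"]
treated = np.array([1, 0, 1, 0, 0, 1])   # did the user mention the treatment token?

vec = CountVectorizer()
X = vec.fit_transform(past_events)
model = LogisticRegression(max_iter=1000).fit(X, treated)
scores = model.predict_proba(X)[:, 1]     # estimated P(treatment | past events)

# Stratify users into bins of similar propensity; within a bin, the treated
# and control users should have roughly balanced past-event features.
n_strata = 3
edges = np.quantile(scores, np.linspace(0, 1, n_strata + 1))[1:-1]
strata = np.digitize(scores, edges)
print(list(zip(np.round(scores, 2), strata)))
```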
And then how do we measure the outcome in these experiments? We're treating outcomes -- all of the words that people say after they take the treatment or don't take the treatment -- as binary values. So this is: did they ever in the future mention that word, you know, Y, or not. And we're going to then measure the average treatment effect, summed across all the strata, as the increase in likelihood of them mentioning some event given that they took the treatment versus that they didn't take the treatment. Now, propensity score analysis is borrowed from the causal inference literature, but if we actually wanted to make causal claims, we'd require a fair amount more domain knowledge. And the reason is there's a fair number of assumptions being made here, but there are two in particular that propensity score analysis makes, and causal inference in general makes, that we don't meet. First is that we would have to assume that all the important confounding variables are included in our analysis. And, you know, even though people talk about a lot of things on social media, there's no guarantee that they talk about everything that's important, or that everyone talks about everything that's important, and so we don't meet that assumption. And then there's also this fun assumption that the outcomes that happen to one individual should be independent of whether other people take a treatment. And in a network environment, that's generally not the case. What happens to you depends on what other people are doing as well. So because of this, we don't say that we're actually pulling out causal relationships between outcomes and these treatments. But having said that, we do find that this analysis gives much better results than simple correlations. So that's kind of fun. So we evaluated these techniques to see whether or not they did well. We picked 39 specific situations, and these situations were chosen from nine high-level topics. We chose these nine domains for diversity. Within the business topic, we looked at construction and maintenance, people mentioning financial services related stuff, or investing. In health, we looked at several mental issues, diseases, and drugs, essentially. And within the category of societal topics, we looked at general societal issues, law, and then relationships. This categorization is taken from the Open Directory Project, so it's an existing taxonomy that we borrowed. And then within each of these high-level topics, we chose several specific situations that we basically picked at random from searches that people are doing on Bing already. So we wanted some grounding that we were looking at questions that people cared about. So that's why we borrowed these from Bing. The actual data we analyzed was three months of firehose data. So what we did was we looked at March 2014. We looked at everyone who expressed taking these actions -- so they mentioned they had high blood pressure or high cholesterol, or they mentioned they were taking lorazepam or Xanax, trying to lose belly fat, getting divorced, finding true love, or cleaning a countertop. And then, given that they mentioned this in March 2014, we grabbed all of their tweets from the prior month, February 2014, and all of their tweets from the month after, April 2014. Then what we do is we run our analysis.
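For the average treatment effect calculation described at the start of this passage -- summing the treated-versus-control difference across strata -- a continuation of the earlier sketch, again with invented data; the stratum labels here stand in for the output of the stratification step above.

```python
import numpy as np

treated = np.array([1, 0, 1, 0, 0, 1])   # same toy treatment labels as before
strata  = np.array([0, 0, 1, 1, 0, 1])   # example stratum labels per user
outcome = np.array([1, 0, 1, 0, 1, 0])   # did the user mention the outcome word later?

def stratified_ate(outcome, treated, strata):
    """Stratum-size-weighted sum of the treated-vs-control outcome-rate gap."""
    ate, n_total = 0.0, len(outcome)
    for s in np.unique(strata):
        in_s = strata == s
        t = in_s & (treated == 1)
        c = in_s & (treated == 0)
        if t.any() and c.any():           # need both groups within the stratum
            diff = outcome[t].mean() - outcome[c].mean()
            ate += (in_s.sum() / n_total) * diff
    return ate

print(stratified_ate(outcome, treated, strata))   # positive: treated mention it more
```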
We extract the outcomes that occur after people experience these situations, and then we also look at the temporal evolution of when these outcomes occur after people take the treatment. We also got judgments on how good our results were from Mechanical Turk. And then we also compared it to existing knowledge bases. I'll give you some examples now of the ranked outcomes and the temporal evolution, and then go into the results of our precision judgments across the aggregation of these results. I won't talk about comparing to existing knowledge bases in this talk. So for example, someone mentions gout. What are they likely to mention afterwards? The top-ranked phrases -- the unigrams and bigrams that people mentioned -- were basically people mentioning flare-ups, uric acid, uric more generally I guess, and flare more generally. Big toe, joints, aged, and then at the bottom, you have bollocks -- so I'm not sure whether there's a specific UK kind of bias in this data. But most of these words, you can see, are related to gout, which is a disease where uric acid builds up in your body and causes pain in various joints. And for each of these, we can take a look at the lift: how much more likely are you to say this? So this is the absolute difference in likelihood of saying this. So if you were one percent likely to mention, say, flare-up, now you'd be 5.1 percent likely, given that the difference is 4.1 percent. And then this is the Z score. So all of these results are pretty statistically significant. Another example: people who are trying to lose belly fat. They're more likely to talk about burning, ab workouts, you know. And then they're more likely to mention videos and playlists and stuff like that afterwards as well. I'm just going through a couple of samples. If you have any questions about something specific, let me know. If you mention triglycerides, you're more likely to talk about your risk, or statins, and lowering, I assume, blood pressure, cardiovascular issues, healthy diet, fatty acids, and things like this. If you mention that you have a pension or you're saving for retirement, then there's a whole other set of taxes and retirement and budget stuff that starts to come out of this. We can also, as I mentioned, start taking a look at the temporal evolution of these terms. There's a couple I want to mention. So most things in general -- most of our outcomes we saw roughly co-occurring with the mention of the target, so occurring on the same day. So this graph is the number of days before and after the treatment on the X axis, and the Y axis is the expected number of tweets, and each of these graphs has a different scale for that, so I don't label the axis directly. But so, let's see -- there was one I wanted to find here. So yeah, so for example -- oh. Pain. So tramadol is a painkiller, and so people take it. They're mentioning pain; that's the red line. The treatment line is the people who took tramadol. And their pain mentions go up on the same day that they're talking about taking tramadol, and then after they take tramadol, the pain mentions go away. And then about a week later, they start talking about pain again. I don't understand tramadol well enough to know why.
I don't know if the course of medication lasts a week or if this is commonly used for certain kinds of illnesses where pain recurs weekly, but this is the type of thing that we're seeing, and, you know, obviously, we'd like to go in with a domain expert and understand these better. We talked about -- we see people mentioning Xanax and weed. So it's important to note that people take Xanax both medicinally and recreationally. Apparently. And so you see people who are taking it recreationally then mentioning weed. We have started to do some work on trying to cluster these outcomes so we can understand that there's a certain class of outcomes that might occur for some people, and maybe different outcomes occur for different people. But this, I hope, gives you an idea of the types of temporal patterns that we see. >> What's the weird person? >> Emre Kiciman: Weird person? That's unfortunate. I think that's people who are using Prozac basically as slang to denigrate someone. So they're saying he's a weird person or they're on, you know, something like that. >> I'm just wondering, does negation factor into this at all, like we have weed or we could have no weed or something like that? Is it [indiscernible] bigrams or unigrams or, I mean, is that factored in? >> Emre Kiciman: No, we don't. Yeah. So that's one of the types of things that would be really great. So for example, we see in a lot of these anxiety drugs and stress-related stuff that people do something and their stress goes away and they're saying, oh, I have no stress now or my anxiety is much better. And so this is simply just people mentioning the issue, and it could be that it's going away but they're mentioning it more, or it's actually not going away and increasing. >> Just the mention of the event itself, whether it exists or not, doesn't matter, just -- >> Emre Kiciman: Correct. And it would be great -- that would be the type of thing that I would want to put into the initial generation of the timelines. When we're converting from the raw text into this representation of what people are experiencing, it would be great to say that's not weed instead of weed. Maybe we could pick a different example. Well, anyway, doesn't matter. You understand. Great. Yes? >> So these lines are tracking the outcome mentions? >> Emre Kiciman: Yes. >> The boundary between gray and white is where the treatment mention is. >> Emre Kiciman: Correct. Yes. Sorry I didn't explain that. >> So it's a little surprising that we're not seeing more talk about weed early on, for people who are taking both Xanax and weed recreationally. >> Emre Kiciman: Correct. Yeah. Yes. It's possible that there's a temporal thing going on -- maybe there's some change happening in the February through April time frame where, I don't know, maybe people are going to more outdoor raves or something like that, getting into the spring. So it's quite possible we're seeing systematic artifacts like that. >> And what was the timeline? Like in the last couple years? >> Emre Kiciman: March 2014 is the time when we looked for the target events, and then we grabbed a month before and a month after as well. >> I was just wondering if, in that particular case, some of the legal changes around the country had an effect on that. >> Emre Kiciman: Possible. Spring break could be -- I don't know. There's all sorts of things that -- yeah, yeah. So this is only three months.
It would be great to start applying this to really longitudinal data sets, and I hope that that would start to factor out some of these things. Yeah? >> How confident are you that this data is representative of the larger population of people who don't freely share their stuff on Twitter? >> Emre Kiciman: I'm pretty confident it's not. [Laughter] >> Emre Kiciman: But more seriously, there are going to be some domains where people are more open and more open to discussing what's going on in their lives. There are also a lot of domains where, even if most people aren't talking about it, the ones who are, are having a representative experience. But bias is a big issue here. And we don't have a good handle on the bias that goes into generating the social media timelines. There are some things we understand. We know people are more likely to talk about extreme events, less likely to talk about everyday things -- I mean, I think the rule of thumb is, you know, do you feel like your friends would think this is boring? If so, many fewer people are going to mention it. But no, we don't understand that bias very well. There's a couple of experiments that would be nice to get started to understand that better, but it's a really hard one. Okay. Evaluating correctness. So here, this is where I mentioned we had Mechanical Turk workers judge the correctness of these results. And then we calculated, based on that, precision at a particular rank. And yeah, as I mentioned, I'm not going to go into the knowledge-base comparison here. So to help our judges understand whether something was correct or not -- so say the treatment is dealing with jealousy issues and the outcome is wake up. What we'll do is we'll show them an example tweet where someone is mentioning jealousy issues, and then we find a tweet by the same user, where the same user later on says, I need you to wake up because I'm bored. And now, we'll give each Turker two to three of these pairs of messages. We'll then generate several sets of these paired messages for each of these treatment-outcome pairs, and we'll get the Turker to say, you know, is someone who is mentioning dealing with jealousy issues later more likely to talk about wake up. So here, I would have judged this one as wrong. And so -- well, actually I should take a look and see what our workers said for this specific example. But then you can see others: suffering from depression -- you know, if you think depression, okay -- and then later on self-harm is the outcome. This one, maybe. Paying credit card debts and then talking about apartments later. So someone has said they paid that credit card bill, and then later they talk about checking for apartments. That one, maybe they're really more likely to talk about that. So this is the type of task we're giving to our Mechanical Turkers. And in aggregate, we found that when we sort all of our outcomes by their rank -- so the top five outcomes, top ten outcomes, et cetera -- we found that the highest ranked outcomes had a precision close to 80 percent, where people were just judging this is relevant or not relevant. And then it drops from there. And we see that it's basically correlated with both our average treatment effect, which is how this is ranked, as well as how the statistical significance ranks things. And it drops down, cumulatively, to between about 50 and 60 percent.
Non-cumulatively, you're dropping into like the low 40s to high 30s down at the tail end in terms of the precision. We do see variance in precision across domains, so some domains are doing better than others. Law is doing really well. Health is pretty consistent and doing just over around 65 percent or so precision. Financial services are doing pretty well, and construction and maintenance was our worst. A lot of this change in precision is actually due to data volume. Not all of our scenarios had the same numbers of users, and as we saw more users in our data sets, the precision tended to increase. I'd be happy to talk about more things as we get to the end of the slides. I don't know how much time we have or if I've already gone over, but as for future directions, there are many missing capabilities in this system. We're looking at how we can start to reason about not just binary events or binary experiences but also continuously valued events, looking at frequency-based analyses, and also just generally being more aware of time. So right now, we ignore whether something happens immediately or whether it happens a week or two in the future, and that makes a difference to the calculations. And we're also basically going to reintegrate some of the text parsing that we had done earlier into this analysis. I'm working with some distributed systems researchers to build out a system that's capable of doing this analysis at scale. Right now, these 39, 40 scenarios I mentioned were done by pulling basically a copy of the Twitter data for that particular scenario and running things, you know, offline. We'd like to be able to do that much faster and at large scale. With several really great collaborators, we're looking at domain-specific analysis for not just individual-level questions but also some policy issues: what are people's experiences with, say, bullying or with depression and things like that. And I'm also looking at application ideas: how can you start to use this data to provide a nice backing for other analyses or other kinds of interfaces. To conclude, we focused here on demonstrating how stratified propensity score analysis can be applied to personal experiences to extract outcomes of specific situations. We applied this to a diverse set of domains and evaluated how well the outcomes looked to Mechanical Turk judges. And, yeah. That's it. I'd love to talk to you more about this work and answer any questions you have. Thank you. [Applause] >> One thing Twitter can give you, at least in some cases, is location. I'm wondering if you've considered whether some of these outcomes might differ by region or by country, and if you looked into that. >> Emre Kiciman: No. So exact location, you would only have for 1 or 2 percent of tweets, but you're right. We can identify kind of larger regions, like what city or what state or country people are in, for about 60 or 70 percent of tweets. We haven't incorporated that as a feature in our analysis, at least not explicitly. So if people imply their location through their text, that might be getting captured here somehow in the analysis of the text, but we're not taking it into account in any other way. But you also raise another kind of broader point, which is that the outcomes are likely to vary by individual. And we do see traces of that. So looking at even just our stratified outcomes, we do have examples where -- let me see if I can pull one up. Let's see. So this is from a different analysis.
We do have examples where, for example -- this is the likelihood of people to perform suicide ideation, based on analysis of not Twitter data but Reddit data. And we see that if people mention the word depression, their likelihood of later performing suicide ideation -- talking about suicide -- goes up, and how much it goes up depends on their likelihood of mentioning the word depression. So as the propensity goes up -- you know, if they're not likely to talk about depression, then if they do, their likelihood of performing suicide ideation goes way up. If they are very likely to talk about depression, actually using the word depression doesn't seem to make a difference. That's one example. Yes. Go ahead. >> In the realm of other data that Twitter gives you that would be interesting to look at: the network effects. If someone has a high percentage of @-mentions with another user that did mention suicide, are they more likely to take on mentions of depression, or vice versa, do you think? >> Emre Kiciman: Yeah. I don't know very much about the corrections you have to do in the math and the analysis for analyzing network effects of treatments. But yes, I mean, there are all sorts of interesting interactions -- a lot of interesting things that you can treat as being an action that you want to analyze the outcome of. And how you get that into the timeline, yeah, there are all sorts of ways you could do that. Yes? >> A couple questions on the crowdsourcing. You selected the top -- you said the top three outcomes that you had judged by the crowdsourcing judges? >> Emre Kiciman: So for every scenario that we had run our analysis for, we selected the top 60 outcomes. And then for each of those outcomes, we gave each Mechanical Turk judge at least two paired messages -- so a treatment, then an outcome, and then user two says treatment, then outcome -- and then, assuming we had enough data, we would also repeat that with a different pair of treatment-outcome messages, three times. >> And then how many judges did you have for each set? >> Emre Kiciman: I believe we had three. But I'd have to check. >> So it would be interesting to see, if you exploded that out into 10 or 20, what the distribution of that would be, because it seems like it might be very subjective whether wake up is associated with -- I forget what the control was. >> Emre Kiciman: That's true. Yeah. With jealousy. >> Jealousy, wake up and smell the coffee, he's not that into you, kind of thing. >> Emre Kiciman: Yeah. >> Wake up in this particular instance, you know, the context might affect the judgment, and it might actually be interesting to see distributions of what people think would be more likely to be mentioned given the -- >> Emre Kiciman: After. >> No. Before? The stratified results or the -- these? The one with the blue and orange. >> Emre Kiciman: Mm-hmm. >> The one with the reps. >> Emre Kiciman: Yeah. >> Would you go to the slide after this? >> Emre Kiciman: You're right. Okay. Sorry. Yes. I skipped around. >> Can you explain the last one with the can't? What's -- >> Emre Kiciman: So this was actually results from a different analysis that happened to be in the slide deck. Here -- so in the presentation, I talked about an effects-of-causes analysis, where something happened and we look at the effects afterwards. Here, we were actually doing a causes-of-effects analysis.
So we were looking for what happened that increased the likelihood of one event. So we iterated over all the -- so before, I said we picked, you know, looked at the outcomes Y; here we're actually looking at the precursors X and iterating over those. And so here, all of these graphs are showing the likelihood for people to post in a suicide ideation forum. This is joint work with some collaborators [indiscernible]. And we're looking at basically the words that seemed to increase the likelihood of people doing this. Yeah? >> I was wondering if it would also make sense to see if, after, you know, a given event, you could actually cluster people into different groups that sort of react or talk differently about that event. Right? So imagine, you know, a bunch of people taking drug X. Many of them may experience the expected effects, and then a small group, but consistently, may experience adverse effects. Right? So by looking at that spread, you could actually determine different outcomes and different groups of people that experience things differently. >> Emre Kiciman: Yeah. >> The other thing is, I think in psychology, right, I mean, that will be important, because people's perception of the world, of events, you know, is very important for their psychological state. A certain event happens, some people might just move on very easily, whereas others tend to be very affected by it. >> Emre Kiciman: So we've done some very basic things to look into that. What we've done is we've started clustering the outcomes based on the user IDs of whoever mentioned them. And that does give us a split that, say, takes the recreational drug users' outcomes and separates them from the medicinal drug users' outcomes. There is that type of thing. Now, more generally, there are methods in the econometrics literature on how to calculate heterogeneous treatment effects, where you more directly look at what prior features seem to be important when actually calculating the effect of the treatment. We haven't done that yet. Yeah. >> Just a comment that I think Twitter's a really interesting source, but it might be interesting as well to run this over medical data, like patient discharge records, which are very fact-based and talk about what state the patient is in and subsequent outcomes. That might be very -- yeah. I think because that type of English is typically just all about facts, where on Twitter it seems like there might be some more noise interjected into it. >> Emre Kiciman: Yes, I agree with you. That would be pretty cool. Mm-hmm. >> You were mentioning that you classified the data on the ID, so some metadata that you got from the tweeters, but sometimes the metadata in Twitter is very poor, very badly maintained. So what did you really use as metadata for classification? >> Emre Kiciman: Which classification? You mean the -- >> Divided by ID? >> Emre Kiciman: Oh, sorry. Yeah. I did not mean to mention that. I didn't mean to imply what I think I must have said. No. So I'm looking for the slide where I think that might have come across. Oh, well. So what we did was -- what we did was we did not look at any of the metadata from Twitter. We only looked at the language. I did at one point -- I couldn't find the slide just now. We aggregated by user ID, but we did not otherwise do anything else. So the fact that the user might have had a particular home page, or have a certain number of followers, or say their location is somewhere, or imply gender by their name was not taken into account.
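The distinction drawn above between an effects-of-cause analysis (fix a treatment, iterate over candidate outcomes Y) and a causes-of-effect analysis (fix an outcome, iterate over candidate precursors X) can be illustrated with a small counting sketch over user timelines. This is not the speaker's pipeline: it omits the propensity-score stratification discussed elsewhere in the talk, and the type and function names are invented for the example.

```python
# Illustrative sketch only: plain counting over hypothetical user timelines,
# with no propensity-score adjustment for confounders (which the real
# analysis relies on). All names here are invented for the example.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Timeline:
    tokens_before: set   # words/events a user mentioned earlier
    tokens_after: set    # words/events the same user mentioned later

def _rate(group, token):
    # Fraction of users in `group` who mention `token` in their later messages.
    return sum(token in t.tokens_after for t in group) / max(len(group), 1)

def effects_of_cause(timelines, treatment, top_k=5):
    """Fix a treatment X; rank candidate outcomes Y by how much more often
    they appear afterwards for treated users than for untreated users."""
    treated = [t for t in timelines if treatment in t.tokens_before]
    control = [t for t in timelines if treatment not in t.tokens_before]
    candidates = Counter(tok for t in treated for tok in t.tokens_after)
    scored = {y: _rate(treated, y) - _rate(control, y) for y in candidates}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

def causes_of_effect(timelines, outcome, top_k=5):
    """Fix an outcome Y; rank candidate precursors X by how much mentioning X
    earlier raises the later rate of Y."""
    candidates = Counter(tok for t in timelines for tok in t.tokens_before)
    scored = {}
    for x in candidates:
        with_x = [t for t in timelines if x in t.tokens_before]
        without_x = [t for t in timelines if x not in t.tokens_before]
        scored[x] = _rate(with_x, outcome) - _rate(without_x, outcome)
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

In the effects-of-cause direction the treatment is fixed and candidate outcomes are ranked; in the causes-of-effect direction the outcome (for example, posting in a suicide ideation forum) is fixed and candidate precursor words are ranked, which is what the graphs mentioned above show.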
Yes? >> [Indiscernible]. I'm wondering if you put any of the pairs that you ranked low before the Turkers. >> Emre Kiciman: We didn't. >> To see how they ranked those. Because, I mean, some of what you're getting is interesting implicit data that people may not have intuitions about. >> Emre Kiciman: Yeah. I would -- I mean, so -- I'm pulling this up. So this is the cumulative figure. But in the non-cumulative, I said that at these lower ranks, just looking at these last results, you're around the 35 to 40 percent precision range. And having looked at what gets ranked down at, like, the 50th result for a specific scenario -- I treat that as a control -- those start to look pretty bad. >> And then the three judges -- yeses? You had two out of three and three out of three were -- >> Emre Kiciman: Actually, we averaged. So we asked them on -- I believe it was a four-point scale. We said that, you know, not relevant at all was a zero and completely relevant was a one, and then we had probably relevant and probably not relevant in between. >> You had three judges and you averaged their scores. >> Emre Kiciman: Yes. And then, across our 39 experiments, we have the top five outcomes for each of those 39, and that's what makes up the distribution in this box right here. Yeah. >> [Indiscernible] that you had on there? >> Emre Kiciman: Not off the top of my head. Sorry. Yeah. Yes? >> Okay. I remember [indiscernible] talked about a control group. So it seems that now, for the control group, you use everyone who is not having the treatment. Have you tried to look at a control group that looks like the people who have the treatment but [indiscernible] treatment? >> Emre Kiciman: So the -- let me come back to here. When we do the stratified analysis, if people look like the treatment group, then what will happen is that we won't do a good job separating the treatment from the control population when we do the stratification, and you'll get a lot of common support. A lot of your strata will have enough people on both sides to do a good statistical comparison. If we do a poor job selecting the control users and they separate easily -- so it's very easy to tell someone who is going to take Xanax versus someone who is not; say we have some systematic issue with the sampling, for example -- then what will happen is that all of these green dots will come up to the high strata and all the red dots will be pushed down to the low strata, and then we won't actually have any comparison here. So when we have a control group that's partially comparable and partially not, what happens is the non-comparable part of the control group goes way down to the bottom -- it's very easy to separate them out -- and then we have some support in the middle where people are similar. >> [Indiscernible] maybe the people in the control group are so different from the people who are in treatment. >> Emre Kiciman: Yeah. >> So therefore, the difference is not only about whether they have the treatment or not but about a lot of other stuff, like the [indiscernible] factors you mentioned. >> Emre Kiciman: So when they are very different, then they get put in -- all of the control group will get put into a very low stratum, because it's very easy to predict that they're not going to take the treatment, and then there's no one left in the treatment group to compare them against. So we don't get statistical [indiscernible] in that stratum, and we can tell that those people shouldn't be included in our comparison. So that's something -- you bring up an important detail.
When we calculate the average treatment effect, we only calculate it over the region of common support, where we have enough control and treatment users in a stratum to give us a good result. If we don't get that common support, then we basically end up ignoring those users, both treatment and control, who are in a stratum that doesn't have enough of the opposite group. Okay. Thank you very much. [Applause].
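As a rough sketch of the calculation just described -- computing the average treatment effect only over strata that have enough treated and control users -- something like the following could work. The function name, the ten-strata binning, and the per-side threshold are assumptions for illustration, not the speaker's actual implementation or settings.

```python
# A minimal sketch of a stratified average-treatment-effect calculation that
# only uses the region of common support. Propensity scores are assumed to
# come from a separately trained classifier; the 10-strata binning and the
# 50-users-per-side threshold are illustrative, not the speaker's settings.
from statistics import mean

def stratified_ate(users, n_strata=10, min_per_side=50):
    """users: iterable of (propensity, treated, outcome) tuples, where
    propensity is in [0, 1] and treated/outcome are 0 or 1."""
    strata = [[] for _ in range(n_strata)]
    for propensity, treated, outcome in users:
        idx = min(int(propensity * n_strata), n_strata - 1)
        strata[idx].append((treated, outcome))

    effects, weights = [], []
    for bucket in strata:
        treated_outcomes = [o for t, o in bucket if t == 1]
        control_outcomes = [o for t, o in bucket if t == 0]
        # Only strata with enough users on BOTH sides count as common support;
        # users in the other strata are ignored, as described in the talk.
        if len(treated_outcomes) < min_per_side or len(control_outcomes) < min_per_side:
            continue
        effects.append(mean(treated_outcomes) - mean(control_outcomes))
        weights.append(len(bucket))

    if not effects:
        return None  # no region of common support at all
    # Weight each stratum's effect by how many users fall into it.
    return sum(e * w for e, w in zip(effects, weights)) / sum(weights)
```

Weighting each stratum by its total population is one common choice; weighting by the number of treated users instead would give the average effect on the treated.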