>> Michael Gamon: Okay. Hello, everybody. Welcome to what's now the 37th
of these symposiums. So it's been a while that we've been doing this. We
have the usual format. So we have two talks. One from the U-dub side. One
from the Microsoft side. And just a little bit of logistics. Drinks and
coffee are right out there. Restrooms are this way. If you parked on a
visitor parking spot, you should register your car with the receptionist.
And I also want to point out there's a whole bunch of lunchboxes in the
corner there. So the caveat there is they're not ours. They are free for
the taking, but they have been out since lunch. So at your own risk, but the
cheese and other stuff is up to date. And, yeah, we're going to start with
Ryan and Fei is going to introduce Ryan. And then we're going to move on to
the second talk and I'll introduce Emre at that point.
>> Fei Xia: So it's my pleasure to introduce our first speaker, Ryan Georgi.
He graduated from UC Berkeley with his bachelor's degree, and then he joined U-dub, first in the CLMS master's program, and then he continued into the Ph.D. program. So he worked with me, and today he's going to present part of his dissertation work.
>> Ryan Georgi: All right. Thanks. So I'm going to be talking about
automatically -- is it on? So I'm going to be talking about automatically
enriching interlinear glossed text. And before I get to all of what that
means, this work is kind of in cooperation with two other projects at the
university, the RiPLES and Aggregation projects. So here are the team members for those two projects and the links, if you're interested. And we're supported by an NSF grant. So before I get into what IGT is and what it's
used for, just some background information. In our field, we've been
developing a lot of new technologies and expanding language coverage beyond
just English and Arabic and French. But building these new tools often comes
with data requirements. So that's annotated data, unannotated data in large
amounts for unsupervised methods or both for some supervised hybrid systems.
And acquiring this data is neither cheap nor quick. There are some
particularly interesting high-coverage projects out there. So the Crubadan
Project is a project focusing on low-resource languages. It's got 2,124
languages and the indigenous tweet database by the same author. It has 179
languages, indigenous languages from around the world on Twitter. So that's
a very impressive amount of data, even for unannotated sources, because these
languages are ones that aren't typically seen very often. That being said,
we know of at least 7,000-plus languages that are currently spoken. So
that leaves, even with those impressive coverage results, quite a lot of
leeway. So most of these languages are low resource. They don't always have large speaker populations or strategic importance to the western countries
that are funding a lot of this. Some of them have as few as 80 speakers. So
spending a lot of time and effort developing electronic resources might not
be very viable. But still, having these resources could help answer some
large-scale typological questions about humans and our languages. So I'm going to take a look at how we can approach some of these 7,000 languages programmatically with some pretty interesting resources. So one common way of leveraging resource-rich languages to create resources for resource-poor ones is by projection, where you take annotation on one language, align it with another, and project the results, mapping them one to one.
The first problem that presents is how do you achieve high-quality word
alignment? Then how do you deal with unaligned words? And finally, how do
you deal, if the two languages diverge from one another in how they represent
meaning, even if your alignment says that these words are roughly equivalent?
So I'm going to take a look at ways to address all of these. So what is
interlinear glossed text? For a lot of linguists, it's going to be a pretty
common sight. It's typically used to give examples in linguistic papers. So
this is an excerpt from a dissertation on non-macro role core arguments. I
don't know what that is. But the German example here is nice and clean. So
just to take a look at that a little bit more closely, IGT instances
typically have a few interesting properties. They have three lines, so the
language, gloss, and translation. The language line is going to be whatever
the native language is. The gloss is this interesting hybrid English and
morphosyntactic annotation. And then the translation is typically a natural
sentence in typically English. So one thing we noticed is that the words on
the gloss line and the translation line are often mirrored. So we have
Peter, we have children. They're not always in the same place because the
native -- the language we're annotating might actually present it in
different order. But if we look at the way the words occur on each line, we
can infer word alignment. Then, we can subsequently use this alignment to project annotation. So if we part of speech tag the translation line, we can use that inferred alignment. Then, since the gloss happens to match up one to one with the language that it's annotating, we can just follow the alignment and assign those part of speech tags to the language line without having any annotation directly on the language to begin with.
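In rough code terms, the projection step just described amounts to something like the following sketch; the token lists, the toy alignment, and the tags here are hypothetical stand-ins for illustration, not INTENT's actual implementation.

    # Minimal sketch of POS-tag projection through an IGT instance (illustrative only).
    def project_pos(lang_tokens, trans_tags, gloss_to_trans):
        """Copy POS tags from the tagged translation line onto the language line.

        gloss_to_trans maps gloss-token index -> translation-token index, inferred
        from word matches between the gloss and translation lines. Because the
        gloss line is token-for-token parallel with the language line, a gloss
        index doubles as a language-line index.
        """
        lang_tags = ["UNK"] * len(lang_tokens)      # unaligned words stay untagged
        for g_idx, t_idx in gloss_to_trans.items():
            lang_tags[g_idx] = trans_tags[t_idx]    # follow the gloss->translation link
        return lang_tags

    # Toy instance, heavily simplified from the German example above:
    lang_tokens = ["Peter", "die", "Kinder"]
    trans_tags  = ["NOUN", "DET", "NOUN"]           # tags on "Peter the children"
    alignment   = {0: 0, 1: 1, 2: 2}                # gloss idx -> translation idx
    print(project_pos(lang_tokens, trans_tags, alignment))   # ['NOUN', 'DET', 'NOUN']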
Additionally, I talked about the morphosyntactic information there. Here, it's presented in the form of grams. In this case, they're case marking or, well, case annotation, since the morphology is a little embedded and complex there. But that's an interesting little bit of annotation that IGT provides as well. So
using IGT, we have a really fascinating resource that Will and Fei have been
working on called the online database of interlinear text or ODIN. It
consists currently of 3000 PDF documents that have been automatically crawled
from the web. 200,000 IGT instances and that covers 1,400 languages. And
right now, we're actually in the process of expanding -- oh, and we have
1.3 million new documents. Those were retrieved with what we call a high-recall, low-precision method. So not all of them necessarily have IGT in them, but
even if we're talking about five percent of those being valid IGT instances,
we're talking at least one order of magnitude in the number of instances and
who knows how many languages that will add to the database. So actually
using IGT, we start with word alignment. I talked a little bit about how we can do that by following the repeated words. And so ODIN has high coverage by virtue
of having all those languages, but each language doesn't have necessarily
that many instances. So most languages in ODIN have fewer than a hundred
instances in the given language. That's not really enough for typical
statistical alignment methods. If you just try doing alignment on a hundred
sentences or so that we have there, we found that in our results, they give us an F score of only about .55 to .56. So instead, we kind of look at leveraging the gloss line like I described earlier. So I have three IGT instances here.
First is Oriya, a language spoken in southern and eastern India. Got
Turkish, which is of course spoken in Turkey, other parts of Europe and
central Asia. And Yukaghir, which -- or Yukaghir, which is an endangered
language in eastern Russia. And this only has about 70 to 80 living
speakers. So you'll notice that we have the gram here for third person
singular. In Oriya, it's unclear whether that's some kind of clitic or maybe
just a pronoun, some kind of agreement marker. In the other languages, it
looks a little bit more clear that it's some kind of inflection and agreement
on the verb. In any case, we're seeing that same token throughout these very
disparate languages. And in addition to the grams, we have all these English
words scattered throughout the ODIN database on all these different
languages. So while a hundred parallel sentences might not seem like much
for a particular language, 200,000 instances that share this kind of pseudo
language of English words and grams that we see in repetition throughout the
different languages, that's something you can get some traction on. So if we
use the data for all the languages in ODIN, that can actually benefit the
alignment for each language since we don't actually need to use the ODIN
instances for the language we're looking at in order to have a database or
have parallel sentences between this gloss line pseudo language and the
translation line. So just to compare some of our results from different
alignment methods, the first method is just statistical alignment. If you
just run GIZA++ on the English and the foreign language for each one of these
languages and its IGT instances in the ODIN database, the results are, as you might expect, pretty bad, between .5 and .6. If you do the -- take all of the gloss lines and all the translation lines from the ODIN
database and throw those in, you get quite a big improvement as you might
expect from having so much more data. The third method, the statistical plus
heuristic there, that is a case where you take the heuristic method, which is
where I was talking about the translation line and gloss line having those
string matches, if you take all the words that match and throw them into the
aligner as their own parallel sentences, you get a little bit of a boost, but
not much. And then the heuristic method actually works really well. The
reason is the recall is kind of a wash for most of the methods but it's
just -- heuristic is so much more precise given that it really zeros in on
those shared words between the languages or between the gloss and
translation. So using the word alignment there, that's really handy, but
we'd like to get some more information, like part of speech tags for these
languages. Even when we have that high quality word alignment, the
projection might still be problematic. So this is actually a case where even
with our heuristic method, we don't get a whole lot of traction. This is
from Chintang, which is an endangered language spoken in Tibet. And it just
happens that [indiscernible] the folks there have a wonderful team where
they've gathered thousands of instances of IGT for Chintang, so it's a really
great resource we took a look at. The problem with this instance here, you
might see, is that despite the gloss being a very helpful and felicitous transliteration of the Chintang data, it really has no overlap whatsoever with the translation. So trying to do our heuristic approach simply won't
work. And that's a problem if we're trying to project annotation. So if we
use projection alone, we found that our part of speech accuracy was only
12 percent. And that's because the vast, vast majority of the tokens were
simply unaligned and had no -- we had no ability to recover a part of speech
tag. So it's not that bad for all IGT instances, but there's still
room for improvement. Just to see the results in a couple other languages,
it ranges from just one percent in French to 25 percent in Bulgarian. These
were just some small instances we used for evaluation. But if we worked directly on the gloss line, using some of the information provided there, and
bypassed the alignment problems entirely, that can boost our accuracy quite a
bit. So to give some information about how we can kind of ignore the alignment altogether, we've got two more instances here. This is Gwi. I have no idea
how to pronounce that either. But it's an endangered language of Botswana
that has about 2,000 to 4,000 speakers. And then an instance from Yaqui. That's
a [indiscernible] taken language. About 16,000 speakers in the Sonoran
Desert, ranging from Mexico to Arizona. And again, we see that the 1SG gram, the first person singular, is present in all these cases, but it's not
always the same meaning. So in the first instances, it's inflection on a
pronoun. In Yaqui, it seems to be inflection on a verb. And if you're
really [indiscernible] you'll actually notice that in the Gwi example, we
actually see the nominative and the imperative there, so there's actually inflection
on the pronoun that determines the mood of the clause, which is just a
fascinating little aside there. But the takeaway is that the gram is a
really helpful indicator of what the part of speech tag might be but it
doesn't map one to one. If you see first person singular, you don't know for
certain what type of -- what the word class is. So instead, if we use those
as features on a classifier, we can train it by starting off with labeling a
small set of instances with their part of speech tag. We're going to use
that as the target label for a classifier. So we start with a couple
features. First is the most common tag for English words in the gloss. So if we have 'a place', 'precipice', 'do with', we just add those most common tags as features for the label. The non-English tokens, the grams, you can just use as is, as binary features, their presence or absence. And then we take all those together and use them to train the classifier.
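As a hedged sketch of that feature extraction, one way it could look is below; the tag dictionary, the gram list, and the feature names are assumptions for illustration, not the actual INTENT feature set.

    # Toy lookups standing in for resources the real system would derive from data.
    ENGLISH_MOST_COMMON_TAG = {"place": "NOUN", "precipice": "NOUN", "do": "VERB", "with": "ADP"}
    KNOWN_GRAMS = {"3SG", "1SG", "NOM", "IMP", "PERF"}

    def gloss_token_features(gloss_token):
        """Turn one gloss-line token (e.g. 'do-3SG') into classifier features."""
        feats = {}
        for piece in gloss_token.replace(".", "-").split("-"):
            if piece.upper() in KNOWN_GRAMS:
                feats["gram=" + piece.upper()] = 1.0          # gram presence, binary
            elif piece.lower() in ENGLISH_MOST_COMMON_TAG:
                tag = ENGLISH_MOST_COMMON_TAG[piece.lower()]
                feats["eng_common_tag=" + tag] = 1.0          # most common English tag
        return feats

    # Feature dicts like this, paired with manually assigned POS labels, can be fed
    # to any off-the-shelf classifier.
    print(gloss_token_features("do-3SG"))   # {'eng_common_tag=VERB': 1.0, 'gram=3SG': 1.0}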
Now, when we run the classifier directly on the gloss line versus projecting from the translation to the gloss line, we get a huge boost in accuracy for part of speech tagging. Largely, that's due to the fact that we don't have to worry about unaligned tokens anymore. We can actually just take a guess at the word with our classifier and the features that it fires. So the next step is if we
have a classifier that we trained with some manually labeled instances, that
is great, that's a couple hours' work. We got some good results. But what if we use more of the ODIN data? So projection isn't particularly great. It's got
lots of gaps, as we discussed. But it is really precise. So if we project
part of speech tags over the 200,000 ODIN instances, and compare that to just
the manual labeling on 143 instances, what would that look like?
Furthermore, if we transfer the gloss line part of speech tags to the
language line, and you use that to train a monolingual classifier or
monolingual tagger, how does that compare? So some results for this, and
just to step through them, the first thing of note here is that this system
here, the manually trained classifier was based on gloss lines that had been
annotated for some languages. For Indonesian and Swedish, there were actually no Indonesian or Swedish instances used at all. So this is a result that is
using no data other than IGT instances for these languages. The ODIN-based
classifier here, these are all results using the projected part of speech
tags to train the classifier. And it actually performs better than the
manual classifier and there's no language specific information from human
intervention whatsoever involved in this system. And then finally, a
supervised system's always going to outperform these. These aren't the most
stellar results. It's going to get somewhere around 90 percent. But actually, even using a supervised system that uses about a thousand sentences, which is roughly in keeping with the amount of IGT data
that's going to be available, the difference isn't quite so striking between
the systems here and the supervised system. And most importantly, all of
these systems here, the ODIN-based systems, can be trained for any language
in the database and require no manual intervention. Whereas the supervised
systems actually need to have someone sit down and create that annotated
data. So finally, we also did some experiments in parsing. We did some
dependency parsing and phrase structure parsing. We can project those structures just as we did with the part of speech tags, with some clever tree manipulation. And the part of speech tagging that we performed earlier can
actually help with those parsers when we have some part of speech tag data.
A lot of the dependency parsing that's done in this area is typically
unsupervised and if you assume that you don't have part of speech tags, it
makes the whole process much, much harder. So one other thing that we did
with the parsing, in Georgi et al. 2014, we actually used dependency trees.
We used languages that we had some dependency trees for, and just a few.
Going over those dependency trees and projections, we actually learned how those
trees diverged. So the projections would come up with one answer based
strongly on the English interpretation and the English parse, then comparing
those to the trees that we had for the other language, we could actually see
some common patterns. Then if we actually applied corrections to account for
these divergent patterns, we could improve the accuracy from 55 percent to
75 percent. So some -- just an example of some patterns. Sometimes we'd see
multiple alignments in a language. And knowing the general attachment
headedness for the language would be helpful. If we knew whether it was left attaching or right attaching from the trees, we could figure out an approximate solution for which word in the multiple alignment should take precedence. The other case was with swapped words: we found, in particular in the Hindi trees -- Hindi is a postpositional language -- that we would often see the English prepositions being made the head of the dependency parse, and we found in the Hindi trees that we actually want those switched. So just a few instances were enough to get us to learn the divergence patterns.
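As one concrete, hedged illustration of that last correction, a head swap for adpositions in a postpositional language could be sketched as follows; the head-index encoding, the POS labels, and the rule itself are simplifying assumptions, not Georgi et al.'s actual code.

    def swap_adposition_heads(heads, tags):
        """heads[i] is the index of token i's head (-1 for the root); tags[i] its POS."""
        new_heads = list(heads)
        for dep, head in enumerate(heads):
            if head >= 0 and tags[head] == "ADP" and tags[dep] == "NOUN":
                # Let the noun take over the adposition's attachment point,
                # and attach the adposition under the noun instead.
                new_heads[dep] = heads[head]
                new_heads[head] = dep
        return new_heads

    # Toy projected parse of a two-token postpositional phrase:
    tags  = ["NOUN", "ADP"]     # token 0 = noun, token 1 = postposition
    heads = [1, -1]             # English-style projection: noun attached under the ADP root
    print(swap_adposition_heads(heads, tags))   # [-1, 0]: the noun becomes head, the ADP its child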
So, finally, what can we do with all this? So projecting the part of speech tags that we arrive at can be used for some
interesting typological questions. So basically word order, the morpheme
order, and whether or not we have definite or indefinite determiners. And
Bender et al., Emily's group, looked at determining case type using the gloss line and the part of speech tags on there. So whether a language was ergative-absolutive or nominative-accusative, that's a question that you can
answer by trying to figure out the case markings on the gloss line. Finally,
projection is a common solution. Just to wrap up, we looked at how we
addressed getting the word alignments using the gloss line and the various
heuristics and methods we could use to improve alignment there. Then as for
the unaligned words, just skipping over those entirely and focusing on the
gloss line to train a classifier can kind of obviate that problem entirely.
And then for language divergence, both using the classifier to look at the
gloss line itself and not assuming that it's going to have the same part of
speech tag as the English translation, but also learning those divergence
patterns from the comparison between the trees. So finally, the actual
INTENT package itself is at the core of all this. And that stands for the
interlinear text enrichment toolkit. So that's the software that's been built to automatically do this enrichment. And it projects the dependency
and phrase structures. And currently, we're about to release the 2.1 version
of the ODIN database with -- not yet with the expanded coverage, but with the
current 1,400 languages and in addition, INTENT is going to be used to do the
part of speech tagging, dependency structure and phrase structure projection,
and word alignment and provide those enriched files in a 2.1 release. And
it's going to be using the XIGT formalism, which is an XML version or an XML
serialization of IGT data that makes it very nice and easy to add annotation
tiers to IGT instances as well as alternative tiers. And so there is a basic
web interface too for INTENT that you can play with to create your IGT
instances and see how they work and just to show a quick example of that, so
this is just a quick little example of Chinese -- I'm told that this is a
nice ambiguous sentence in Chinese between Zhangsan does not like anyone and
no one likes Zhangsan. So if you go ahead and run this through INTENT, just
it will create the XIGT file for you. It's a little intimidating looking,
but we're working on a tool to process that and make it easier for human annotators. And then if you -- this just gives you the XIGT output. Then
running it through the enrichment, you can also see we get the part of speech
tags back. And I think in this case it was actually Zhangsan who/whom all
not like. So noun, who/whom, pronoun, all, debatably a determiner, not,
adverbial, and then like is -- we realize that like is probably seen more often as a metaphorical adposition than it is with warm and fuzzies,
given that our database is largely going to be trained on the Wall Street
Journal type stuff. So yeah. You can play around with this. The results
aren't always going to be the highest quality. But you know, we can use this
for any language in ODIN. So we're going to be working on that in the coming
months. So any -- I think that's it? Any questions?
[Applause]
>> Ryan Georgi: Yeah?
>> So, have you considered typological constraints in these projections?
Like the thing that jumps out to me about the example you just gave is that
there's no verb in that sentence. So there should be a verb.
>> Ryan Georgi: In the Chinese or --
>> In the Chinese.
>> Ryan Georgi: Okay.
>> So could you use that typological constraint as a way to do better part
of speech projections [indiscernible] between like and like?
>> Ryan Georgi: So using the typological constraint, you mean feeding it in
the information that there's -- that that's not -- for Chinese, that that's
not expressed as a verb, you mean? Or --
>> [Indiscernible] that it's a sentence, we hope to see a verb.
>> Right.
>> Ryan Georgi: [Indiscernible].
>> Oh, okay.
>> Ryan Georgi: Yeah, yeah, I gotcha.
>> That is a verb.
>> Ryan Georgi: Yeah. No. And actually, that's a good question. One of
the problems with IGT is that it's not always full sentences. So a lot of -- part of the reason that this is as kind of messy-seeming as it is, and I didn't
get into it too much, is that the ODIN database is from PDF documents and
this nice neat XML format that we have is something that we developed to
store it in but the PDF documents themselves, we actually have to extract it
from using PDF-to-text conversion. So there are all sorts of interesting
corruptions from the PDF to text conversion but then on top of that, IGT is
also occasionally sentence fragments or sometimes just single words. So it's
not always easy to treat it as a stand alone sentence. Although that's a
good idea, that maybe defaulting to that would be a good thing to try because
probably, most of the instances are going to be whole sentences. Yeah.
>> Thinking further about this error, because it's interesting, in this
case, you've done the part of speech [indiscernible]? Is that true?
>> Ryan Georgi: Yeah.
>> But you have a really nice alignment. Presumably, if you look at the
source language [indiscernible] the part of speech [indiscernible] -- sorry,
the tags for the [indiscernible] translation line did not treat that like it
was in that position.
>> Ryan Georgi: Yeah. Actually, I think we looked at this the other day and
it was -- if I can see it here. Because I think it was the case that the
English language tagger -- okay. So this is -- this is it here.
>> [Indiscernible].
>> We're not seeing it.
>> Ryan Georgi: Oh. Maybe it actually did --
>> [Indiscernible]. Yeah?
>> Ryan Georgi: Oh, it does. Does not like -- yeah. Because -- and this
was -- this is where we realized that, oh, wait, where was our part of speech tagger for English trained? The Penn Treebank, which is Wall Street Journal. So
again, they're probably not talking about like, you know, McDonald's warm
fuzzies for Burger King. They're probably talking about some, you know,
metaphorical comparison. But --
>> So in general, could you combine the gloss-based part of speech tags with
projected ones in the case that you have alignments from that [indiscernible]
the heuristic alignment?
>> Ryan Georgi: So like a back-off, like start --
>> Back-off for somebody to combine those [indiscernible].
>> Ryan Georgi: Yeah. So there are actually two things that I didn't talk
about directly in here that I tried, and I didn't put them because the
results ended up not being very promising, but the first is using projection and then filling in the unaligned tokens with various methods. One being just
let's call everything a noun and see how that does. The other was actually
running the classifier on those unprojected tokens. That was one way of
combining them. The other way was actually using the classifier and then
feeding into it as a feature what the part of speech tag that was assigned to
the word aligned with it, if it exists, is. And so that actually didn't -- that tended to hurt more than it helped, interestingly. I think the feature
of just, if you have an English word, what's its most common tag is so
strongly predictive that it actually didn't help above that. That being
said, the supervision that we -- or the evaluation corpus that we have for
this is still pretty small. So a lot of the results and the conclusions
we're drawing are still kind of preliminary. So -- yeah.
>> Given that there's a large number of pairs of languages, could you
explain how the heuristics work for the [indiscernible]?
>> Ryan Georgi: So the heuristics that we worked with, and I think
[indiscernible] said there's a huge number of pairs of languages, right?
Yeah. So I think in our database, there are some instances where the gloss
and translation are German or non-English, but the vast, vast, vast majority,
we just -- it's so overwhelmingly English that we just assumed that one
of the pair is English. But that being said, what the heuristics entail is
typically we'll break apart the morphemes, compare those individually. We'll
do some stemming just to see if there's running and ran or some, you know,
alternate version of the word. We'll also compare the gram -- if we have, like, a 3SG standing all by itself, we'll see if there's a he or she or it floating around in the English. Pretty simple things, but putting them all
together, we get a pretty good alignment.
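A rough sketch of the matching heuristics he just listed (morpheme splitting, crude stemming, gram-to-pronoun expansion) is given below; the stemmer, the gram table, and the exact matching rules are simplifications assumed here, not the actual INTENT implementation.

    GRAM_EXPANSIONS = {"3SG": {"he", "she", "it"}, "1SG": {"i"}, "2SG": {"you"}}

    def crude_stem(word):
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def heuristic_align(gloss_tokens, trans_tokens):
        """Return (gloss_index, translation_index) pairs for matching words."""
        pairs = []
        stems = [crude_stem(t.lower()) for t in trans_tokens]
        for gi, gloss in enumerate(gloss_tokens):
            # Break the gloss token into morphemes/grams, e.g. "run-3SG" -> run, 3SG
            for piece in gloss.replace(".", "-").split("-"):
                targets = GRAM_EXPANSIONS.get(piece.upper(), {crude_stem(piece.lower())})
                for ti, (word, stem) in enumerate(zip(trans_tokens, stems)):
                    if word.lower() in targets or stem in targets:
                        pairs.append((gi, ti))
        return pairs

    # [(0, 1), (0, 0), (1, 2)]: run<->runs, 3SG<->she, fast<->fast
    print(heuristic_align(["run-3SG", "fast"], ["she", "runs", "fast"]))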
>> I was wondering if you could also get a lot of insights from the types of
errors that are being made or cases where, you know, the classifier, for
example, has low confidence. I think that would indicate that, you know,
you're dealing with a phenomenon that's either like [indiscernible] or that
might be worth more human investigation.
>> Ryan Georgi: That's a good point, yeah. I hadn't looked at that. And
actually, in finding instances to kind of talk about that case with the
imperative inflection on a pronoun actually brought to mind like I really
wonder what the classifier thinks of this because that's a really bizarre
thing to see on here. I have a suspicion that things like imperative or
perfect or anything like that is going to be so strongly weighted for a verb
that it might [indiscernible]. Yeah, it would be interesting to see the
confidence scores on that to see like, okay, what's happening here?
[Applause]
>> Michael Gamon: So the second talk is by Emre Kiciman and Matt Richardson.
Emre is going to present this. This is work that actually was presented at
[indiscernible] this year, right?
>> Emre Kiciman: It's in submission at dub, dub, dub.
>> Michael Gamon: Oh, it's in submission.
>> Emre Kiciman: Oh. And it was presented at KDD.
>> Michael Gamon: KDD, okay.
>> Emre Kiciman: [Indiscernible].
>> Michael Gamon: So there's also -- there's a part here that has been presented, but there's also a part that is entirely new. And to introduce Emre,
I mean, I have the good fortune to be able to work with Emre on a number of
projects involving social media, and that's an area that he's very interested
in. And the talk today is about how to actually discover outcomes from social
media, from the language that we find there.
>> Emre Kiciman: Thanks very much, Michael. Hi, my name is Emre, Emre
Kiciman. And I want to talk to you today about some work that we've been
doing here. Like Michael said, with Matt Richardson, we started this project
a while ago and we had a paper out about some of the techniques that I'll
mention today at KDD over the summer. And the rest of the work is in
submission at dub dub dub right now. And we did that work over the summer
with Alexandra Olteanu from EPFL and Onur Varol from Indiana University while
they were interning here at MSR over the summer. So what this project is
about. Why do we care about what happens? Every day, we all decide
something. We all decide to take some action. We're in some situation and
we want to get out of it. We just want to know what's going to happen. And
so we start off by saying, okay, I'm going to pick the thing to do next that
is going to be best for me. Whatever my goal is. Whatever I personally
want. And this works for some people. Some people say, okay, should I do
this? They think it through. They have a good kind of knowledge about the
world around them and about the future and then, you know, they pick the
right thing to do, turns out great. Not all of us, though, do so well. Some
of us, you know, occasionally can't actually predict what's going to happen.
We don't really understand the situation very well and we take an action that
goes off in the wrong direction and doesn't work out for us. So what this
project is about is about saying, well, you know, we actually have a lot of
opportunity to learn about what happens in people's life. What happens after
people take actions or what happens after people find themselves in certain
situations. There's hundreds of millions of people who are regularly and
publicly posting about their personal experiences often to the chagrin of the
people who follow them and listen to them on Facebook and Twitter. But there
is an incredible amount of detail there. And so the question is can we
aggregate this and can we learn about what happens in different situations
and bring that back into how people make better decisions on their own in
their own lives. So our broad goal is to build a system that can analyze
really the humongous amount of information here and let us answer open-domain
questions about the outcomes of actions and situations. Now, like I
mentioned, we did some of this work before with Matt Richardson that -- and a
lot of that work focused on how to go from the social media messages to
timelines of events that we would then analyze. This presentation is going
to be more about how to apply propensity score matching analysis to analyze
those timelines that are based on social media posts and then evaluate the
performance of that analysis across a wide variety of domains. So let me go
into kind of the process that we're doing. First we start off with a number
of social media messages and then we're going to extract out some
representation of what's happening in people's lives based on these texts.
And given some action that we care about, we are going to separate these
timelines into two groups. What we'll call a treatment group that we believe
actually -- where people experienced the action or the situation that's of
interest to us and a control group where people did not experience the
situation. Once we have these two groups of users and their timeline, we're
going to learn a propensity score estimator. So we're going to figure out
basically we're going to learn a function that estimates the probability of
someone having this treatment given everything that happened before in their
lives. And then we're going to stratify the users based on this likelihood.
Now, I'll go into a little bit more detail about what happens when we
stratify these users and why we do it. But then once we have this, we'll go
ahead and calculate for every outcome that we see happening after this event.
We're going to go ahead and iterate, and we're going to calculate the difference
basically between what's the likelihood of someone experiencing this event
given that they have the treatment versus the likelihood that people
experienced this event given that they didn't have the treatment. We're
going to calculate that for each strata that we see and then sum that up for
the population. So how do we build out these individual timelines for the
things that we're talking about here, the experiments I'm going to talk to
you about here? We used English language Twitter data from the firehose,
aggregated by user ID. We cleaned these tweets to remove URLs, @-mentions, and stop words, and we applied basic stemming and normalization of common slang. So for example, we read the letters. Then we extracted all the unigrams and bigrams that people were mentioning as events in someone's life, and we placed them on a timeline according to the metadata of the tweet. We identified treatment basically by finding users who mention a target token at any point in their timeline, and the control is the users in our data set who don't mention that target token.
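A simplified sketch of this timeline construction and treatment/control split is below; the cleaning regexes, the stopword list, and the data layout (user ID mapped to timestamped tweets) are assumptions for illustration rather than the pipeline's actual code.

    import re
    from collections import defaultdict

    STOPWORDS = {"the", "a", "an", "and", "of", "to", "i", "my"}   # toy list

    def clean(text):
        text = re.sub(r"https?://\S+", " ", text)      # strip URLs
        text = re.sub(r"@\w+", " ", text)              # strip @-mentions
        return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]

    def events(tokens):
        """Unigrams and bigrams stand in for 'events' on the timeline."""
        return tokens + [" ".join(b) for b in zip(tokens, tokens[1:])]

    def build_timelines(tweets_by_user):
        timelines = defaultdict(list)                  # user_id -> [(time, event), ...]
        for user, tweets in tweets_by_user.items():
            for ts, text in tweets:
                timelines[user].extend((ts, ev) for ev in events(clean(text)))
        return timelines

    def split_treatment(timelines, target):
        treated = {u for u, tl in timelines.items() if any(ev == target for _, ev in tl)}
        return treated, set(timelines) - treated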
In the work at KDD, we actually describe a set of much more sophisticated techniques. We wanted to focus here on the propensity score analysis, so we basically tried to
remove as much computational overhead as possible for our experiments. But
in this -- in our KDD paper, we found that there's a number of things that
were important. We applied experiential tweet classification, so we only considered tweets where people seemed to be talking about something they experienced personally. Rather than just taking unigrams and bigrams, we applied phrase segmentation to find kind of cohesive phrases that
people were using. And then we clustered these phrases and then we also
applied temporal expression resolution. So if someone is talking about last
year or last week I did something, we would shift the events along the
timeline appropriately. Talking about the propensity score analysis, I want
to do in one slide just a quick introduction to propensity score analysis.
How many people here are familiar with counterfactual analysis and propensity
score analysis? Okay. So I was hoping I could ask someone more questions
after the talk, but that's fine. Okay. So we want to measure the outcomes
of some treatment versus no treatment. So this is like a classical social
sciences, science experiment thing. Right? We have a randomized trial or
something and you want to figure out what happens. We don't have a
randomized trial. So what we can do is we can do a thought experiment. So
ideally, we have an individual who had some effect, so had some treatment, so
we're going to say this individual i got some treatment, one, and then we have the outcome, which is this function Y sub-i of one. And we'd like to see, for this individual, what would have happened if they hadn't taken the treatment. So ideally, if Michael, you know, took some action, we'd like to take Michael and observe him in some parallel universe where he's exactly the same, except he didn't take this action. And then we'd look at, basically, the differences in outcomes, and that would be the actual effect of this action.
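In the potential-outcomes notation he is gesturing at here, the individual-level effect and the population-level quantity he moves to next can be written as follows (a standard formulation given for reference, not copied from the slides):

    \tau_i = Y_i(1) - Y_i(0)                           % effect for individual i (never fully observable)
    \mathrm{ATE} = \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)]  % average effect over comparable treated/untreated populations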
But we can't measure Michael in two parallel universes. He either takes the
action or he doesn't take the action. We only get to observe one of these
cases. So instead, what we can do is we would find, say, Michael taking the
action. We would take someone who is very similar to him who doesn't take
the action and we would basically compare the outcomes here. And if Michael
has a twin who is exactly the same as him, then we can estimate what would
have been the effect of this action by comparing Michael to his twin who
didn't take the action. Now, of course, finding someone who is a twin in every important way is really difficult because, you know, kind of technically, I guess you would say we live in a world described by very high-dimensional vectors, and the curse of dimensionality means that there is no one who is really very close to any given individual. So instead,
we back off a little bit further and we take a look at population level
effects. So we're going to say, we're going to estimate what the outcomes
are after people -- after a whole population of people take this action and
we're going to compare that to the estimate of outcomes, the expected value
of outcomes for people who don't take the action. And as long as these
populations are comparable, statistically identical, then this gives you a
good expectation on the actual outcomes of this action being, you know, one
or zero. Now, how do you get these comparable populations? Randomized
trials is one way. When you're working with observational studies, a
different way is to then apply something like a stratified propensity score
analysis and what this does is essentially splits the original population
that you're looking at into multiple comparable subpopulations and each of
these subpopulations now because you're stratifying on a function that's
taking into account all of their features, basically it ends up balancing the
features that were relevant for this action. So in our case, the features of
a user that we're going to be balancing on are going to be all of their past
events which is the unigrams and bigrams of all their messages that they've
mentioned in the past. And just as a, you know, it's worth noting that our
control users don't actually have a past or future. They never had a
treatment event, so we can't tell when we're going to align them on. So we
just pick a random time at which to align them. And so we pick a random time
and then their past events we're going to define everything that happened
before this random time. So that is, you know, of all the possible times
when they didn't take this action, we just pick one. Now, our task to learn
a propensity score estimator is to learn the likelihood that the user
mentions this treatment token. So we basically learn a function, what is the
probability that the next word is going to be this target token that we care
about. Sorry, that they're going to be -- that they're going to be in the treatment or control class given the past events. And in our case, we're
training our estimator with an averaged perceptron learning algorithm, and
we're training this algorithm based on all the timelines that we've
extracted. Now, we use this then to bin people, and we can then just quickly take a look at whether the propensity score estimator function is doing a
good job. There's two things that we care about. One is the populations are
actually matched, the control and treatment populations within a strata
should have similar distributions of features. The other thing that we'd
like to see is that the propensity score is actually estimating things
correctly and this function shows that. And then how do we measure the
outcome in these experiments? We're treating outcomes -- all of the words that people say after they take or don't take the treatment -- as binary values: did they ever, in the future, mention that word Y or not. And we're going to then measure the average treatment effect, summed across all the strata, as the increase in likelihood of them mentioning some event given that they took the treatment versus that they didn't take the treatment.
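A hedged sketch of that stratified average-treatment-effect computation is below. The propensity scores are assumed to have been estimated already (for example by the averaged perceptron over past-event features mentioned earlier); the equal-width binning and the weighting by stratum size are standard choices assumed here, not necessarily the exact ones in the paper.

    def stratified_ate(scores, treated, outcome, n_strata=10):
        """scores[i]: propensity score of user i; treated[i]: 1/0; outcome[i]: 1/0
        (did the user ever mention the outcome word afterwards)."""
        n = len(scores)
        strata = [min(int(s * n_strata), n_strata - 1) for s in scores]
        ate = 0.0
        for k in range(n_strata):
            idx = [i for i in range(n) if strata[i] == k]
            t = [i for i in idx if treated[i] == 1]
            c = [i for i in idx if treated[i] == 0]
            if not t or not c:
                continue                                 # skip strata missing either group
            p_treat = sum(outcome[i] for i in t) / len(t)
            p_ctrl = sum(outcome[i] for i in c) / len(c)
            ate += (len(idx) / n) * (p_treat - p_ctrl)   # weight by stratum size
        return ate

    # Toy call: four users, two strata.
    print(stratified_ate([0.2, 0.3, 0.8, 0.9], [0, 1, 0, 1], [0, 1, 1, 1], n_strata=2))   # 0.5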
Now, propensity score analysis is borrowed from the causal inference literature, but if we actually
wanted to make causal claims, we'd require a fair amount more domain
knowledge. And the reason is there's two -- you know, there's a fair number
of assumptions being made here, but there's two in particular that the
propensity score analysis makes, and causal inference in general makes, that we don't meet. First is that we would have to assume that all the important
confounding variables are included in our analysis. And, you know, even
though people talk about a lot of things on social media, there's no
guarantee that they talk about everything that's important or that everyone
talks about everything that's important and so we don't meet that assumption.
And then there's also this fun assumption that the outcomes of one
particular -- that happen to one individual should be independent of whether
other people take a treatment. And in a network environment, that's
generally not the case. What happens to you depends on what other people are
doing as well. So because of this, we don't say that we're actually pulling
out causal relationships between outcomes and these treatments, but we do
find having said that, we do find that this analysis gives much better
results than simple correlations. So that's kind of fun. So, to evaluate these techniques and see whether or not they did well, we picked 39 specific situations, and these situations were chosen from nine high-level topics.
We chose these nine domains for diversity. They include, you know, within
the business topic, we looked at construction maintenance, people mentioning financial service related stuff, or investing. In health, we looked at several mental issues, diseases, and drugs essentially, and within the category of
societal topics we looked at general societal issues, law and then
relationships. This category is taken from the open directory project so
it's an existing taxonomy that we borrowed. And then within each of these
high-level topics, we chose several specific situations that we basically
picked at random from searches that people are doing to Bing already. So we
want some grounding that we were looking at questions that people cared
about. So that's why we borrowed these from Bing. The actual data we
analyzed was three months of firehose data. And so what we did was we looked
at March 2014. We looked at everyone who expressed taking these actions so
that they mentioned they had high blood pressure or high cholesterol or they
mentioned they were taking lorazepam or Xanax, trying to lose belly fat,
getting divorced, finding true love or cleaning countertop. And then given
that they mentioned this in March 2014, we grabbed all of their tweets from
the prior month, February 2014, and all of their tweets from the month after
in April 2014. Then what we do is we run our analysis. We extract the
outcomes that occur after people take these -- do these -- experience these
situations, and then we also looked at the temporal evolution of how -- when these outcomes occur after people take the treatment. We also got judgments on how good our results were from Mechanical Turk. And then we also compared it to existing knowledge bases. I'll give you some examples
now of the ranked outcomes and the temporal evolution and then go into the
results of our precision judgments across the aggregation of these results.
I won't talk about comparing to existing knowledge bases in this talk. So
for example, someone mentions gout. What are they likely to mention
afterwards? The top-ranked issues -- the phrases, or unigrams and bigrams, that people mentioned -- were basically flare-ups, uric acid, uric more generally I guess, and flare more generally.
Big toe, joints, aged, and then at the bottom, you have bullock and you have
kind of people -- so I'm not sure whether there's a specific UK kind of bias
in this data. But most of these words you see, you can see are related to
gout, which is a disease where uric acid builds up in your body, and causes
pain in various joints. And for each of these, we can take a look at the
lift. How much more likely are you to say this. So this is the absolute
difference in likelihood of saying this so if you were one percent likely to
mention some, say, flare-up, now you'd be 5.1 percent likely given that the
difference is 4.1 percent. And then this is the Z score. So all of these
results are pretty statistically significant. Another example, people who are
trying to lose belly fat. They're more likely to talk about burning, ab
workouts, you know. And then they're more likely to mention videos and play
lists and stuff like that afterwards as well. I'm just going through a
couple of samples. If you have any questions about something specific, let
me know. If you mention triglycerides, you're more likely to talk about your
risk or statin and lowering, I assume, blood pressure, cardiovascular issues,
healthy diet, fatty acids, and things like this. If you mention that you
have a pension or you're saving for retirement, then there's a whole other set of taxes and retirements and budgets and stuff that starts to come out of this.
We can also, as I mentioned, start taking a look at the temporal evolution of
these terms. There's a couple I want to mention. So most things in general,
most of our outcomes we saw roughly cooccurring with the mention of the
target. So occurring on the same day. So this graph is the number of days before and after the treatment on the X axis, and the Y axis is the expected number of tweets, and each of these graphs has a different scale for that. So
I don't label the axis directly. But so, let's see -- there was one I wanted
to find here. So yeah, so for example, which one was I going to go -- oh.
Pain. So tramadol is a painkiller and so people take it. They're mentioning
pain, that's the red line. The treatment line is the people who took
tramadol. And their pain goes up but the same day that they're talking about
taking tramadol and then after they take tramadol, the pain mentions go away.
And then about a week later, they start talking about pain again. I don't
understand tramadol well enough to know why. I don't know if the course of
medication lasts a week or if this is commonly used for certain kinds of
illnesses where pain reoccurs weekly, but this is the type of thing that
we're seeing and, you know, obviously, we'd like to go in with a domain
expert and understand these better. We talk about we see people mentioning
Xanax and weed. So it's important to note that people take Xanax both
medicinally and recreationally. Apparently. And so you see people who are
taking it recreationally then mentioning weed. We have started to do some
work on trying to cluster these outcomes so we can understand there's a
certain class of outcomes that might occur for some people, maybe different
outcomes occur for different people. But this is just, I hope, gives you an
idea of the types of temporal patterns that we see.
>> What's the weird person?
>> Emre Kiciman: Weird person? That's unfortunate. I think that's people
who are using Prozac basically as slang to denigrate someone. So they're
saying he's a weird person or they're on, you know, something like that.
>> I'm just wondering, does negation factor into this at all? Like we have weed or we could have no weed or something like that? It's a [indiscernible]
bigrams or unigrams or, I mean, is that factored in?
>> Emre Kiciman: No, we don't. Yeah. So that's one of the types of things
that would be really great. So for example, we see in a lot of these anxiety
drugs and stress-related stuff that people do something and their stress goes
away and they're saying, oh, I have no stress now or my anxiety is much
better. And so it's actually -- this is simply just people mentioning the
issue and it could be that it's going away but they're mentioning it more or
it's actually not going away and increasing.
>> Just the mention of the event itself, whether it exists or not, doesn't
matter, just --
>> Emre Kiciman: Correct. And it would be great -- that would be the type
of thing that I would want to put into the initial generation of the
timelines. When we're converting from the raw text into this representation
of what people are experiencing, it would be great to say that's not weed
instead of weed. Maybe we could pick a different example. Well, anyway,
doesn't matter. You understand. Great. Yes?
>> So these lines are tracking the outcome mentions?
>> Emre Kiciman: Yes.
>> The boundary between gray and white is where the treatment mention is.
>> Emre Kiciman: Correct. Yes. Sorry, I didn't explain that.
>> So it's a little surprising that we're not seeing more talk about weed
early on, people who are taking both Xanax and weed recreationally.
>> Emre Kiciman: Correct. Yeah. Yes. It's possible that there's a
temporal thing going on that people -- maybe there's some change happening
between the February through April time frame where, I don't know, maybe
people are going to more outdoor raves or something like that getting to the
spring. So it's quite possible we're seeing systematic artifacts like that.
>> And what was the timeline? Like in the last couple years?
>> Emre Kiciman: March 2014 is the time when we found -- when we looked for
the target events and then we grabbed a month before and a month after as
well.
>> I was just wondering if, in that particular case, some of the legal changes around the country had an effect on that.
>> Emre Kiciman: Possible. Spring break could be -- I don't know. There's
all sorts of things that -- yeah, yeah. So this is only three months. It
would be great to start applying this to really longitudinal data sets and I
hope that that would start to factor out some of these things.
Yeah?
>> How confident are you that this data is representative of the larger
population of people who don't freely share their stuff on Twitter?
>> Emre Kiciman: I'm pretty confident it's not.
[Laughter]
>> Emre Kiciman: But more seriously, so there are going to be some domains
where people are more open and more open to discussing what's going on in
their lives. There's also a lot of domains where even if most people aren't
talking about it, the ones who are, are having a representative experience. But
bias is a big issue here. And we don't have a good handle on the bias that
goes into generating the social media timelines. There's some things we
understand. We know people are more likely to talk about extreme events,
less likely to talk about everyday things that, I mean, I think the rule of
thumb is you know, do you feel like your friends would think this is boring?
If so, many fewer people are going to mention it. But no, we don't
understand that bias very well. There's a couple of experiments that would
be nice to get started to understand that better, but it's a really hard one.
Okay. Evaluating correctness. So here, this is where I mentioned we took a
look at mechanical Turk workers to judge the correctness of these results.
And then we calculated based on that precision at a particular rank. And
yeah, as I mentioned, I'm not going to go into the knowledge-base thing
here. So to help our judges understand whether something was correct or not,
right, so one mentions, say, the treatment is dealing with jealousy issues
and the outcome is wake up. What we'll do is we'll show them an example
tweet where someone is mentioning jealousy issues and then we say later on,
the same user, so we find a tweet by the same user, the same user later on
says, I need you to wake up because I'm bored. And now, we'll give each
Turker two to three of these pairs of messages. We'll then generate several
sets of these paired messages for each of these treatment outcome pairs, and
we'll get the Turker to say, you know, is someone who is mentioning X dealing
with jealousy issues later more likely to talk about wake up. So here, I
would have judged this one as wrong. And so -- well, actually I should take
a look and see what our workers said for this specific example. But then you
can see others suffering from depression. You know, if you think depression,
okay, and then later on self-harm is the outcome. This one, maybe. Paying
credit card debts and then talking about apartments later. So someone has
said they paid that credit card bill and then later they talk about checking
for apartments. That one, maybe they're really more likely to talk about
that. So this is the type of task we're giving to our mechanical Turkers.
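For reference, the cumulative precision-at-rank numbers he reports next boil down to a very simple computation over the Turkers' binary judgments; the 0/1 labels in this tiny example are made up, not the actual judgments.

    def precision_at_k(judgments, k):
        """judgments: 0/1 relevance labels for outcomes in ranked order."""
        top = judgments[:k]
        return sum(top) / len(top)

    ranked_judgments = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]    # hypothetical labels
    for k in (5, 10):
        print(k, precision_at_k(ranked_judgments, k))    # 5 -> 0.6, 10 -> 0.4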
And in aggregate, we found that when we sort all of our outcomes by their
rank, so the top five outcomes, top ten outcomes, et cetera, we found that
the best -- the highest ranked outcomes had a precision close to 80 percent.
Where people were just judging this as relevant or not relevant. And then it drops from there. And we see it basically correlated with both our average treatment effect, which is how this is ranked, as well as how the statistical significance ranks things. And it drops down to, cumulatively, about between 50 and 60 percent. Non-cumulatively, you're dropping into like the low 40s to high 30s down at the tail end in terms of the precision.
We do see variance in precision across domains. So some domains are doing
better than others. So some of the law is doing really well. Health is
pretty consistent and doing just over around 65 percent or so precision.
Financial services are deemed pretty good and construction maintenance was
our worst. A lot of this change in precision is actually due to data volume. Not all of our scenarios had the same numbers of users, and as we saw more users in our data sets, the precision tended to increase. I'd be
happy to talk about more things as we get to the end of the slide. I don't
know how much time we have or if I've already gone over, but so a future
direction, there's many missing capabilities in this system. We're looking
at how can we start to reason about not just binary events or binary
experiences but also continuously valued events, looking at frequency-based analyses and also just generally being more aware of time. So right now, we ignore whether something happens immediately or whether it happens a week or two in the future. And that makes a difference in the calculations. And we're also
basically going to reintegrate some of the text parsing that we had done
earlier into this analysis. I'm working with some distributed systems
researchers to build out a system that's capable of doing this analysis at
scale. Right now, these 39, 40 scenarios I mentioned were done by pulling basically a copy of the Twitter data for that particular scenario and running things, you know, offline. We'd like to be able to do that much
faster and at large scale. With several really great collaborators, we're
looking at domain-specific analysis for not just individual level questions
but also some policy issues. What are people's experiences with, say,
bullying or with depression and things like that. And I'm also looking at
application ideas. How can you start to use this data to provide a nice
backing for other analyses or other kind of interfaces. To conclude, we
focused here on demonstrating how stratified propensity score analysis can be
applied to personal experiences, to extract outcomes of specific situations.
We applied this to a diverse set of domains and evaluated how well the
outcomes looked to mechanical Turk judges. And, yeah. That's it. I'd love
to talk to you more about this work and answer any questions you have. Thank
you.
[Applause]
>> One thing Twitter can give you, at least in some cases, is location. And I'm wondering if you've considered whether some of these outcomes might differ by region or by country, and if you looked into that.
>> Emre Kiciman: No. So exact location, you would only have for 1 or
2 percent of tweets, but you're right. We can identify kind of larger
regions, like what city or what state or country people are in for about 60
or 70 percent of tweets. We haven't incorporated that as a feature in our
analysis. At least not explicitly. So if people imply their location
through their text, that might be getting captured here somehow in the
analysis of the text, but we're not taking it into account in any other way.
But you also raise another kind of broader point which is that the outcomes
are likely to vary by individual. And we do see traces of that. So looking
at even just our stratified outcomes, we do have examples where -- let me see
if I can pull one up. Let's see. So this is from a different analysis. We
do have examples where, for example, this is a likelihood of people to
perform suicide ideations based on analysis of not Twitter data but Reddit
data. And we see that if people mention the word depression, their
likelihood of later performing suicide ideation, talking about suicide goes
up, and how much it goes up depends on their likelihood of talking about -- of -- their likelihood of mentioning the word depression. So as the
propensity goes up, you know, if they're not going to likely talk about
depression, then if they do, their likelihood of performing suicide ideation
goes way up. If they are very likely to talk about depression, actually
using the word depression doesn't seem to make a difference. That's one
example. Yes. Go ahead.
>> In the realm of other data that Twitter gives you that would be
interesting to look at, the network effects: if someone has a high
percentage of @-mentions with another user who did mention suicide, are
they more likely to take on mentions of depression, or vice versa, do you
think?
>> Emre Kiciman: Yeah. I don't know very much about the corrections you
have to do to the math and the analysis for analyzing network effects of
treatments. But yes, I mean, there are all sorts of interesting interactions,
a lot of interesting things that you can treat as being an action that you
want to analyze the outcome of. And how you get that into the timeline --
yeah, there are all sorts of ways you could do that. Yes?
>> A couple of questions on the crowdsourcing. You selected the top -- you
said the top three outcomes that you had judged by the crowdsourcing judges.
>> Emre Kiciman: So for every scenario that we had run our analysis for, we
selected the top 60 outcomes. And then for each of those outcomes, we gave
each Mechanical Turk judge at least two paired messages -- so one user's
treatment and outcome, and then a second user's treatment and outcome -- and
then, assuming we had enough data, we would also repeat that with a different
pair of treatment-outcome messages, three times.
>> And then how many judges did you have for each set?
>> Emre Kiciman: I believe we had three, but I'd have to check.
>> So it would be interesting to see, if you exploded that out into 10 or
20, what the distribution of that would be, because it seems like it might
be very subjective whether wake up is associated with -- I forget what the
control was.
>> Emre Kiciman: That's true. Yeah. With jealousy.
>> Jealousy, wake up and smell the coffee, he's not that into you, kind of
thing.
>> Emre Kiciman: Yeah.
>> Wake up, in this particular instance, you know, the context might affect
the judgment, and it might actually be interesting to see distributions of
what people think would be more likely to be mentioned given the --
>> Emre Kiciman: After.
>> No. Before?
>> Emre Kiciman: The stratified results or the -- these?
>> The one with the blue and orange.
>> Emre Kiciman: Mm-hmm.
>> The one with the reps.
>> Emre Kiciman: Yeah.
>> Would you go to the slide after this?
>> Emre Kiciman: You're right. Okay. Sorry. Yes. I skipped around.
>> Can you explain the last one with the can't? What's --
>> Emre Kiciman: So this was actually results from a different analysis that
happened to be in the slide deck. In the presentation, I talked about an
effects-of-causes analysis, where something happened and we look at the
effects afterwards. Here, we were actually doing a causes-of-effects
analysis, so we were looking for what happened that increased the likelihood
of one event. Before, I said we looked at the outcomes Y; here we're actually
looking at the precursors X and iterating over those. So here, all of these
graphs are showing the likelihood for people to post in a suicide ideation
forum. This is joint work with some collaborators [indiscernible]. And we're
looking at basically the words that seemed to increase the likelihood of
people doing this. Yeah?
>> I was wondering if it would also make sense to see if, after, you know, a
given event, you could actually cluster people into different groups that
sort of react or talk differently about that event. Right? So imagine, you
know, a bunch of people taking drug X. Many of them may experience the
expected effects, and then a small group there, but consistently, may
experience adverse effects. Right? So by looking at that spread, you could
actually determine different outcomes and different groups of people that
experience things differently.
>> Emre Kiciman: Yeah.
>> The other thing is, I think in psychology, right, that will be important,
because people's perception of the world, of events, you know, is very
important for their psychological state. A certain event happens, and some
people might just move on very easily, whereas others tend to be very
affected by it.
>> Emre Kiciman: So we've done some very basic things to look into that.
What we've done is we've started clustering the outcomes based on the user
IDs of whoever mentioned them. And that does give us a split that, say, takes
the recreational drug users' outcomes and separates them from the medicinal
drug users' outcomes. There is that type of thing. Now, more generally,
there are methods in the econometrics literature on how to calculate
heterogeneous treatment effects, where you more directly look at what prior
features seem to be important when actually calculating the effect of the
treatment. We haven't done that yet. Yeah.
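A rough sketch of that kind of grouping, with entirely made-up outcome phrases and user IDs: represent each outcome by the set of users who mentioned it, and greedily merge outcomes whose user sets overlap.

    # Outcome phrase -> set of user IDs who mentioned it (made-up data).
    mentions = {
        "felt euphoric":         {1, 2, 3, 4},
        "stayed up all night":   {2, 3, 4, 5},
        "pain went away":        {10, 11, 12},
        "refilled prescription": {11, 12, 13},
    }

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    # Greedy single-link grouping: an outcome joins the first cluster that
    # already contains an outcome with sufficiently overlapping users.
    threshold = 0.25
    clusters = []
    for phrase, users in mentions.items():
        for cluster in clusters:
            if any(jaccard(users, mentions[p]) >= threshold for p in cluster):
                cluster.append(phrase)
                break
        else:
            clusters.append([phrase])

    print(clusters)
    # -> [['felt euphoric', 'stayed up all night'], ['pain went away', 'refilled prescription']]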
>> Just a comment that I think Twitter's a really interesting source, but it
might be interesting as well to run this over medical data, like patient
discharge records, which are very fact-based and talk about what state the
patient is in and subsequent outcomes. That might be very -- yeah. I think
because that type of English is typically just all about facts, whereas on
Twitter, it seems like there might be some more noise interjected into it.
>> Emre Kiciman: Mm-hmm. I agree with you. That would be pretty cool. Yes?
>> You were mentioning that you classified the data on the ID, so some
metadata that you got from the tweeters. But sometimes the metadata in
Twitter is very poor, very badly maintained. So what did you really use as
metadata for classification?
>> Emre Kiciman: Which classification? You mean the --
>> Divided by ID?
>> Emre Kiciman: Oh, sorry. Yeah. I did not mean to mention that. I didn't
mean to imply what I think I must have said. No. I'm looking for the slide
where I think that might have come across. Oh, well. What we did was, we did
not look at any of the metadata from Twitter. We only looked at the language.
At one point -- I couldn't find the slide just now -- we aggregated by user
ID, but we did not otherwise do anything else. So the fact that the user
might have had a particular home page, or have a certain number of followers,
or say their location is somewhere, or imply gender by their name was not
taken into account. Yes?
>> [Indiscernible]. I'm wondering if you put any of the pairs that you
ranked low before the Turkers.
>> Emre Kiciman: We didn't.
>> To see how they ranked those. Because, I mean, some of what you're
getting is interesting implicit data that people may not have intuitions
about.
>> Emre Kiciman: Yeah. I mean -- I'm pulling this up. So this is the
cumulative figure, but in the non-cumulative one, I said that at these lower
ranks, just looking at these last results, you're around the 35 to 40 percent
precision range. And having looked at what gets ranked down, like at the
50th result for a specific scenario -- I treat that as a control -- those
start to look pretty bad.
>> And then the three judges. You have two out of three and three out of
three were yeses?
>> Emre Kiciman: Actually, we averaged. So we asked them on -- I believe it
was a four-point scale. We said that, you know, not relevant at all was a
zero and completely relevant was a one, and then we had probably relevant
and probably not relevant in between.
>> You have three judges, and you averaged the score across the judges --
averaged their scores?
>> Emre Kiciman: Yes. And then, across our 39 experiments, we have the top
five scenarios for each of those 39, and that's what makes up this
distribution in this box right here. Yeah.
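A small sketch of the scoring arithmetic as described, with an assumed numeric mapping for the four-point scale and made-up judge labels: average the judges' scores per outcome, then average over the top-ranked outcomes.

    # Assumed mapping of the four-point relevance scale onto [0, 1].
    label_to_score = {
        "not relevant": 0.0,
        "probably not relevant": 1 / 3,
        "probably relevant": 2 / 3,
        "completely relevant": 1.0,
    }

    # ratings[i] = the judges' labels for the i-th ranked outcome (made up).
    ratings = [
        ["completely relevant", "probably relevant", "completely relevant"],
        ["probably relevant", "probably not relevant", "completely relevant"],
        ["not relevant", "probably not relevant", "probably relevant"],
    ]

    def mean_relevance(labels):
        return sum(label_to_score[l] for l in labels) / len(labels)

    per_outcome = [mean_relevance(r) for r in ratings]   # one score per outcome
    top_k = 2
    print(f"average relevance of top-{top_k} outcomes:",
          sum(per_outcome[:top_k]) / top_k)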
>> [Indiscernible] that you had on there?
>> Emre Kiciman: Not off the top of my head. Sorry. Yeah. Yes?
>> Okay. I remember [indiscernible] talk about a control group. So it seems
that now, for the control group, you use everyone who is not having the
treatment. Have you tried to look at a control group that looks like the
people who have the treatment but [indiscernible] treatment?
>> Emre Kiciman: So -- let me come back to here. When we do the stratified
analysis, if people look like the treatment group, then what will happen is
that we won't do a good job separating the treatment from the control
population when we do the stratification, and you'll get a lot of common
support: a lot of your strata will have enough people on both sides to do a
good statistical comparison. If we do a poor job selecting the control users
and they separate easily -- so it's very easy to tell someone who is going to
take Xanax versus someone who is not, say we have some systematic issue with
the sampling, for example -- then what will happen is that all of these green
dots will come up to the high strata and all the red dots will be pushed down
to the low strata, and then we won't actually have any comparison here. So
when we have a control group that's partially comparable and partially not,
what happens is the control group goes way down to the bottom -- it's very
easy to separate them out -- and then we have some support in the middle
where people are similar.
>> [Indiscernible] maybe the people in the control group are so different
from the people who are in treatment.
>> Emre Kiciman: Yeah.
>> So therefore, the difference is not only about whether they have the
treatment or not, but about a lot of other stuff, like the [indiscernible]
factors you mentioned.
>> Emre Kiciman: So when they are very different, then they get put in --
all of the control group will get put into a very low stratum, because it's
very easy to predict that they're not going to take the treatment, and then
there's no one left in the treatment group to compare them against. So we
don't get statistical [indiscernible] in that stratum, and we can tell that
those people shouldn't be included in our comparison. So that's something --
you bring up an important detail. When we calculate the average treatment
effect, we only calculate it over the region of common support, where we have
enough control and treatment users in a stratum to give us a good result. If
we don't get that common support, then we basically end up ignoring the
users, both treatment and control, who are in a stratum that doesn't have
enough of the opposite group. Okay. Thank you very much.
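A minimal sketch of that final aggregation step, on synthetic data: keep only strata with enough treated and control users (the region of common support), compute the effect within each remaining stratum, and average the effects weighted by stratum size. The minimum-count threshold is an assumption, not the talk's actual setting.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 4000
    stratum = rng.integers(0, 10, size=n)                 # precomputed propensity stratum
    treated = rng.random(n) < (0.05 + 0.09 * stratum)     # higher strata: more likely treated
    outcome = rng.random(n) < (0.10 + 0.15 * treated)

    MIN_PER_ARM = 50        # "enough of the opposite group" (assumed threshold)
    effects, weights = [], []
    for s in range(10):
        m = stratum == s
        t_out, c_out = outcome[m & treated], outcome[m & ~treated]
        if len(t_out) < MIN_PER_ARM or len(c_out) < MIN_PER_ARM:
            continue        # no common support in this stratum: leave it out
        effects.append(t_out.mean() - c_out.mean())
        weights.append(m.sum())                           # weight by stratum size

    ate = np.average(effects, weights=weights)
    print(f"average treatment effect over the region of common support: {ate:+.3f}")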
[Applause].