>>: Okay. So the talk will be given by Hanna Hajishirzi. She joined UW last
October, and she received her Ph.D. from UIUC, and after that, she worked at
CMU and Disney Research as a post-doc. Her research interests are in AI and
machine learning, and her current research is mainly focused on semantic
analysis of natural language text and designing automatic language-based
interactive systems.
And the title of her talk is learning with weak supervision in grounded
language acquisition.
>> Hanna Hajishirzi: Thank you. Hello, everyone. Thank you for the
introduction. The talk that I will present today is about a domain adaptive
technique that we developed for grounded language acquisition. And I will show some
experimental results in understanding professional [indiscernible].
Semantic parsing, in general, is the problem of learning to map a sentence to
its meaning representation or to some sort of logical form that is
understandable by a machine. For instance, the sentence Essien is penalised
for a challenge on Fabregas in midfield can be mapped to something like this:
there is a foul by Essien on Fabregas, and the location is the middle of the
field; Essien is penalised by the referee, and the referee gives a free kick to
Fabregas, and the location is again the midfield.
Semantic parsing in general is a very hard problem, but sometimes it is easier
to understand the meaning of a sentence if you have access to the state of the
world. So basically, understanding this sentence would be easier if you know
that this is about a soccer game, and you have the events that are occurring in
the soccer game. So basically, given this information, you are able to
correspond this sentence to the events that are occurring in the soccer game.
So the problem that I am
interested in solving here is basically learning to correspond every sentence
to the corresponding part in the event representation.
There has been a lot of great work in the literature on semantic parsing and
grounded language acquisition recently. People are using machine learning
techniques, and they all use different sources of supervision: some use a
fully supervised setting, some decrease the amount of supervision and work in a
weakly supervised setting, and some do clustering-based techniques or use
reinforcement learning.
And all these approaches have achieved great performance in many domains. In
particular, for grounded language acquisition, the domains of interest are
RoboCup soccer commentaries; instructions, including all sorts of instructions
like Windows help instructions and navigational instructions and so on; and
also understanding weather reports.
And some of these approaches are fine tuned towards a particular domain, but
some are domain adaptive and can work well on all of these domains.
Now, the question is what happens if we move to a more challenging dataset,
where we have more complex sentences and the world state looks more
complicated. The dataset that we are using for this project was basically a
professional soccer dataset, where we have English sentences that are
commentaries generated by professional commentators (the text is transcribed),
and the events are the sequence of events that occur in the soccer game.
So every event has a type associated with it, like pass, tackle, dispossess and
many sorts of events, and every event has some arguments, like the qualifier,
the player, the team, and also many more that I haven't listed here, like the
location of the player or the time -- the location of the zone of the field and
so on.
As I said, the goal is to map the sentence to the corresponding event. So let
me show you the challenges that we encounter in this domain. So there are --
there are some challenges that still exist from the previous domains, but they
are more difficult to handle in this setting.
First of all, building one-to-one supervision, or a one-to-one alignment, for
training data would be very expensive. The reason is that finding the segments
of the text, finding the corresponding events in the game, and mapping them
together would be very difficult and also a very subjective task.
So following previous work, we decided to use weak supervision, in the sense
that for every sentence, we build ambiguous training data by mapping that
sentence to the events that occur in the temporal vicinity of that sentence.
And so as you can see here, we have ambiguous training data. This problem
existed in the previous domains as well. However, here, because we have many
more events in the world state, the ambiguity is harder.
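As a rough sketch of this weak-supervision step (the event fields and the two-minute window below are illustrative assumptions, not values from the talk):

```python
# Minimal sketch of building ambiguous training buckets: pair each
# commentary sentence with all events in its temporal vicinity.
from dataclasses import dataclass

@dataclass
class Event:
    time: float   # game time in minutes
    etype: str    # e.g. "pass", "tackle", "foul"
    args: dict    # e.g. {"player": "Essien", "team": "Chelsea"}

def build_bucket(sentence_time, events, window=2.0):
    """Return all events within `window` minutes of the sentence timestamp."""
    return [e for e in events if abs(e.time - sentence_time) <= window]

events = [Event(31.0, "pass", {"player": "Song"}),
          Event(31.3, "foul", {"player": "Essien"}),
          Event(55.0, "goal", {"player": "Drogba"})]
bucket = build_bucket(31.2, events)  # -> the pass and the foul, not the goal
```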
The next challenge is that the sentences are much longer, and also they have
complex structure and a lot of paraphrases. Because these are professional
commentators and they wanted the text to be more exciting, for instance, for
just representing this pass event, they have used many different phrases. Like
Song began the move, which means there is a pass by Song. Exchange with
Wilshere on the edge of the box, which is talking about these two passes, two
consecutive passes between Song and Wilshere. And also before stealing the
ball off Fabregas' foot, which is talking about a very short pass from Fabregas
to Song.
Or just to represent the goal event, they describe it with this sentence:
sliding it home with his left foot from ten yards out. As you can see, there
is no mention of goal or anything like that when they describe this event.
Another problem is there are many events in the world state that the
commentator won't talk about. And in particular, in our data, this is very
severe, because we have ambiguous training data, and the commentator only
talks about some of the most important events in the domain. So it's much
harder to figure out which events to select.
Or there are some sentences that do not map to any part of the world state,
like when the commentators talk about some statistics or game analysis. For
example, they say what a fantastic goal, and it's not aligned with any events
in the game. And another challenge is that a part of the text usually refers
to a combination of events that occur in the world state. For example, the
commentators describe these two pass events together and call it exchange with
Wilshere at the edge of the box.
So this is a simpler kind of macro event. And in a more complicated scenario,
the commentator says something like Arsenal is coming forward, meaning that
they're talking about a sequence of pass events that go from the back of the
field to the front, and they just describe it with one sentence, Arsenal is
coming forward.
So now our goal is to solve all these problems and learn to correspond every
sentence to the corresponding meaning representation. Or more formally, for
every sentence S_i and the bucket of events V of S_i, our goal is to find the
best macro event inside that bucket, the one that has the best correspondence.
To be able to answer this, we need to consider two things. First of all, how
do we measure the correspondence between a sentence and an event, and also how
do we rank them, or basically how do we try to combine some of these events
together to find out what the best macro event is.
So let me walk you through this and see how we answer these questions.
First let's go with the easier question: how do we measure the correspondence
between one sentence and one event? Forget about the macro event scenario for
now.
Really, how do we distinguish between cases like this: Chelsea looking for
penalty as Malouda's header hits Koscielny. This is the same sentence, and it
is paired with two events. In one, the event type is pass and the arguments
are head, Chelsea and Malouda; in the second one, the event is foul, but the
arguments are head, Arsenal and Koscielny.
How can we say the pattern of correspondence in one of them is different from
the other one? And how can we say that one pattern of correspondence is better
than the other? So, for instance, here no pass event [indiscernible], so this
pattern of correspondence should not be a popular pattern. So we have the
intuition that a pattern of correspondence is good if it occurs frequently in
the data.
So, for instance, if I see some sentences like this with a foul event, I would
say this is a popular pattern. Therefore, our goal is to look at the patterns
of correspondence -- all the pairs, basically; pair every sentence with all the
events in the bucket -- and look at this global information to find which of
these patterns are more popular.
But to be able to find the popular patterns, we notice that we really can't
say these two patterns are identical to each other. Therefore, we need some
way to compute the similarity between pairs.
And notice that the pattern of correspondence is really different than
similarity. When I'm saying similarity, I'm interested in similarity between
two pairs. And pattern of correspondence is actually inside one pair.
So our solution is to rank the pairs according to the popularity. But we need
a similarity measure to be able to find the similarity between pairs.
So let's see how we can say whether these two pairs are similar to each other
or not. One idea is to look at the sentences in terms of bag of words and then
just compute their cosine similarity or one of the known notions of
similarity. But as you can see, the similarity might be high for these two
pairs, because they share a lot of words between them.
But what we really care about is the similarity between these patterns of
correspondences. For instance, here the co-occurrence of words really matters.
For instance, you see the word header is probably a good match for head, or
the team name, the player name. But here, you don't see anything about the
team name. The player name is aligned, but you see some words like header,
hits, or penalty, which are probably a good match with respect to the foul
event.
So therefore, we want to say that this pattern is really different from that
one. But the known notions of similarity won't be able to capture this. To
model this, we build a new notion of similarity, a discriminative notion of
similarity, in this way: we want to discriminate this pair from all the rest.
Basically, we want to learn the particular pattern of correspondence of this
pair from what it is not, rather than from similar pairs, because generating
negative examples is much easier for us.
And for that purpose, we used a technique that has recently become very
popular in computer vision, the idea of exemplar SVM, which tries to learn one
instance based on what it is not, rather than what it is. The idea is that we
put one pair in as the positive example; we basically learn one SVM per pair.
And this SVM gets one positive example and a lot of negative examples.
And we artificially build negative examples by [indiscernible] that we are
sure will be different from the original pair. So we select sentences that are
similar to that one but whose events are different, and vice versa: sentences
that are different from the initial sentence while the events are similar.
Then for every pair, we build a classifier, which basically assigns a weight
vector to all the features. And for features, we simply use bag of words, just
representing a sentence with a simple bag of words. And the events are the
event [indiscernible] all the arguments.
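As a minimal sketch of this exemplar-SVM step, assuming each (sentence, event) pair has already been turned into a feature vector; the heavier weight on the lone positive example is a common exemplar-SVM trick and an assumption here, not a detail from the talk:

```python
# Sketch of an exemplar SVM: one positive (sentence, event) pair versus
# many artificial negative pairs built as described above.
import numpy as np
from sklearn.svm import SVC

def train_exemplar_svm(positive_vec, negative_vecs, w_pos=10.0, w_neg=1.0):
    """Fit a linear SVM whose positive class is a single exemplar pair."""
    X = np.vstack([positive_vec] + list(negative_vecs))
    y = np.array([1] + [0] * len(negative_vecs))
    # Up-weight the lone positive so it stays on the right side of the margin.
    weights = np.array([w_pos] + [w_neg] * len(negative_vecs))
    clf = SVC(kernel="linear", C=1.0)
    clf.fit(X, y, sample_weight=weights)
    return clf  # clf.decision_function(x) scores other pairs against this exemplar
```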
Now, we can easily define whether two pairs are similar to each other or not.
For that purpose, we say two pairs are similar if, when we apply the model of
one pair on top of the features of the other one, the result falls in the
positive set, and vice versa. If that holds, these two pairs are similar. And
we defined our similarity as some function of applying the classifier of one
pair on top of the other's features and vice versa.
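A minimal sketch of that discriminative similarity, assuming the exemplar classifiers from the sketch above; the symmetric average used to combine the two cross scores is an assumption, since the talk only specifies applying each classifier to the other pair's features:

```python
# Discriminative similarity between two (sentence, event) pairs:
# score each pair's features with the other pair's exemplar SVM.
def pair_similarity(clf_a, feat_a, clf_b, feat_b):
    """feat_a, feat_b: 1-D numpy feature vectors for the two pairs."""
    s_ab = clf_a.decision_function(feat_b.reshape(1, -1))[0]
    s_ba = clf_b.decision_function(feat_a.reshape(1, -1))[0]
    return 0.5 * (s_ab + s_ba)  # symmetric combination (an assumption)
```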
So far, so good. We are able to define some notion of similarity between
pairs. And just to show you some qualitative results, look at this sentence
that we were talking about. The highest weight elements that our approach
returned for this pair were actually the ones we expected: penalty, hits,
foul, header. And we also assigned relatively high scores to these words as
well.
And the pair that was most similar to the original pair was this one: Alex
Song is too strong for Essien, who goes down looking for a foul, versus the
event foul with the arguments long ball, Chelsea and Essien. So as you can
see, the sentence is very different and the event is different, although they
are both talking about a foul event.
So as I said, our goal is to rank a pair globally, among all the pairs that we
have in the domain. For that purpose, we build the underlying structure of the
pairs using the similarity metric that we defined. So basically, we build a
graph: every node in the graph is one pair, and nodes are connected to each
other if they are similar to each other.
And now, our goal is to rank the pairs according to their popularity in this
graph. So what is a popular pair? We say a pair is popular if there are a lot
of other pairs that are popular and similar to that pair. This recursive
definition is very similar to the PageRank idea: a page is important if many
other important pages have a link to that page.
So we basically use a random-walk algorithm, very similar to the PageRank
algorithm, to compute the importance or the popularity of each pair inside
this graph. What we do is basically use this formula: for every node, or
every pair, we look at its neighbors and update the score of this pair
according to the importance, or the popularity, that it gets from its
neighbors. And we continue until the scores converge.
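A sketch of that random-walk ranking in the style of PageRank; the damping factor, tolerance, and iteration cap are assumptions:

```python
import numpy as np

def rank_pairs(sim, damping=0.85, tol=1e-8, max_iter=200):
    """Popularity scores for pairs, given a nonnegative similarity matrix."""
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    # Row-normalize so each pair distributes its score among its neighbors;
    # isolated pairs fall back to a uniform distribution.
    P = np.full((n, n), 1.0 / n)
    row_sums = sim.sum(axis=1)
    nz = row_sums > 0
    P[nz] = sim[nz] / row_sums[nz, None]
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_new = (1 - damping) / n + damping * (P.T @ r)
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r  # higher score = more popular (sentence, event) pair
```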
Okay. So just to summarize what we have done so far: we built all the pairs
that were possible in our domain, we modelled our correspondences using the
discriminative similarity and the PageRank-style algorithm, and we ranked the
pairs according to their popularity. Then we project these rankings inside
each bucket. So now, if we knew there were only one event for every sentence,
we would be fine, because we could say the highest ranked pair is the
[indiscernible]; it's going to contain the event that we are interested in.
But you remember, there are multiple events corresponding to every sentence.
So we are looking for macro events. How do we do that? Can we go and search
all the possible combinations and just say this macro event has the highest
score? It is possible, but it is not efficient. It's not feasible, really,
because, as I said, we have highly ambiguous training data, and the sizes of
these buckets are pretty large.
But we still have one other problem. Initially, we would like to compute the
correspondence between one sentence and a group of events. But what I've
already described was only measuring the correspondence between one sentence
and one event. So for that purpose, we really need to bring macro events into
our system.
But the idea is very simple. We want to build a new model, a new pair model,
for macro pairs, basically: pairs that include one sentence and a macro event.
And also, we want to be able to rank macro pairs according to this new model.
So how do we build a macro pair model? Again, we use the exemplar SVM idea,
and we want to say what is discriminative about this particular macro pair.
We put that macro pair in as the positive example and then build the negative
examples as I described before. And now, we have a model for this macro pair.
Then we can easily define the similarity between different macro pairs. But
when we want to apply the pair model to another macro pair, how do we compute
the similarity score? As follows: we apply the model on top of the features
of each of the pairs inside the macro pair and just output the highest score.
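As a minimal sketch, assuming `clf` is an exemplar model for one macro pair and `macro_feats` holds one feature vector per component pair of the other macro pair:

```python
# Score a macro pair against an exemplar model: apply the model to each
# component pair's features and keep the highest score, as described above.
def macro_score(clf, macro_feats):
    """macro_feats: list of 1-D numpy feature vectors, one per component pair."""
    return max(clf.decision_function(f.reshape(1, -1))[0] for f in macro_feats)
```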
So now we are able to measure the correspondence between one sentence and a
macro event, so we can easily go and rank them accordingly. But again, this is
a search in a combinatorial space. How do we solve this problem given the
measure, actually the ranking function, that we were describing?
So this ranking function has a nice property: it is a submodular function.
Therefore, a greedy approach will help and will give us a good approximation
of the best macro event. Our greedy approach is as follows. We first select
the highest ranking event according to our original pair rank algorithm, and
then at each iteration, we add one more event according to the gain that we
get, always updating this macro event. So we always want to maximize the gain
that we are receiving, and we repeat this for up to K iterations or until we
don't get more gain.
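A sketch of that greedy construction, assuming a `score` function that evaluates a candidate macro event for the sentence (for instance, via the macro-pair ranking just described); the event identifiers and stopping details are illustrative:

```python
# Greedy approximation for the best macro event: seed with the top-ranked
# single event, then repeatedly add the event with the largest marginal
# gain, stopping after K rounds or when no addition helps.
def greedy_macro_event(events, score, K=5):
    """events: iterable of hashable event ids; score: set -> float."""
    remaining = set(events)
    seed = max(remaining, key=lambda e: score({e}))
    macro = {seed}
    remaining.discard(seed)
    for _ in range(K - 1):
        if not remaining:
            break
        base = score(macro)
        best = max(remaining, key=lambda e: score(macro | {e}))
        if score(macro | {best}) - base <= 0:
            break  # no more gain: stop early
        macro.add(best)
        remaining.discard(best)
    return macro
```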
So basically, this is the essence of the work that we did, and we experimented
with the algorithm on the professional soccer commentary dataset that I was
talking about. For evaluation purposes, we built a gold standard by aligning
every sentence to the corresponding events, and we did that for about eight
games. We have around 900 sentences and about 15,000 events. In total, we
have about 40,000 pairs, and on average, we have 40 events per bucket.
In the gold labels, we have about 1,400 pairs that are correct. This number is
very small versus the original number of pairs, and the reason is that there
are a lot of sentences for which there was no corresponding event, because the
commentators were talking a lot about game statistics and those things.
So for comparisons, we compare our approach with some baselines. In
particular, we want to look at each of the components of our system: what if
we replace the similarity metric with cosine similarity, or replace the
ranking algorithm with a simple counting algorithm. And also, we compare with
previous work. Basically, the first work that we compared with was the work of
Liang, who has a generative approach that is also domain adaptive and works
well in learning to align sentences to events.
And also, with a state-of-the-art instance-level multiple instance learning
approach. And the reason for using multiple instance learning is that our
setting is very similar to the multiple instance learning setting, where each
of these buckets can resemble the bags in multiple instance learning, and you
have one positive bag at each time.
So here are the results that we get for all the games. These are all the game
halves, this is our method, and this vertical axis shows the F1. As you can
see, we were consistently higher versus all the previous methods and
baselines, although the performance is still not high, around 48 to 50.
We also made some average comparisons in terms of F1, AUC, precision and
recall with the previous approaches -- again, the red one is us -- and also
with the baselines.
We also studied our macro event finding scenario. As you can see, when we add
macro events to the system, our performance increases. So this shows the
cardinality of macro events, and this is the F1. And as you can see, after
about four or five iterations, we really didn't make more improvements.
And here are some qualitative results. This sentence is Cole is sent off for a
lunge on Koscielny. It was poor. It was late. But I'm not entirely sure that
should have been red. So this is talking about a foul event by Cole on
Koscielny. And it is talking about a red card, but there is no mention of the
word card. Still, our method could capture that there is a foul event and
that this is a card event.
Another thing is first attack for Drogba, out-muscling Sagna and sending an
effort in from the edge of the box which is blocked, and Song then brings down
Drogba for a free kick. So it says Drogba is going to attack, so there is a
take-on event, and then out-muscling Sagna, so Sagna is challenging; and
sending an effort in from the box, which is blocked, so there is a save event
here. And Song then brings down Drogba, so there is a foul by Song. This is
the last event.
And this one: two poor efforts for the price of one from the free kick as
Nasri's shot hits the wall before Vermaelen fires wide. So there are two
attempts. For the first one, the attempt was saved because it hits the wall.
And for the second one, Vermaelen misses, so basically, there is a bad shot.
We also tested our method on one of the benchmarks in grounded language
acquisition, and this is our performance compared to the previous ones.
So to summarize, I introduced a domain adaptive method for learning to align
sentences to corresponding event representations. The main idea was to build
the underlying structure of the pairs, then rank the pairs using this global
information and find out which is the best one. And we improved the state of
the art for these tasks.
So to continue, what we are currently doing is basically applying these same
techniques in a multiple instance learning scenario. As I said, the setting of
that problem is very similar to ours, and we have already achieved very good
results when we apply the same techniques in multiple instance learning.
I would also like to continue this toward automatically generating stories;
one example would be automatically generating commentaries given the game
events. As I said, we can handle macro events, but something like Arsenal is
coming forward our approach cannot capture, because those cases are really
hard; they require a lot of reasoning, inference, planning, a lot more things.
And finally, we would like to do event recognition and go beyond grounded
language acquisition, doing event recognition without requiring access to the
event log of the game. So thank you. I would be glad to take questions, sure.
>>: This is a wonderful talk and very clearly presented. So thank you. And
the phrase grounded language acquisition suggests that you learn something
about language that could then be applied in some way. The future work ideas
there are similar. I'm wondering if you could say a few words about how you
might take what the system has learned and use it to do something like generate
commentary or maybe even, more simply than that, given commentary and no event
logs, find the sentences that do correspond.
>> Hanna Hajishirzi: That's a great question. So I have some ideas about how
to do commentary generation. First of all, notice that I have this long
sentence and map it to a group of events, but I haven't done any segmentation,
saying which part of the sentence goes with which events, right. So this is
something that I am thinking about solving. Then I will have a bigger dataset,
right: shorter phrases and their corresponding events.
Then my main goal is to be able to look at these events and kind of do it in
reverse, right. So I have generated these one-to-one correspondences, and
then from every event, I would like to generate a sentence -- maybe generate
some templates and then do commentary generation. Or maybe apply some
techniques from reinforcement learning to be able to decide if you want to
report that sentence right now, or you would like to combine it with previous
sentences, or maybe it's time to wait for more sentences and then say
something.
So these are some things that I'm thinking about. But the question about how
can you understand this without having access to the event log, that is a very
hard question. Some ideas that I had were: what if I had some information
about the domain knowledge, or I know this is a soccer game, and I have some
information about the types of events, the meanings of events and so on, so I
might be able to take that into account.
Or another possibility was to kind of use the results of this approach --
learn these patterns of correspondence -- and later apply them to the new
sentences that I am seeing.
So this is a very hard thing to do, but I have some very high level ideas.
>>: So it sounds like one of the main advantages of your approach is handling
multiple events [indiscernible] so does that show up at all in the robocup
data, or do they sort of simplify it?
>> Hanna Hajishirzi: So there are two, then. In the robocup data, for every
sentence, you really don't have multiple events corresponding to it. So the
assumption is that there is always at most one event. It might be zero, but
it couldn't be more than one.
>>: I see, okay.
>> Hanna Hajishirzi: But in the weather reports, that's true, you might be
able to handle groups of events. But what they do is basically, again -- so
there was this work of Percy Liang where he did segmentation, but for every
segment, he aligned it with exactly one event. So he still couldn't handle
the case where every part of the sentence can map to multiple events. Or
there are some other techniques that might just use a threshold and then say,
okay, if the similarity or the correspondence is higher than some amount, you
just say this is the event that we can report.
>>: So you also seem to be learning some really cool paraphrasing information
there.
>> Hanna Hajishirzi: Yeah, actually that's something that I talked about.
That's true. The reason is that, because we are kind of looking at this kind
of parallel [indiscernible] thing, we might be able to capture paraphrases,
except we don't have enough data. I mean, we have enough data, but we don't
have enough kind of annotated data to see if we got it right or not. So we
might be able to annotate more things and see if it's working or not.
>>: I was wondering if those errors were coming from just not having enough
paraphrase information or if there was just a lot of indirect missing language
that it's harder to [inaudible].
>> Hanna Hajishirzi: I would say because I really didn't have enough data, I
really can't answer your question. But I think there are -- most of -- not
some of it. At least half of it is because of the paraphrasing. So you see
it's talking about a [indiscernible] ten yards out. So this is not at all
similar to a goal event. So it is very hard.
And one other thing is that I am not using very rich features in terms of the
language. This is only bag of words; I don't even consider bigrams, so you
might be able to improve at least in that part. Yes?
>>: I wonder so for machine translation you have [indiscernible] two-word
translation [indiscernible].
>> Hanna Hajishirzi: Exactly. It is very similar. I didn't do any
comparisons myself, but Percy Liang in his work actually compared with the IBM
models for machine translation, and his approach was much, much higher than the
machine translation techniques. I didn't add comparisons here, actually.
>>: But do you feel --
>> Hanna Hajishirzi: There are certainly similarities, that's true, yeah.
Thank you.
>> Michael Gamon: Okay. We're going to the second talk now. It's a pleasure
to introduce Munmun De Choudhury. She's a post-doc, and she got her Ph.D. in
sunny Arizona from Arizona State, and she's doing a lot of work in
computational behavioral science, especially lately with respect to
health-related issues, and also looking a lot at social media. And I think
that's a really interesting combination, and we'll hear some of the latest
work.
>> Munmun De Choudhury: Thank you, Michael. Hi, everyone. Good afternoon.
I'm very excited to be here today and I'm going to talk to you about some of my
recent work on how we can mine online behavior from different social platforms
and how it can be leveraged to improve health and wellness.
This is joint work with Michael, who is sitting over there, and other
colleagues at Microsoft Research, Scott Counts and Eric Horvitz. So we are
truly living in an information era today, whether we want to seek information
about our favorite celebrity or a topic of interest, or to share information
with our friends and family and audiences about big and small happenings
around us.
In fact, one in six people in the world today is an active user of Facebook.
And as we constantly enter into these information-rich environments, in some
sense this is providing researchers a whole new tool to analyze human behavior
at an unprecedented scale, which was not possible before.
A key aspect of many of these online platforms, such as Twitter or Facebook, is
that people use these platforms to share reports about really big, global
happenings. For example, the riots of the Arab Spring, the earthquakes in
Haiti or, more recently, the presidential elections.
A not so talked about aspect of these platforms is also that people use these
tools to share their opinions, feelings, and their thoughts around a wide
variety of really personal happenings. For example, the birth of a child in a
family, the death of a loved one, meeting with a traumatic accident, moving to a
new place and so on.
In some sense, what we are seeing is that social media tools and social
networking sites such as Twitter or Facebook in some sense act as a whole new
lens or a window into understanding people's behavior around these big life
events. And the reason this kind of research is interesting is that, beyond
elucidating core aspects of human behavior at really large scales, it can
enable us to identify concerns or issues that may arise in our behavior in a
longitudinal manner over a course of time.
For example, mental illness. If we talk about mental illness, it is a very
serious challenge in personal and public health today. More than 300 million
people in the world are affected by depression, and it is also responsible for
the more than 30,000 deaths by suicide that happen in the United States every
year.
What makes things even worse is that a lot of these statistics are actually
underreported. The CDC and the National Institute of Mental Health often
conduct surveys, usually once a year or sometimes once in several years, in
order to estimate levels of depression in populations. However, these kinds
of approaches lack temporal granularity. And what we are going to talk about
today is how we can leverage online activities, and specifically what people
are saying and doing on social media, as a potential tool to address these
situations. More specifically, as a tool in behavioral health.
And we are going to do that through two different problems. In the first
problem, we are going to examine how social media can be used as a mechanism to
understand the behavioral changes in new mothers in the postpartum period
compared to the prenatal period. And why is that research interesting? If we
talk
of applications, this kind of research can lead to the design and deployment of
low cost privacy preserving intervention programs or early warning systems,
probably, that new mothers can use in order to receive timely and appropriate
assistance when things may go out of the ordinary, maybe even to seek social
support from their friends and family and, in general, to improve their health
and wellness in postpartum.
A key challenge of this research was how do we identify these new mothers on a
specific social platform? So we chose Twitter, and in order to tackle that
issue, we first started with identifying different birth events based on the
postings on Twitter. We moved to online newspapers and looked at the kinds of
announcements that people made in order to post on the birth of their child.
We [indiscernible] several key phrases like you see on the slide, and after
that, we went to the Twitter firehose, which we have access to through a
contract with Twitter, and looked at the postings on the firehose in order to
find which postings could be indicative of a birth event.
The authors of those postings helped us construct a sort of candidate set of
people who could be new mothers. But, of course, that set had noise. So we
eliminated some of the individuals who were not female users based on a gender
classifier. And after that, we used a crowdsourcing strategy to come up with
a more precise set of people who are likely to be new mothers reporting on the
birth of their child. After that, we again went back to Twitter and collected
several thousand postings from each of these mothers over a five-month period
in the prenatal period and a five-month period in the postpartum.
So how do we measure the behavior of these new mothers, or maybe even how they
change in postpartum compared to prenatal? We defined four different
categories of measures, the first one being activity related. For example,
what is the volume of postings over the course of a day; what is the degree of
social interaction that they're engaging in, which could be estimated through
the actual replies that people post on Twitter; what kind of questions they
asked; and the number of links they shared.
We defined two different measures of ego network, inlinks and outlinks,
corresponding to the number of followers and followees on Twitter. There were
four different measures of emotion that we used. The first two were positive
affect and negative affect, which we measured using the psycholinguistic
resource LIWC, and activation and dominance, which measure the intensity and
controlling power, respectively, of emotion; we used another lexicon called
ANEW in order to estimate those.
Our final measure of behavior was linguistic styles. Styles act as a proxy to
understanding people's behavior in social and psychological environments. And
we again used LIWC's 22 different linguistic styles in order to estimate the
behavior of these new mothers.
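As a sketch of how these emotion measures could be computed per posting (LIWC and ANEW are licensed lexicons, so the tiny word lists below are stand-in assumptions, just to show the computation):

```python
# Sketch of the four emotion measures from lexicon lookups.
LIWC_POS = {"happy", "love", "excited"}          # stand-in for LIWC positive affect
LIWC_NEG = {"sad", "hate", "tired"}              # stand-in for LIWC negative affect
ANEW = {"happy": {"activation": 6.5, "dominance": 6.4},
        "tired": {"activation": 2.6, "dominance": 4.0}}

def emotion_measures(tokens):
    n = max(len(tokens), 1)
    pa = sum(t in LIWC_POS for t in tokens) / n
    na = sum(t in LIWC_NEG for t in tokens) / n
    hits = [ANEW[t] for t in tokens if t in ANEW]
    act = sum(h["activation"] for h in hits) / max(len(hits), 1)
    dom = sum(h["dominance"] for h in hits) / max(len(hits), 1)
    return {"positive_affect": pa, "negative_affect": na,
            "activation": act, "dominance": dom}

print(emotion_measures("so happy but so tired today".split()))
```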
Let us talk about an empirical study of how these mothers change in their
behavior in an aggregated manner in the postpartum period compared to prenatal.
We also compared the behavior of these mothers with a background cohort which
is a randomly sampled set of 50,000 Twitter users posting in the same time
period as these mothers, and who had no indication of having given birth to a
child in that period.
We notice from these various charts, each of which corresponds to a behavioral
measure, that the mothers, shown with the red line, actually undergo quite a
bit of change in their behavior in the postpartum period, which is to the
right of the blue line. For example, the volume of activity seems to go down.
Positivity also goes down, whereas negative affect goes up.
Activation and dominance, which measure the intensity and controlling power of
emotion, both go down together. And the use of first person pronoun seems to
go up.
However, presumably, there are some mothers who changed more than others,
right? And being able to identify and track who those mothers are and in what
way they're changing can have implications for identifying concerns in their
behavioral health.
We moved over to individual-level comparison for that purpose. What we see on
the slide are heat maps corresponding to two different measures, positive
affect and activation. This is an RGB heat map: blue corresponds to lower
values, red corresponds to higher values, and green and yellow, everything in
between.
So we notice that our conjecture is actually true: for several mothers,
positive affect and activation seem to go down in postpartum, which is to the
right of the white line, compared to the prenatal period. We wanted to
statistically quantify the degree of change of these new mothers across all
the different behavioral measures.
We computed effect size measurements -- small, medium and large effect sizes,
using Cohen's d -- and this is the result that we have here. We notice that
quite a few mothers, about a quarter of our sample, actually showed lower
activity in an extreme manner. This is kind of intuitive, because probably a
lot of these mothers are so overwhelmed with the new responsibility that they
don't have enough time to turn to social media and post.
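A sketch of that per-mother effect size computation, assuming Cohen's d with the conventional 0.2/0.5/0.8 small/medium/large cutoffs:

```python
# Effect size of postpartum vs. prenatal change for one behavioral measure.
import numpy as np

def cohens_d(pre, post):
    pre, post = np.asarray(pre, float), np.asarray(post, float)
    pooled = np.sqrt(((len(pre) - 1) * pre.var(ddof=1) +
                      (len(post) - 1) * post.var(ddof=1)) /
                     (len(pre) + len(post) - 2))
    return (post.mean() - pre.mean()) / pooled

def effect_label(d):
    d = abs(d)
    if d >= 0.8:
        return "large"
    if d >= 0.5:
        return "medium"
    return "small" if d >= 0.2 else "negligible"
```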
However, if you look at the number of mothers who show large effect size
changes for emotion, we notice that they're smaller in number. On tracking
the behavior of these mothers across all the different categories of
behavioral measures that we are considering here, we noticed that these
mothers show consistent, extreme change in their behavior across all the
different measures.
If we look qualitatively at the kinds of postings that these extreme-changing
mothers are making, we'll notice that the mothers with large effects are
posting about feeling lost, which basically indicates they're feeling
helpless. They also talk about loneliness and insomnia, sleeplessness, and
even use self-derogatory phrases such as worst mother or horrible monster
mother and so forth.
That doesn't seem to be the case with the mothers with small effect size
changes; that is, the mothers who only change a little. In fact, they seem to
be pretty excited and seem to use these social media platforms to derive
informational benefits from their community on Twitter on the different
questions they may be having around bringing in a newborn.
We wanted to quantify these differences in the use of different unigrams
across the cohorts of mothers. We notice that for the mothers who are showing
extreme change, more than a quarter of the unigrams in their postings show
statistically significant change in the postpartum period compared to
prenatal, based on a t-test. However, those numbers are relatively smaller
for the mothers with small effect size change, and actually much smaller for
the background cohort.
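A sketch of that per-unigram significance test with SciPy, assuming per-day usage frequencies for the two periods:

```python
# Test whether one unigram's usage changed after the birth event,
# via an independent two-sample t-test.
from scipy.stats import ttest_ind

def unigram_changed(prenatal_counts, postpartum_counts, alpha=0.05):
    """Inputs are per-day frequencies of one unigram in each period."""
    stat, p = ttest_ind(prenatal_counts, postpartum_counts)
    return p < alpha
```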
So what are these top-changing unigrams like? If you look at the table at the
bottom of the slide, you'll notice that a lot of these unigrams for the
mothers with large effect size changes are emotional in nature. And, in fact,
the positive affect unigrams seem to go down in usage after childbirth,
whereas the negative affect ones seem to go up.
That is not an attribute we notice in the other two cohorts. Along those
lines, we wanted to identify the span of language that makes these mothers
truly different from the rest. For that purpose, we devised a strategy which
we call greedy difference analysis. It is an iterative strategy in which we
start from the top-changing unigrams for the mothers with extreme changes and
keep eliminating one unigram at a time.
At every iteration, we compute the distance of the unigrams -- broadly, the
language that they're using -- from the other two cohorts. We find two really
interesting things there. We notice that with the elimination of a little
over one percent of the top-changing unigrams, these mothers with large
effect size change become similar in their language use to the mothers with
small effect size changes. And with the elimination of a little under 11
percent, they actually become similar to the background cohort.
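A sketch of the greedy difference analysis; the talk does not name the distance function, so cosine distance over unigram frequency vectors is an assumption here:

```python
# Iteratively zero out top-changing unigrams and re-measure the language
# distance between the extreme-change cohort and a reference cohort.
import numpy as np

def cosine_dist(u, v):
    return 1 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def greedy_difference(freqs, ref_freqs, ranked_unigrams, vocab_index):
    """Yield (n_removed, distance) as top-changing unigrams are removed.

    freqs/ref_freqs: unigram frequency vectors for the two cohorts;
    ranked_unigrams: words ordered by how much they changed;
    vocab_index: word -> position in the frequency vectors.
    """
    f = np.asarray(freqs, float).copy()
    for i, w in enumerate(ranked_unigrams):
        f[vocab_index[w]] = 0.0
        yield i + 1, cosine_dist(f, ref_freqs)
```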
What this tells us is that there is actually a really narrow span of language
that makes these mothers different from the rest. And this gives us hope that
this kind of discriminatory attribute can possibly be leveraged in a
prediction framework, where we utilize data on mothers' activity over the
prenatal period in order to predict ahead of time who is going to show
extreme changes in their behavior.
For that purpose, we came up with a prediction framework. First of all, we
expanded our data collection to close to 400 mothers. We used a binary
classification framework in which we intended to predict the labels of two
classes of mothers: extreme-changing mothers, which is when the behavior of a
mother along a certain measure surpasses a certain optimally chosen threshold,
and standard-changing mothers. And we used a support vector machine
classifier with a radial basis kernel for the purpose.
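A minimal sketch of that classification setup with scikit-learn; the feature scaling and the cross-validation protocol are assumptions, not details from the talk:

```python
# Binary prediction of extreme-changing (1) vs. standard-changing (0)
# mothers from prenatal behavioral features, with an RBF-kernel SVM.
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def evaluate(X, y):
    """X: (n_mothers, n_behavioral_features); y: 0/1 change labels."""
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
```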
We first trained a model simply based on prenatal data spanning the three
months before the birth of a mother's child and tried to predict the
directionality of change in their behavior three months after childbirth. We
summarize the performance of the model on this slide. We noticed that we're
doing pretty well: we have a mean accuracy of a little over 71 percent, with
pretty good precision and recall.
And if you look across the behavioral measures, linguistic styles seem to
perform pretty well, especially pronoun use by the mothers; and among the
emotion features, we notice that negative affect and activation are good
predictors.
We summarize the performance of that model with the receiver operating
characteristic curve you see on the right of the screen. As you can figure
out, there is room for improvement. And we wanted to explore whether there
are behavioral cues in the initial one to two weeks after the birth of the
child that could be leveraged to figure out how these mothers are going to
change later on. So we trained a second model which used the three months of
prenatal data along with an optimal amount of training data, which we derived
using expectation maximization, spanning one to two weeks after the birth of
the child. And we attempted to predict the directionality of change of
mothers three months after that.
You'll see from the ROC curve on the right side that we actually improve our
performance, and our accuracy goes up to 80 percent.
So what are the implications of this research? We have been able to identify
a set of 14 to 15 percent of mothers for whom we see really extreme behavioral
changes in the postnatal period. For example, their level of activity on
social media goes down. Their positive affect seems to go down as well,
whereas negative affect goes up. They show increased use of the first person
pronoun, which is associated with high self-attentional focus.
And a lot of these changes are actually pretty consistent over the entire
postpartum period. And with the help of a prediction framework, we have been
able to use prenatal data and initial data over a period immediately after
childbirth in order to predict some of these changes.
If you look at the psycholinguistic literature, you'll notice that a lot of
these behavioral markers are actually associated with depression. In fact,
the 14 to 15 percent of mothers that we identified to be exhibiting extreme
behavior interestingly aligns with the reported statistics of postpartum
depression in the United States, which is a mental health concern that arises
in some mothers right after childbirth and is typically pretty underreported.
This research gives us hope that utilizing people's activity, their language
and emotions on social media, we may be able to devise unobtrusive behavioral
markers of mental illness such as postpartum depression in this case.
And actually, the possibilities are not just limited to what we can do with
social media data in terms of behavioral health. In a second problem, we are
going to talk about how we can go beyond postpartum depression prediction and
look at other mental illnesses, a common form of which is major depressive
disorder.
A challenge in a lot of this type of research is how to go about collecting
ground truth data: that is, people who actually were diagnosed with clinical
major depression, or who show signs of it or vulnerability to it. We adopted
a crowdsourcing strategy for that purpose. What we did was, on Amazon's
Mechanical Turk, we showed crowd workers a standardized depression survey, the
CES-D, and we also asked them a few questions about their depression history,
if they had one.
If a crowd worker liked to, they could even opt in and share their Twitter
user ID with us. In this manner, we obtained a little under 500 individuals
through the crowdsourcing study, and we went to Twitter and collected their
postings spanning a one-year period -- several thousand postings -- before the
reported onset of depression that they told us about.
Let us take a look at the behavioral differences between the depressed class
and the non-depressed class. On the left, we show the patterns of postings on
Twitter. And what we notice is that the individuals who were reported to be
suffering from major depression show raised activity after hours in the night
and actually much lower activity during the day.
In fact, if you look at the clinical literature, you'll see that eight out of
ten people suffering from major depression actually have symptoms of insomnia
as well. We see differences across a number of different behavioral measures,
like we observed for the new mothers.
For example, for the depression class, volume seems to be lower, and so is the
number of social interactions as measured through replies on Twitter.
Negative affect is higher in the one-year period that we looked at, and the
use of depression terms is also higher.
Let us look at some of the depressive language that these individuals are
using. We'll notice that in their postings, a lot of these users are talking
about symptoms they're suffering from, such as fatigue, headache, and
nervousness. They often use social media as a discourse mechanism, because
they want to seek some sort of social support or connect with individuals with
similar psychological experiences.
They actually have pretty detailed information about their treatment. For
example, they talk about antidepressants and even levels of -- and even dosages
of different drugs, as you can see with, like, 150 milligrams, 40 milligrams
and so forth.
And broadly, they also talk about relationships and life in general, with a
pretty strong emphasis on religious thoughts and religious involvement. For a
lot of these markers, if you look at the clinical and psychiatric literature,
you will see that they are correlated with precipitators of depression and
symptoms of depression.
Next we trained a support vector machine classifier, like we did before, in
order to predict which individuals are vulnerable to depression based on the
year-long period of their social media postings before the reported onset.
We used several different features corresponding to the different behavioral
measures we are considering, such as positive affect, activation, volume of
posting and so forth, as measured through frequency, variance, momentum and
entropy over the year-long trend that we have.
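A sketch of those trend features for one behavioral measure over a daily series; how exactly momentum was computed is not specified in the talk, so an exponentially weighted trend is assumed here:

```python
# Year-long trend features for one behavioral measure: mean, variance,
# momentum (deviation from an exponential moving average -- an assumption),
# and entropy of the normalized daily series.
import numpy as np

def trend_features(daily, span=14):
    daily = np.asarray(daily, float)
    alpha = 2.0 / (span + 1)
    ema = daily[0]
    for x in daily[1:]:               # exponential moving average
        ema = alpha * x + (1 - alpha) * ema
    momentum = daily[-1] - ema
    if daily.sum() > 0:
        p = daily / daily.sum()
    else:
        p = np.full_like(daily, 1.0 / len(daily))
    logs = np.log(p, out=np.zeros_like(p), where=p > 0)
    entropy = -np.sum(p * logs)
    return {"mean": daily.mean(), "variance": daily.var(),
            "momentum": momentum, "entropy": entropy}
```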
The performance of that classifier is shown in the ROC curve on this slide.
We get pretty good accuracy, up to 70 percent, using this model, which gives
us hope that it's probably possible to predict, before the reported onset,
whether or not someone could be depression-vulnerable.
Next we wanted to use these insights and come up with some sort of an index
which could lend us insights into population-scale levels and trends of
depression. Earlier in the talk, I mentioned the limitations of surveys by
the CDC and NIMH: they don't have temporal granularity, which makes it really
difficult for governmental agencies or other people to implement effective
mental health programs.
So a mechanism or an index that relies on social media and could estimate
levels of depression in a fine grain manner over the population could
potentially be really useful.
On the slide that you see here, on the left we show a heat map of the actual
levels of depression in the 50 states of the U.S. as reported by the CDC and
NIMH. On the right side, we show the same thing; however, the levels of
depression on the right are measured using our social media depression index.
And interestingly, we see a pretty good correlation, up to 0.5, between our
predictive measures and the actual measures.
We also investigated a lot of other behavioral trends associated with
depression. So, for example, we found that women tend to have greater
incidence of depression compared to men, a fact which is also supported by CDC
data. We notice that across different cities with very high incidence of
depression, we are doing a pretty good job in terms of correlation.
We notice that if you look at the diurnal trend of depression for both men
and women, depression seems to be higher at night than during the day. And
there also seems to be some sort of a seasonal component to depression, in
that cities in the U.S. with more extreme climatic conditions seem to have
greater depression during the winter than during the summer.
So, weaving together the observations from the two studies that I just
discussed, I want to highlight the possibility of using social media as a
tool in behavioral health, and of being able to do some kind of predictive
health using such tools, thereby making a transformative impact on a range of
health services in terms of making healthcare available to individuals
anytime, anywhere and at a much larger scale.
And [indiscernible] actually go beyond behavioral health conditions. We can
probably begin to reason about, track, and even predict other kinds of health
conditions, such as autism, diabetes, or the obesity epidemic, using social
media behavior. I believe that social media, with the rich activity that we
can observe -- what people are saying, their linguistic expression, their
network structure, the connectivity with their friends and family -- has a
lot to offer in being able to make a difference in this domain.
With that, I come to the end of the talk. Thank you, and I'll be glad to take
any questions.
>>: [indiscernible] some people are less likely to report depression maybe men
versus women or something like that. Maybe just different groups of people are
not on social media. [indiscernible].
>> Munmun De Choudhury: That's a very good point. For men and women, we did
control for that, in terms of women tending to be more expressive emotionally
than men. But some of the other things are really valid yet actually quite
hard to do. For example, you can think of SES, socioeconomic status, having
some kind of effect, but that is pretty hard to judge on Twitter. I think
these are very valid points. But because we're looking at fairly large scale
data, hopefully we are looking at the head of the distribution and the most
normal effects, and the reporting [indiscernible] are hopefully the tail of
the distribution. That's a really valid point, and it kind of runs through
any social media research that tends to look at behavior.
>>: I guess I'm wondering, because it seems like there are, especially in the
case of the new mothers you're looking at, two different reasons that you might
see people's emotion and behavior change. One being something like postpartum
depression, where we're just not dealing with normal problems well. And the
other where they're dealing with some sort of brief [indiscernible] stress.
Still working, or baby in the NICU, or something where an emotional response
or drop-off in activity would be normal to some extent. [indiscernible].
So when you're looking at all of this, how can you distinguish between people
whose behavior has changed because of some [indiscernible] that would justify
it as a short-term problem in their lives, versus people who are just not
coping well?
>> Munmun De Choudhury: Yeah, yeah. Actually, that's a very good point,
because I think a very strong predictor of postpartum depression is any kind
of mishap that may happen in people's lives during the pregnancy period or
just after the child is born. That is usually a very good predictor of
postpartum depression.
But it seems that some of the indications that we notice across the range of
behavioral features are more correlated with some kind of mental,
psychological problem that they may be experiencing than with an artifact of
their circumstances.
And I think a good way to validate it is to actually reach out and do a kind
of more grounded study on people who are diagnosed with postpartum depression
and try to see the differences. And currently, we are actually working on
that. We are getting some ground truth data and trying to see whether these
measures correlate with postpartum depression or whether there could be
another reason that people are behaving in that manner.
>>: I just have a general question. Basically, how would you actually go
about doing something with the results of this, right? So identifying
symptoms from users' Twitter feeds is a really, really interesting
application. But what would it take to actually do something with the
results? You can't really hand this to their doctor, because you can't talk
to their doctor. How do you actually take the output from your predictions or
your predictive model -- how would you actually be able to use it?
>> Munmun De Choudhury: Okay. So I'm going to talk about two different
directions that we are working on to make something like that happen. The
first one is personal health and personal interventions. So the idea is --
actually, I didn't show it here, but we have a little tool that we have built
out. It's a very private tool people can install on their smartphones.
That's the idea. And it would connect to their social feeds and show these
behavioral measures, so it's kind of a self-reflection mechanism. It's not
very strongly intervening, because our tool is not going to tell them, hey,
our algorithm thinks that you're depressed. That would creep them out.
So it will be more -- we're thinking of it as a more subtle intervention, so
people can reflect; they can see that, okay, my negative affect has been
really high over the last couple of months, it's not normal, I'm usually not
like that. After that, they can decide what they want to do. They can go to
a doctor. They can talk to a friend or family and take it forward.
The second direction that we are considering is that we are talking to a
healthcare provider. They are extremely interested in these kinds of tools,
because early detection can minimize healthcare costs and there can be other
kinds of benefits. And they're actually in a position to go in and help
people where I cannot. Like you're saying, I cannot talk to doctors, but
healthcare providers have the resources to make something like that happen.
The third option, which we are not yet doing, but which is certainly a
possibility, is to collaborate with a hospital or a set of physicians and
kind of convince them to share this tool with their patients and monitor it
in a completely honest manner.
>>: [indiscernible] I thought this would be very interesting because
[indiscernible] because it targets certain like [indiscernible]. So they post
to my Facebook saying [indiscernible]. So it sounds very useful for Facebook
or something like [indiscernible] or something.
>> Munmun De Choudhury: That's the direction we're trying to avoid.
>>: So my question is not what use is it, but actually how do you get around
the regulatory hurdles. Because it seems like basically having it in the hands
of either the end users or the people who already -- your health insurance
provider doesn't, like they already have your health insurance information.
>> Munmun De Choudhury: Yes.
>>: [indiscernible] there's also forms of depression like manic depression
with very strong swings. It can actually be very helpful.
(Multiple people speaking.)
>>: Yeah, if you're aware you're suffering from a health condition and you're
[indiscernible] you're willing to accept those predictions, it can alert you
to sort of the period right before a mood swing, and that could actually make
a difference.
>>: [indiscernible] would not be just Twitter but email and Facebook.
>> Munmun De Choudhury: We are actually sort of thinking like a plug-in so we
are thinking about that option as well.
>>: Notice the emails from this person [indiscernible]. Delayed response
feedback.
>> Munmun De Choudhury: Other kinds of potential uses are diary-centric
systems. It could be like a logging mechanism for people to kind of see
weekly what their pattern of emotion has been, and having a diary-centric
thing like that, when they go off to a doctor, they can show them what it has
been. This is potentially a little bit more fine-grained than the doctor
asking you unstructured questions like how did you feel over the last month.
So that is another potential option.
>>: So most of the results seem to be kind of intuitive. So things like
understandable, like [indiscernible] expressive or if they go to
[indiscernible]. Was there anything that you found that was kind of
counterintuitive or different from what you had expected or, in that case,
how do you assess the results?
>> Munmun De Choudhury: I think the point of assessing results in this case
is not to be surprised but to be able to predict something. Although it's
intuitive in hindsight, I think there is a lot of thought and research about
what kinds of factors are associated with depression, and I think this kind
of research tells us, in a social media context, what kinds of factors one
should be looking for if they're trying to predict depression. In terms of
what surprised us, it is some of the predictive power that resulted from the
linguistic style features. The pronoun use was pretty interesting: first
person pronoun use was pretty high for these individuals who were suffering
from depression, and when we go back, we have evidence in the psycholinguistic
literature that these kinds of things are associated with depression, but not
in a social media context. It's good to have a validation of that in a social
media context. So that part was pretty interesting.
>>: Just another idea, I have a friend doing status books because people
[indiscernible] and those things kind of [indiscernible]. But it does not
matter to other people so they can ultimately predict what kind of
[indiscernible] will be depressed by diagnosing pills. So maybe combined with
your research, they can [indiscernible] because some people get depressed even
without that. So it's [indiscernible].
>> Munmun De Choudhury: Actually, that's especially true for the study with
new mothers, because, like you mentioned, it might be very overwhelming or
there could be other contextual factors, but because there is so much going
on, people might not realize that something might be off. And having a
mechanism to do these kinds of predictions can be especially useful to kind of
raise a red flag and make people aware when things go out of the ordinary.