>>: Okay. So the talk will be given by Hannah Hajishirzi. She joined UW last October, and she received her Ph.D. from UIUC, and after that, she worked at CMU and Disney Research as a post-doc. Her research interests are in AI and machine learning, and her current research is mainly focused on semantic analysis of natural language text and designing automatic language-based interactive systems. And the title of her talk is learning with weak supervision in grounded language acquisition.

>> Hanna Hajishirzi: Thank you. Hello, everyone. Thank you for the introduction. The talk that I will present today is about a domain-adaptive technique that we developed for grounded language acquisition, and I will show some experimental results in understanding professional [indiscernible]. Semantic parsing, in general, is the problem of learning to map a sentence to its meaning representation, or to some sort of logical form that is understandable by a machine. For instance, the sentence "Essien is penalised for a challenge on Fabregas in midfield" can be mapped to something like this: there is a foul by Essien on Fabregas, and the location is in the middle of the field; Essien is penalised by the referee, and the referee gives a free kick to Fabregas, and the location is again the midfield. Semantic parsing in general is a very hard problem, but sometimes it is easier to understand the meaning of a sentence if you have access to the state of the world. So basically, understanding this sentence would be easier if you know that this is about a soccer game and you have the events that are occurring in the soccer game. Given this information, you are able to correspond this sentence to the events that are occurring in the soccer game. So the problem that I am interested in solving here is learning to correspond every sentence to the corresponding part of the event representation.

There has been a lot of great work in the literature on semantic parsing and grounded language acquisition recently. People are using machine learning techniques, and they all use different sources of supervision: some use a fully supervised setting, some decrease the amount of supervision and work in a weakly supervised setting, and some use clustering-based techniques or reinforcement learning. All of these approaches have achieved great performance in many domains. In particular, for grounded language acquisition, the domains of interest are RoboCup soccer commentaries, instructions, including all sorts of instructions like Windows help instructions and navigational instructions and so on, and also understanding weather reports. Some of these approaches are fine-tuned towards a particular domain, but some are domain adaptive and can work well on all of these domains.

Now, the question is what happens if we move to a more challenging dataset, where we have more complex sentences and the world state looks more complicated. The dataset that we are using for this project is a professional soccer dataset, where the English sentences are commentaries generated by professional commentators, transcribed as text, and the events are the sequence of events occurring in the soccer game. Every event has a type associated with it, like pass, tackle, or dispossess, and every event has some arguments, like the player, the team, the qualifier, and many more that I haven't listed here, like the time and the location or zone of the field.
As I said, the goal is to map the sentence to the corresponding event. So let me show you the challenges that we encounter in this domain. There are some challenges that also exist in the previous domains, but they are more difficult to handle in this setting. First of all, building one-to-one supervision, or a one-to-one alignment, for the training data would be very expensive. The reason is that finding the segments of the text, finding the corresponding events in the game, and mapping them together would be a very difficult and also a very subjective task. So, following previous work, we decided to use weak supervision: for every sentence, we build ambiguous training data by mapping that sentence to the events occurring in the temporal vicinity of that sentence. So, as you can see here, we have ambiguous training data. This problem existed in the previous domains as well; however, here, because we have many more events in the world state, the ambiguity is harder to resolve.

The next challenge is that the sentences are much longer, and they also have complex structure and a lot of paraphrases, because these are professional commentators and they want the text to be more exciting. For instance, just to represent this pass event, they have used many different phrases, like "Song began the move," which means there is a pass by Song; "exchange with Wilshere on the edge of the box," which is talking about two consecutive passes between Song and Wilshere; and also "before stealing the ball off Fabregas's foot," which is talking about a very short pass from Fabregas to Song. Or, just to represent the goal event, they describe it with this sentence: "sliding it home with his left foot from ten yards out." As you can see, there is no mention of "goal" or anything like that when they describe this event.

Another problem is that there are many events in the world state that the commentator won't talk about. In this data, this is very severe, because we have ambiguous training data and the commentator only talks about some of the most important events in the domain, so it's much harder to decide which events to select. There are also some sentences that do not map to any part of the world state, as when the commentators talk about statistics or game analysis. For example, they say "what a fantastic goal," and it does not align with any events in the game.

Another challenge is that a part of the text usually refers to a combination of events occurring in the world state. For example, the commentators describe these two pass events together and call it "exchange with Wilshere at the edge of the box." This is a simpler kind of macro event. In a more complicated scenario, the commentator says something like "Arsenal is coming forward," meaning that they're talking about a sequence of pass events from the back of the field to the front, and they just describe it with one sentence, "Arsenal is coming forward."

So now our goal is to solve all these problems and learn to correspond every sentence to the corresponding meaning representation. Or, more formally, for every sentence s_i and its bucket of events V(s_i), our goal is to find the macro event inside that bucket that has the best correspondence.
To be able to answer this, we need to consider two things: first, how do we measure the correspondence between a sentence and an event, and second, how do we rank them — basically, how do we combine some of these events together to find the best macro event? So let me walk you through this and show how we answer these questions.

First, let's start with the easier question: how do we measure the correspondence between one sentence and one event? Forget about the macro event scenario for now. How do we distinguish between cases like this: "Chelsea looking for penalty as Malouda's header hits Koscielny." This is the same sentence, paired with two events. In one, the event type is pass and the arguments are head, Chelsea, and Malouda; in the second, the event is foul, but the arguments are head, Arsenal, and Koscielny. How can we say that the pattern of correspondence in one of them is different from the other, and how can we say that one pattern of correspondence is better than the other? For instance, here no pass events [indiscernible], so this pattern of correspondence should not be a popular pattern. We have the intuition that a pattern of correspondence is good if it occurs frequently in the data. So, for instance, if I see several sentences like this paired with a foul event, I would say this is a popular pattern. Therefore, our goal is to look at all the patterns of correspondence — basically, pair every sentence with all the events in its bucket — and use this global information to find which of these patterns are more popular.

But to find the popular patterns, notice that we can't simply say two patterns are identical to each other; we need some way to compute the similarity between pairs. And notice that a pattern of correspondence is really different from similarity: when I say similarity, I mean similarity between two pairs, while a pattern of correspondence is inside one pair. So our solution is to rank the pairs according to popularity, but we need a similarity measure to find the similarity between pairs.

So let's see how we can say whether two pairs are similar to each other. One idea is to look at each sentence as a bag of words and then just compute their cosine similarity, or use other known notions of similarity. But these two pairs might come out highly similar, because they share a lot of words. What we really care about, though, is the similarity between their patterns of correspondence. For instance, here the co-occurrence of words really matters: you see that the word "header" is probably a good match for "head," or the team name, the player name. But here, you don't see anything about the team name; the player name is aligned, but you see words like "header," "hits," or "penalty," which are probably a good match with respect to the foul event. Therefore, we want to say that this pattern is really different from that one, but the known notions of similarity are not able to capture this.

To model this, we build a new, discriminative notion of similarity. We want to discriminate this pair from all the rest: we want to learn the particular pattern of correspondence of this pair from what it is not, rather than from similar ones, because generating negative examples is much easier for us. For that purpose, we used a technique that has recently become very popular in computer vision, the idea of the exemplar SVM, which tries to learn one instance based on what it is not, rather than what it is. The idea is that we put one pair in as the positive example; we basically learn one SVM per pair. This SVM gets one positive example and a lot of negative examples, and we artificially build negative examples from the [indiscernible] that we are sure will be different from the original pair: we select sentences that are similar to that one but whose events are different, and, vice versa, sentences that are different from the initial sentence but whose events are similar. Then, for every pair, we build a classifier, which basically assigns a weight vector to all the features. For features, we simply use bag of words to represent the sentence, and the event is represented by the event [indiscernible] and all the arguments.

Now we can easily define how similar two pairs are. We say two pairs are similar if, when we apply the model of one pair on top of the features of the other one, the result is in the positive set, and vice versa. In that case, these two pairs are similar, and we define our similarity as a function that applies the classifier of one on top of the features of the other and vice versa. So far, so good: we are able to define a notion of similarity between pairs.
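To make this concrete, here is a minimal sketch of how such per-pair exemplar classifiers could be set up with scikit-learn. The feature extraction and the helper names (pair_to_text, ExemplarPairModel, pair_similarity) are illustrative assumptions, not the talk's actual implementation:

```python
# A sketch of the per-pair exemplar SVM, assuming scikit-learn.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def pair_to_text(sentence, event):
    # A pair's features: the sentence's bag of words plus tokens for
    # the event type and its arguments (a simplifying assumption).
    return sentence + " " + " ".join([event["type"]] + event["args"])

class ExemplarPairModel:
    """One linear SVM per pair: a single positive example against many
    artificially built negatives (similar sentence / different event,
    and vice versa)."""

    def __init__(self, vectorizer):
        self.vectorizer = vectorizer  # a CountVectorizer fitted on all pair texts
        # Weight the lone positive heavily so it is not swamped.
        self.svm = LinearSVC(C=1.0, class_weight={1: 100.0, 0: 1.0})

    def fit(self, positive_pair, negative_pairs):
        texts = [pair_to_text(*positive_pair)]
        texts += [pair_to_text(*p) for p in negative_pairs]
        X = self.vectorizer.transform(texts)
        y = np.array([1] + [0] * len(negative_pairs))
        self.svm.fit(X, y)
        return self

    def score(self, pair):
        # Signed margin distance: positive means "looks like me."
        X = self.vectorizer.transform([pair_to_text(*pair)])
        return float(self.svm.decision_function(X)[0])

def pair_similarity(model_a, pair_a, model_b, pair_b):
    # Two pairs are similar when each pair's classifier accepts the
    # other; summing the two decision values makes the score symmetric.
    return model_a.score(pair_b) + model_b.score(pair_a)
```

The vectorizer would be fitted once over the texts of all pairs, and the negatives are the artificially constructed mismatched pairs just described.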
Just to show you some qualitative results, look at the sentence we were talking about. The highest-weighted features that our approach returned for this pair were actually what we expected: penalty, hits, foul, header. And we also assigned relatively high scores to these other words. The pair that was most similar to the original pair was this one: "Alex Song is too strong for Essien, who goes down looking for a foul," paired with the event foul, with arguments long ball, Chelsea, and Essien. As you can see, the sentence is very different and the event is different, although they are both talking about a foul event.

As I said, our goal is to rank a pair globally, against all the pairs that we have in the domain. For that purpose, we build the underlying structure of the pairs using the similarity metric that we defined. Basically, we build a graph: every node in the graph is one pair, and two nodes are connected to each other if they are similar. Now our goal is to rank the pairs according to their popularity in this graph. So what is a popular pair? We say a pair is popular if there are a lot of other pairs that are popular and similar to that pair. This recursive definition is very similar to the PageRank idea: a page is important if many other important pages link to it. So we use a random-walk algorithm, very similar to the PageRank algorithm, to compute the importance, or popularity, of each pair in this graph. What we do is basically use this formula: for every node, or every pair, we look at the neighbors and update the score of this pair according to the importance, or popularity, that it gets from its neighbors. And we continue until the scores converge.
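The update being described is essentially the standard PageRank recurrence. Here is a minimal sketch over an assumed adjacency-list representation of the pair graph; the damping factor and tolerance are conventional defaults, not values from the talk:

```python
def rank_pairs(neighbors, damping=0.85, tol=1e-8, max_iter=100):
    """PageRank-style popularity scores over the pair graph.

    neighbors: dict mapping each pair id to the list of pair ids it is
    connected to (its similar pairs).
    """
    nodes = list(neighbors)
    n = len(nodes)
    score = {p: 1.0 / n for p in nodes}
    for _ in range(max_iter):
        new_score = {}
        for p in nodes:
            # A pair inherits a share of each neighbor's popularity.
            incoming = sum(score[q] / max(len(neighbors[q]), 1)
                           for q in neighbors[p])
            new_score[p] = (1.0 - damping) / n + damping * incoming
        delta = sum(abs(new_score[p] - score[p]) for p in nodes)
        score = new_score
        if delta < tol:  # stop once the scores converge
            break
    return score
```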
Okay. So just to summarize what we have done so far: we built all the pairs that were possible in our domain, we modelled the correspondences using discriminative similarity and a PageRank-style algorithm, and we ranked the pairs according to their popularity. Then we project these rankings inside each bucket. Now, if we knew there was only one event for every sentence, we would be fine, because we could say, okay, the highest-ranked pair is the [indiscernible], it's going to contain the event that we are interested in. But remember, there are multiple events corresponding to every sentence, so we are looking for macro events. How do we do that? Can we go and search all the possible combinations and just pick the macro event with the highest score? It is possible, but it is not efficient; it's not feasible, really, because, as I said, we have highly ambiguous training data and the sizes of these buckets are pretty large.

But we still have one other problem. Initially, we would like to compute the correspondence between one sentence and a group of events, but what I've described so far only measures the correspondence between one sentence and one event. So we really need to bring macro events into our system. The idea is very simple: we want to build a new pair model for macro pairs — pairs that include one sentence and a macro event — and we also want to be able to rank macro pairs according to this new model. So how do we build a macro pair model? Again, we use the exemplar SVM idea: we ask what is discriminative about this particular macro pair, put that macro pair in as the positive example, and then build the negative examples as I described before. Now we have a model for this macro pair, and we can easily define the similarity between different macro pairs. When we want to apply a pair model to another macro pair, we compute the similarity score as follows: we apply the model on top of the features of each of the pairs inside the macro pair and output the highest score. So now we are able to measure the correspondence between one sentence and a macro event, and we can rank them accordingly.

But again, this is a search in a combinatorial space. How do we solve this problem, given the ranking function that I was describing? This ranking function has a nice property: it is a submodular function. Therefore, a greedy approach helps and gives us a good approximation for returning the best macro event. Our greedy approach is as follows: we first select the highest-ranking event according to our original pair-ranking algorithm, and then, at each iteration, we add one more event to the macro event according to the gain that we get — we always want to maximize this gain. And we repeat this for up to K iterations, or until we don't get any more gain.
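Here is a minimal sketch of that greedy selection, assuming a black-box correspondence score corr(sentence, events) standing in for the macro-pair ranking function, and an illustrative cutoff k_max:

```python
def best_macro_event(sentence, bucket, corr, k_max=5):
    """Greedily build a macro event (a subset of the bucket).

    corr(sentence, events) stands in for the macro-pair ranking score;
    if it is submodular, this greedy loop is a good approximation.
    """
    # Start from the single highest-ranked event.
    first = max(bucket, key=lambda e: corr(sentence, [e]))
    macro = [first]
    remaining = [e for e in bucket if e is not first]
    while remaining and len(macro) < k_max:
        base = corr(sentence, macro)
        # Pick the event with the largest marginal gain.
        best = max(remaining, key=lambda e: corr(sentence, macro + [e]))
        if corr(sentence, macro + [best]) - base <= 0:
            break  # stop as soon as adding an event yields no gain
        macro.append(best)
        remaining.remove(best)
    return macro
```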
So basically, this is the essence of the work that we did, and we evaluated the algorithm on the professional soccer commentary dataset that I was talking about. For evaluation purposes, we built a gold standard by aligning every sentence to the corresponding events, and we did that for about eight games. We have around 900 sentences and about 15,000 events. In total, we have about 40,000 pairs, and on average, we have 40 events per bucket. In the gold labels, we have about 1,400 pairs that are correct. This number is very small compared to the original number of pairs, and the reason is that there are a lot of sentences for which there is no corresponding event, because the commentators talk a lot about game statistics and such things.

For comparison, we compared our approach with some baselines. In particular, we wanted to look at each of the components of our system: what happens if we replace the similarity metric with cosine similarity, or if we replace the ranking algorithm with a simple counting algorithm. We also compared with previous work. The first work that we compared with was the work of Liang, who has a generative approach that is also domain adaptive and works well for learning to align sentences to events. We also compared with state-of-the-art instance-level multiple instance learning. The idea is that our setting is very similar to the multiple instance learning setting, where each of these buckets can resemble the bags in multiple instance learning, and you have one positive instance per bag.

So here are the results that we get for all the games. These are all the game halves, this is our method, and the vertical axis shows the F1. As you can see, we were consistently higher than all the previous methods and baselines, although the performance is still not high, around 48 to 50. We also made comparisons of the averages in terms of F1, AUC, precision, and recall, with the previous approaches — again, the red one is us — and also with the baselines. We also studied our macro-event-finding scenario. As you can see, when we add macro events to the system, our performance increases. This plot shows the cardinality of macro events against the F1, and, as you can see, after about four or five iterations, we really didn't make more improvements.

And here are some qualitative results. This sentence is "Cole is sent off for a lunge on Koscielny. It was poor. It was late. But I'm not entirely sure that should have been red." So this is talking about a foul event from Koscielny on -- no, on Cole, yeah. And it is talking about a red card but has no mention of "card." But our method could capture that there is a foul event and that this is a card event. Another one is "first attack for Drogba, out-muscling Sagna and sending an effort in from the edge of the box which is blocked, and Song then brings down Drogba for a free kick." So Drogba is going to attack, so he takes on and out-muscles Sagna — so Sagna is challenging — and sends an effort in from the box, which is blocked, so there is a save event here. And Song then brings down Drogba, so there is a foul by Song; this is the last event. And this one: "two poor efforts for the price of one from the free kick as Nasri's shot hits the wall before Vermaelen fires wide." So there are two attempts. The first one, the attempt is stopped because it hits the wall, and the second one, Vermaelen misses it, so basically there is a bad shot. We also tested our method on one of the benchmarks in grounded language acquisition, and this is our performance compared to the previous ones.
So to summarize, I introduced a domain-adaptive method for learning to map sentences to corresponding event representations. The main idea was to build the underlying structure of the pairs, rank the pairs, and use this global information to find the best one. And we improved the state of the art for these tasks.

To continue, what we are currently doing is applying these same techniques in the multiple instance learning scenario. As I said, the setting of that problem is very similar to ours, and we have already achieved very good results when we apply the same techniques to multiple instance learning. I would also like to continue this towards automatically generating stories — one example would be automatically generating commentaries given the game events. As I said, we can handle macro events, but there are still things like "Arsenal is coming forward" that our approach cannot capture, because they are really hard; they require a lot of reasoning, inference, planning, and a lot more. And finally, we would like to do event recognition and go beyond grounded language acquisition — do event recognition without requiring access to the event log of the game. So thank you. I would be glad to take questions, sure.

>>: This is a wonderful talk and very clearly presented. So thank you. The phrase grounded language acquisition suggests that you learn something about language that could then be applied in some way; the future work ideas there are similar. I'm wondering if you could say a few words about how you might take what the system has learned and use it to do something like generate commentary or maybe even, more simply than that, given commentary and no event logs, find the sentences that do correspond.

>> Hanna Hajishirzi: That's a great question. So I have some ideas for how to do commentary generation. First of all, notice that I have this long sentence and map it to a group of events, but I haven't done any segmentation saying which part of the sentence goes to which events. So this is something I am thinking about solving. Then I will have a bigger dataset, right — shorter phrases and their corresponding events. Then my main goal is to be able to look at these events and do it vice versa: I have generated these one-to-one correspondences, and then from every event, I would like to generate a sentence — maybe generate some templates and then do commentary generation. Or maybe apply some techniques from reinforcement learning to be able to decide whether to report that sentence right now, or combine it with previous sentences, or maybe wait for more sentences and then say something. So these are some things that I'm thinking about. But the question about how you can understand this without having access to the event log — that is a very hard question. Some ideas I had were: what if I had some information about the domain knowledge, or I know this is a soccer game and I have some information about the types of events, the meanings of events, and so on, so I might be able to take that into account. Another possibility is to use the results of this approach — learn these patterns of correspondence and later apply them to the new sentences that I am seeing. So this is a very hard thing to do, but I have some very high-level ideas.
>>: So it sounds like one of the main advantages of your approach is handling multiple events [indiscernible]. Does that show up at all in the RoboCup data, or do they sort of simplify it?

>> Hanna Hajishirzi: So, in the RoboCup data, for every sentence, you really don't have multiple events corresponding to it. The assumption is that there is always at most one event — it might be zero, but it couldn't be more than one.

>>: I see, okay.

>> Hanna Hajishirzi: But in the weather reports, that's true, you might be able to handle groups of events. But what they do is basically — again, there was this work of Percy Liang where he did segmentation, but for every segment, he aligned it with exactly one event. So he still couldn't handle the case where a part of a sentence maps to multiple events. There are some other techniques that might just use a threshold and then say, okay, if the similarity or the correspondence is higher than some amount, these are the events that we report.

>>: So you also seem to be learning some really cool paraphrasing information there.

>> Hanna Hajishirzi: Yeah, actually, that's something that I talked about. That's true. The reason is that because we are looking at this kind of parallel [indiscernible] thing, we might be able to capture paraphrases, except we don't have enough data. I mean, we have enough data, but we don't have enough annotated data to see if we got it right or not. So we might be able to annotate more things and see if it's working or not.

>>: I was wondering if those errors were coming from just not having enough paraphrase information, or if there was just a lot of indirect missing language that it's harder to [inaudible].

>> Hanna Hajishirzi: I would say that because I really didn't have enough data, I can't fully answer your question. But I think at least half of it is because of the paraphrasing. So you see it's talking about a [indiscernible] ten yards out — this is not at all similar to a goal event, so it is very hard. One other thing is that I am not using very rich features in terms of the language: this is only bag of words, I don't even consider bigrams, so you might be able to improve at least that part. Yes?

>>: I wonder, so for machine translation you have [indiscernible] two-word translation [indiscernible].

>> Hanna Hajishirzi: Exactly. It is very similar. I didn't do any comparisons myself, but Percy Liang in his work actually compared with the IBM models for machine translation, and his approach was much, much higher than the machine translation techniques. I didn't add those comparisons here, actually.

>>: But do you feel --

>> Hanna Hajishirzi: Thank you. There are certainly similarities, that's true, yeah.

>> Michael Gamon: Okay. We're going to the second talk now. It's a pleasure to introduce Munmun De Choudhury. She's a post-doc, and she got her Ph.D. in sunny Arizona from Arizona State, and she's doing a lot of work in computational behavioral science, especially lately with respect to health-related issues, and also looking a lot at social media. I think that's a really interesting combination, and we'll hear some of the latest work.

>> Munmun De Choudhury: Thank you, Michael. Hi, everyone. Good afternoon.
I'm very excited to be here today, and I'm going to talk to you about some of my recent work on how we can mine online behavior from different social platforms and how it can be leveraged to improve health and wellness. This is joint work with Michael, who is sitting over there, and other colleagues at Microsoft Research, Scott Counts and Eric Horvitz.

We are truly living in an information era today, whether we want to seek information about our favorite celebrity or a topic of interest, or to share information with our friends, family, and audiences about big and small happenings around us. In fact, one in six people in the world today is an active user of Facebook. As we constantly enter into these information-rich environments, in some sense this is providing researchers a whole new tool to analyze human behavior at an unprecedented scale, which was not possible before.

A key aspect of many of these online platforms, such as Twitter or Facebook, is that people use them to share reports about really big, global happenings — for example, the riots of the Arab Spring, the earthquake in Haiti, or, more recently, the presidential elections. A less talked about aspect of these platforms is that people also use these tools to share their opinions, feelings, and thoughts around a wide variety of really personal happenings — for example, the birth of a child in a family, the death of a loved one, meeting with a traumatic accident, moving to a new place, and so on. In some sense, what we are seeing is that social media tools and social networking sites such as Twitter or Facebook act as a whole new lens, or window, into understanding people's behavior around these big life events. The reason this kind of research is interesting is that, beyond elucidating core aspects of human behavior at really large scales, it can enable us to identify concerns or issues that may arise in our behavior in a longitudinal manner over a course of time — for example, mental illness.

Mental illness is a very serious challenge in personal and public health today. More than 300 million people in the world are affected by depression, and it is also responsible for the more than 30,000 deaths by suicide that happen in the United States every year. What makes things even worse is that a lot of these statistics are actually underreported. The CDC and the National Institute of Mental Health often conduct surveys, usually once a year or sometimes once in several years, in order to estimate levels of depression in populations. However, these kinds of approaches lack temporal granularity. What we are going to talk about today is how we can leverage online activities, and specifically what people are saying and doing on social media, as a potential tool to address these situations — more specifically, as a tool in behavioral health. And we are going to do that through two different problems. In the first problem, we are going to examine how social media can be used as a mechanism to understand the behavioral changes in new mothers in the postpartum period compared to the prenatal period. And why is that research interesting?
In terms of applications, this kind of research can lead to the design and deployment of low-cost, privacy-preserving intervention programs, or perhaps early warning systems, that new mothers can use in order to receive timely and appropriate assistance when things may go out of the ordinary — maybe even to seek social support from their friends and family — and, in general, to improve their health and wellness in postpartum.

A key challenge of this research was how to identify these new mothers on a specific social platform. We chose Twitter, and in order to tackle that issue, we first started by identifying different birth events based on postings on Twitter. We looked at online newspapers and at the kinds of announcements people make to report the birth of a child. We [indiscernible] several key phrases like the ones you see on the slide, and then we went to the Twitter firehose, to which we have access through a contract with Twitter, and looked at the postings on the firehose to find which could be indicative of a birth event. The authors of those postings gave us a candidate set of people who could be new mothers. But, of course, that set had noise. So we eliminated some of the individuals who were not female users, based on a gender classifier, and after that, we used a crowdsourcing strategy to come up with a more precise set of people who are likely to be new mothers reporting on the birth of their child. After that, we again went back to Twitter and collected several thousand postings from each of these mothers over a five-month period in the prenatal period and a five-month period in the postpartum.

So how do we measure the behavior of these new mothers, and how it changes in postpartum compared to prenatal? We defined four different categories of measures. The first is activity related: for example, the volume of postings over the course of a day; the degree of social interaction they're engaging in, which can be estimated through the replies that people post on Twitter; the kinds of questions they ask; and the number of links they share. We defined two measures of the ego network, inlinks and outlinks, corresponding to the number of followers and followees on Twitter. There were four measures of emotion: the first two were positive affect and negative affect, which we measured using the psycholinguistic resource LIWC, and then activation and dominance, which measure the intensity and controlling power of emotion, respectively, and which we estimated using another lexicon called ANEW. Our final measure of behavior was linguistic style. Styles act as a proxy to understanding people's behavior and social and psychological environments, and we used LIWC's 22 different linguistic styles to estimate the behavior of these new mothers.
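As a rough illustration of how such per-user measures might be computed, here is a sketch using toy placeholder lexicons; LIWC and ANEW are licensed resources, so the word lists and the normalization choices below are assumptions, not the study's:

```python
from collections import Counter

# Toy stand-ins for the licensed LIWC/ANEW lexicon categories.
POSITIVE = {"happy", "love", "excited", "wonderful"}
NEGATIVE = {"sad", "tired", "lonely", "worst"}
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}

def behavioral_measures(posts):
    """Activity and affect measures for one user's list of tweets."""
    tokens = [w.lower().strip(".,!?") for p in posts for w in p.split()]
    n = max(len(tokens), 1)
    counts = Counter(tokens)
    return {
        "volume": len(posts),
        # Replies approximated as tweets that start with an @-mention.
        "replies": sum(p.startswith("@") for p in posts),
        "questions": sum("?" in p for p in posts),
        "links": sum("http" in p for p in posts),
        "positive_affect": sum(counts[w] for w in POSITIVE) / n,
        "negative_affect": sum(counts[w] for w in NEGATIVE) / n,
        "first_person": sum(counts[w] for w in FIRST_PERSON) / n,
    }
```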
Let us talk about an empirical study of how these mothers change in their behavior, in an aggregated manner, in the postpartum period compared to prenatal. We also compared the behavior of these mothers with a background cohort, which is a randomly sampled set of 50,000 Twitter users posting in the same time period as these mothers and who had no indication of having given birth to a child in that period. We notice from these various charts, each of which corresponds to a behavioral measure, that the mothers, shown in the red line, actually undergo quite a bit of change in their behavior in the postpartum period, which is the right side of the blue line. For example, the volume of activity seems to go down. Positivity also goes down, whereas negative affect goes up. Activation and dominance, which measure the intensity and controlling power of emotion, both go down together. And the use of first-person pronouns seems to go up.

However, presumably, some mothers changed more than others, right? And being able to identify and track who those mothers are and in what ways they're changing can have implications for identifying concerns in their behavioral health. So we moved to an individual-level comparison for that purpose. What we see in this slide are heat maps corresponding to two different measures, positive affect and activation. This is an RGB heat map: blue corresponds to lower values, red corresponds to higher values, and green and yellow everything in between. We notice that our conjecture is actually true: for several mothers, positive affect and activation seem to go down in postpartum, which is the right side of the white line, compared to the prenatal period.

We wanted to statistically quantify the degree of change of these new mothers across all the different behavioral measures. We computed effect size measurements — small, medium, and large effect sizes — using Cohen's d, and this is the result that we have here. We notice that quite a few mothers, about a quarter of our sample, actually showed lower activity in an extreme manner. This is kind of intuitive, because probably a lot of these mothers are so overwhelmed with the new responsibility that they don't have enough time to turn to social media and post. However, if you look at the number of mothers who show large effect size changes for emotion, we notice that they're smaller in number. Tracking the behavior of these mothers across all the different categories of behavioral measures we are considering here, we noticed that these mothers show consistent, extreme change in their behavior across all the different measures.
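For concreteness, here is a minimal sketch of a Cohen's-d-style effect size between the prenatal and postpartum values of one measure, with the conventional small/medium/large cutoffs; the talk does not specify the exact estimator or thresholds, so treat these as assumptions:

```python
import numpy as np

def cohens_d(prenatal, postpartum):
    """Effect size between prenatal and postpartum values of one
    behavioral measure (two arrays of per-day values, each length > 1)."""
    x = np.asarray(prenatal, dtype=float)
    y = np.asarray(postpartum, dtype=float)
    pooled_sd = np.sqrt(((len(x) - 1) * x.var(ddof=1) +
                         (len(y) - 1) * y.var(ddof=1)) /
                        (len(x) + len(y) - 2))
    return (y.mean() - x.mean()) / pooled_sd

def effect_label(d):
    # Conventional cutoffs; the study's exact thresholds are not given.
    d = abs(d)
    if d >= 0.8:
        return "large"
    if d >= 0.5:
        return "medium"
    return "small" if d >= 0.2 else "negligible"
```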
However, those numbers are relatively smaller for the mothers with small effect size change and actually much smaller for the background cohort. So what are these top changing Unigrams like? If you look at the table below and the slide, we'll notice that a lot of these unigrams for the mothers with large effect size changes are emotional in nature. And, in fact, the positive affect unigrams seem to go down in usage after a child birth, whereas the negative affect one seems to go up. That is not an attribute we notice in the other two cohorts. On those lines, we wanted to identify what is the span of language that makes these mothers truly different from the rest. For that purpose, we devised a strategy which we called the greedy difference analysis. What this does is it's an iterative strategy in which we start from the top-changing unigram for the mothers with extreme changes and keep eliminating one unigram at a time. At every iteration, we compute the distance of the unigrams, broadly the language that they're using with the other two cohorts. We find two really interesting findings there. We notice that with the elimination of a little over one percent of the top changing unigrams, these mothers with large effect size change become similar in their language use to the mothers with small effect size changes. And for the elimination of a little under 11 percent, they actually become similar to the background cohort. What this tells us is that there is actually a really narrow span of language that is making these mothers different from the rest. And this gives us hope that this kind of discriminatory attribute can possibly be leveraged in a prediction framework where we utilize data of people's -- of mother's activity over the prenatal period in order to predict ahead of time who is -- who are going to show extreme changes in their behavior. 18 For that purpose, we came up with a prediction framework. First of all, we expanded our data collection set to close to 400 mothers. We used a binary classification framework in which we intended to predict the labels of two classes of mothers. The first one corresponding to extreme changing mothers, which is when the behavior of a mother along a certain measure surpasses a certain optimally chosen threshold and standard changing mothers. And we used support vector machine classifier with a radial basis kernel for the purpose. We first trained a model simply based on prenatal data spanning over three months before the birth of a child of a mother and tried to predict their directionality of change in their behavior three months after child birth. We summarized the performance of the model in this slide. We noticed that we're doing pretty well. We have mean accuracy of a little over 71 percent, with pretty good precision and recall. And if you look at the ways behavioral measures, linguistic styles seem to perform pretty well, especially pronoun use by the mothers and if you talk about emotion features, we notice that negative affect and activation are good predictors. We summarize the performance of that model based on the [indiscernible] characteristic curve you see on the right of the screen. As you can figure out, there is room for improvement. And we wanted to explore that are there behavioral cues in the initial one to two weeks after the birth of the child that could be leveraged to figure out how these mothers are going to change later on. 
We then wanted to explore whether there are behavioral cues in the initial one to two weeks after the birth of the child that could be leveraged to figure out how these mothers are going to change later on. So we trained a second model, which used the three months of prenatal data along with optimal training data, derived using expectation maximization, spanning one to two weeks after the birth of the child. And we attempted to predict the directionality of change of the mothers three months after that. You'll see from the [indiscernible] curve on the right side that we actually improve our performance, and our accuracy goes up to 80 percent.

So what are the implications of this research? We have been able to identify a set of 14 to 15 percent of mothers for whom we see really extreme behavioral changes in the postnatal period. For example, their level of activity on social media goes down. Their positive affect seems to go down as well, whereas negative affect goes up. They show increased use of first-person pronouns, which is associated with high self-attentional focus. A lot of these changes are actually pretty consistent over the entire postpartum period, and with the help of a prediction framework, we have been able to use prenatal data, and initial data over a period immediately after childbirth, to predict some of these changes. If you look at the psycholinguistic literature, you'll notice that a lot of these behavioral markers are actually associated with depression. In fact, the 14 to 15 percent of mothers that we identified as exhibiting extreme behavior interestingly aligns with the reported statistics of postpartum depression in the United States, which is a mental health concern that arises in some mothers right after childbirth and is typically pretty underreported. This research gives us hope that, utilizing people's activity, language, and emotions on social media, we may be able to devise unobtrusive behavioral markers of mental illness, such as postpartum depression in this case.

And actually, the possibilities are not just limited to what we can do with social media data in terms of behavioral health. In a second problem, we are going to talk about how we can go beyond postpartum depression prediction and look at another mental illness, a common form of which is called major depressive disorder. A challenge in a lot of this type of research is how to collect ground truth data — that is, people who were actually diagnosed with clinical major depression, or show signs of it or vulnerability to it. We adopted a crowdsourcing strategy for that purpose. On Amazon's Mechanical Turk, we showed crowd workers a standardized depression survey, the CES-D, and we also asked them a few questions about their depression history, if they had one. If crowd workers liked to, they could even opt in and share their Twitter user ID with us. In this manner, we obtained a little under 500 individuals through the crowdsourcing study, and we went to Twitter and collected their postings — several thousand postings — spanning the year before the reported onset of depression that they told us about.

Let us take a look at the behavioral differences of the depressed class and the non-depressed class. On the left, we show the patterns of postings on Twitter. What we notice is that the individuals who were reported to be suffering from major depression show raised activity late at night and actually much lower activity during the day.
In fact, if you look at the clinical literature, you'll see that eight out of ten people suffering from major depression actually have symptoms of insomnia as well. We see differences across a number of different behavioral measures, like we observed for the new mothers. For example, for the depression class, volume seems to be lower, and so is the number of social interactions as measured through replies on Twitter. Negative affect is higher in the one-year period that we looked at, and the use of depression terms is also higher.

Let us look at some of the depressive language that these individuals are using. We notice that in their postings, a lot of these users are talking about the symptoms they're suffering from, such as fatigue, headache, and nervousness. They often use social media as a discourse mechanism, because they want to seek some sort of social support or connect with individuals with similar psychological experiences. They actually have pretty detailed information about their treatment: for example, they talk about antidepressants and even dosages of different drugs, as you can see with, like, 150 milligrams, 40 milligrams, and so forth. And broadly, they also talk about relationships and life in general, with actually a pretty strong emphasis on religious thoughts and religious involvement. If you look at the clinical and psychiatric literature, you'll see that a lot of these markers are correlated with precipitators of depression and symptoms of depression.

Next, we trained a support vector machine classifier, like we did before, in order to predict which individuals are vulnerable to depression, based on the year-long period of their social media postings before the reported onset. We used several different features corresponding to the different behavioral measures we are considering, such as positive affect, activation, volume of posting, and so forth, as measured through their frequency, variance, momentum, and entropy over the year-long trend that we have. The performance of that classifier is shown in the ROC curve on this slide. We get pretty good accuracy, up to 70 percent, using this model, which gives us hope that it's probably possible to predict, before the reported onset, whether or not someone could be vulnerable to depression.
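The talk names frequency, variance, momentum, and entropy as the summary statistics computed over each year-long measure. Here is one plausible reading in code, with the caveat that these exact definitions (momentum as mean day-to-day change, entropy over normalized daily values) are guesses, not the paper's:

```python
import numpy as np

def trend_features(series):
    """Summary features of a year-long daily series of one measure."""
    s = np.asarray(series, dtype=float)
    # Momentum: average day-to-day change over the period.
    momentum = float(np.diff(s).mean()) if len(s) > 1 else 0.0
    p = s - s.min() + 1e-9          # shift to positive, then normalize
    p = p / p.sum()
    entropy = float(-(p * np.log(p)).sum())
    return {"frequency": float(s.mean()),
            "variance": float(s.var()),
            "momentum": momentum,
            "entropy": entropy}
```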
Next, we wanted to use these insights to come up with some sort of an index which could lend us insights into population-scale levels and trends of depression. Earlier in the talk, I mentioned the limitations of the surveys by the CDC and NIMH: they don't have temporal granularity, which makes it really difficult for governmental agencies or other people to implement effective mental health programs. So a mechanism, or an index, that relies on social media and could estimate levels of depression in a fine-grained manner over the population could potentially be really useful. In the slide you see here, on the left we show a heat map of the actual levels of depression in the 50 states of the U.S. as reported by the CDC and NIMH. On the right side, we show the same thing; however, the levels of depression on the right are measured using our social media depression index. And, interestingly, we have a pretty good correlation of up to 0.5 between our predictive measures and the actual measures.

We also investigated a lot of other behavioral trends associated with depression. For example, we found that women tend to have a greater incidence of depression compared to men, a fact which is also supported by CDC data. We notice that across different cities with very high incidence of depression, we are doing a pretty good job in terms of correlation. We notice that if you look at the diurnal trend of depression for both men and women, depression seems to be higher at night than during the day. And there also seems to be some sort of a seasonal component to depression, in that cities in the U.S. with more extreme climatic conditions seem to have greater depression during the winter than during the summer.

So, weaving together the observations from the two studies that I just discussed, I want to highlight the possibility of using social media as a tool in behavioral health, and of being able to do some kind of predictive health using such tools, and thereby make a transformative impact on a range of health services, in terms of making healthcare available to individuals anytime, anywhere, and at a much larger scale. And [indiscernible] actually go beyond behavioral health conditions: we can probably begin to reason about, track, and even predict other kinds of health conditions, such as autism, diabetes, or the obesity epidemic, using social media behavior. I believe that social media, with the rich activity that we can observe in what people are saying, in their linguistic expression, their network structure, and their connectivity with their friends and family, has a lot to offer in being able to make a difference in this domain. With that, I come to the end of the talk. Thank you, and I'll be glad to take any questions.

>>: [indiscernible] some people are less likely to report depression, maybe men versus women or something like that. Maybe just different groups of people are not on social media. [indiscernible].

>> Munmun De Choudhury: That's a very good point. For men and women, we did control for the fact that women tend to be more expressive emotionally than men. But some of the other things are really valid and actually quite hard to do. For example, you can think of SES, socioeconomic status, as having some kind of effect, but that is pretty hard to judge on Twitter. I think these are very valid points. But because we're looking at fairly large scale data, hopefully we are looking at the head of the distribution, and the reporting [indiscernible] are hopefully the tail of the distribution. That's a really valid point, and it kind of runs through any social media research that looks at behavior.

>>: I guess I'm wondering, because it seems like there are, especially in the case of the new mothers you're looking at, two different reasons that you might see people's emotion and behavior change. One being something like postpartum depression, where they're just not dealing with normal problems well, and the other where they're dealing with some sort of brief [indiscernible] stress — still working, or a baby in the NICU, or something where an emotional response or a drop-off in activity would be normal to some extent. [indiscernible]. So I'm wondering how you can distinguish between people whose behavior has changed because of some [indiscernible] that would justify it as a short-term problem in their lives, versus people who are just not coping well.

>> Munmun De Choudhury: Yeah, yeah.
Actually, that's a very good point, because I think a very strong predictor of postpartum depression is any kind of mishap that may happen in people's lives during the pregnancy period or when the child is born; that is usually a very good predictor of postpartum depression. But it seems that some of the indications that we notice across the range of behavioral features are more correlated with some kind of mental, psychological problem they may be experiencing than with an artifact of their circumstances. And I think a good way to validate that is to actually reach out and do a kind of ground truth study on people who are diagnosed with postpartum depression and try to see the differences. Currently, we are actually working on that: we are getting some ground truth data and trying to see if these signals correlate with postpartum depression, or whether there could be another reason that people are behaving in that manner.

>>: I just have a general question. Basically, how would you actually go about doing something with the results of this, right? Identifying symptoms from users' Twitter feeds is a really, really interesting application, but what would it take to actually do something with the results? You can't really hand this to their doctor, because you can't talk to their doctor. How would you take the output from your predictions, or your predictive model, and actually be able to use it?

>> Munmun De Choudhury: Okay. So I'm going to talk about two different directions that we are working on to make something like that happen. The first one is personal health and personal interventions. The idea is — actually, I didn't show it here, but we have a little tool that we have built out. It's a very private tool that people can install on their smartphones; that's the idea. It would connect to their social feeds and show these behavioral measures, so it's kind of a self-reflection mechanism. It's not very strongly intervening, because our tool is not going to tell them, hey, our algorithm thinks that you're depressed. That would creep them out. So it will be more — we're thinking of it as a more subtle intervention, so people can reflect; they can see that, okay, I have been extremely depressed — I mean negative — my negative affect has been really high over the last couple of months; it's not normal; I'm usually not like that. After that, they can decide what they want to do. They can go to a doctor, they can talk to a friend or family, and they can take it forward. The second direction we are considering is that we are talking to a healthcare provider. They are extremely interested in these kinds of tools, because early detection can minimize healthcare costs and there can be other kinds of benefits, and they're actually in a position to go in and help people where I cannot. Like you're saying, I cannot talk to doctors, but healthcare providers have the resources to make something like that happen. The third option, which we are not yet doing but which is certainly a possibility, is to collaborate with a hospital or a set of physicians and kind of convince them to share this tool with their patients and monitor it in a completely honest manner.

>>: [indiscernible] I thought this would be very interesting because [indiscernible] because it targets certain like [indiscernible]. So they post to my Facebook saying [indiscernible]. So it sounds very useful for Facebook or something like [indiscernible] or something.
>> Munmun De Choudhury: That's the direction we're trying to avoid.

>>: So my question is not what use it is, but actually how you get around the regulatory hurdles. Because it seems like it basically has to be in the hands of either the end users or the people who already — your health insurance provider doesn't, like, they already have your health insurance information.

>> Munmun De Choudhury: Yes.

>>: [indiscernible] there's also forms of depression, like manic depression, with very strong swings. It can actually be very helpful.

(Multiple people speaking.)

>>: Yeah, if you're aware you're suffering from a health condition and you're [indiscernible] you're willing to accept those predictions, it can alert you to the period right before a mood swing, and that could actually make a difference.

>>: [indiscernible] would not be just Twitter but email and Facebook.

>> Munmun De Choudhury: We are actually sort of thinking of a plug-in, so we are thinking about that option as well.

>>: Notice the emails from this person [indiscernible]. Delayed response feedback.

>> Munmun De Choudhury: Another kind of potential use is diary-centric systems, so it could be like a logging mechanism for people to see weekly what their pattern of emotion has been, and having a diary-centric thing, when they go off to a doctor, they can show the doctor what it has been. This is potentially a little bit more fine-grained than the doctor asking unstructured questions like, how did you feel over the last one?

>>: So most of the results seem to be understandable, kind of intuitive — so things like [indiscernible] [indiscernible]. Was there anything that you found that was kind of counterintuitive, or different from what you had expected, or, in that case, how do you assess the results?

>> Munmun De Choudhury: I think the assessment of results in this case is not about being surprised but about being able to predict something. Although it's intuitive in hindsight, I think there is a lot of thought and research about what kinds of factors are associated with depression, and I think this kind of research tells us, in a social media context, what kinds of factors one should be looking for in trying to predict depression. In terms of what surprised us, it was some of the predictive power that resulted from the linguistic style features. The pronoun use was pretty interesting: first-person pronoun use was pretty high for the individuals who were suffering from depression, and when we go back, we have evidence in the psycholinguistic literature that these kinds of things are associated with depression — but not in a social media context. It's good to have a validation of that in a social media context. So that part was pretty interesting.

>>: Just another idea. I have a friend doing status books, because people [indiscernible] and those things kind of [indiscernible]. But it does not matter to other people, so they can ultimately predict what kind of [indiscernible] will be depressed by diagnosing pills. So maybe, combined with your research, they can [indiscernible], because some people get depressed even without that. So it's [indiscernible].
>> Munmun De Choudhury: Actually that's especially true for the study with new mothers, because like you mentioned that it might be very overwhelming or there could be other contextual factors, but because there is so much of things going on, people might not realize that something might be off. And having a mechanism to do these kind of predictions can especially useful to kind of raise a red flag and make people aware when things go out of the ordinary.