>> Eric Horvitz: So it's an honor today to have Mark Dredze with us who will be talking about opportunities from social media data for public health. Mark is a research professor in computer science at Johns Hopkins University and a research scientist at the Human Language Technology Center of Excellence. He's also working with the Center for Language and Speech Processing and the Center for Population Health Information Technology. His research is in natural language processing and machine learning, and he's done work in graphical models; semi-supervised learning; information extraction; large-scale learning, which I guess is different than regular machine learning; and speech processing. And we're very excited about the work he's been doing in social media and public health, probably the closest collaborator in terms of shared vision with some of the work we're doing in this space at Microsoft Research here. As some background, Mark did his Ph.D. work at the University of Pennsylvania and has been at Johns Hopkins ever since. And we were also fortunate to have one of his students with us last summer, Michael Paul, so it's -- hope to keep the collaborations and sharing of ideas and maybe even great minds going here. >> Mark Dredze: Thank you very much. So I was last here 11 years ago when I was an intern, so thanks for having me back. Hopefully I will not take that long to come back again. I was in the base class libraries team of the .NET Framework. You guys know what .NET Framework is, right? Okay. When I got here, I had no idea, and like -- because it was still really new and they had to explain like .NET on the Internet is a different thing and stuff. Anyway. So I was in that team. But we were actually not working at all on that stuff. We were the FxCop team. Does anyone -- is that still around? Does anyone know what that is? It's a static code analysis tool. Basically you would give it your code, it would run it through and say like, you know, you should be using StringBuilder here instead of string concatenation and this is -- this is -- I don't even know what else it did. It did a bunch of things like that. >>: It's tough because you came out of MSR [inaudible] PPRC. >> Mark Dredze: Right. So my job was to be that bridge ->>: Cool. >> Mark Dredze: -- when I was here. I don't know if it still exists. Anyway, so that's what I did. I'm now doing different things, as you can see. So what I -- not that I didn't love what I did here, it's just I've done other things. So I'm talking about social media, but the definition of social media on this first slide is way more broad than it probably should be, and I'm going to actually talk about what I would call Web data in general. So let's start by talking about public health, which is where I'm especially focused. So public health is the prevention of disease, prolonging of life, and promotion of health in general. And for those of you who are seeing public health for the first time, these are the sorts of things that public health works on -- disease surveillance, studying how people self-medicate illness, vaccinations, drug use -- and here I mean illicit drug use, or recreational drug use -- tobacco use, educating people about health issues. These are all areas I actually work on within public health, but I picked them to show you kind of the breadth of things that public health is focused on.
And in public health, if you've ever taken a class in it, you'll see that there's a very complicated nine- or ten-step cycle that I've summarized here in a two-step cycle just because that's really the level we need to care about today. You have population, which is everyone here, right? And then you have doctor, which I guess is Eric, maybe no one else. But also public health professionals, and not everyone has an M.D. in this field, there's Ph.D.s too, so you all can be included. And basically these two groups interact in the following way. There's surveillance, and surveillance just means information about the population going to these people. Surveillance sometimes has maybe negative implications. Here what I really mean is just that we're looking at what information we can get out of the population to study the health of that population. Then the doctors kind of think about it for a while, and then they develop interventions which are things that they can do to promote health, reduce disease in the population, and then they survey those interventions and repeat. And that's how public health works at a very high level. So I said before that public health is really about improving health and quality of life in the population, disease, all these things. In order to do this effectively, you need data. You need data on the population. And that's really a big, big challenge of public health: how do we get this data. So traditionally this data comes from two sources, surveys and clinical visits. So surveys are we either go door to door or we call you on the phone and we say, you know, have you seen a doctor lately, do you have a primary care physician, do you suffer from asthma, are you a smoker. We actually do these things. These are CDC-funded studies, as well as other institutions. That's one way we get information. The other is clinical visits. So we go to doctors and we say how many of your patients are smokers, how many of your patients had this disease. There's certain illnesses that are mandatory reporting illnesses. If you show up with certain very rare illnesses at a hospital, that hospital has to report back to the state health agency that there was an outbreak of this illness. So that's the normal place we get data from. There's some less-known nontraditional mechanisms. For example, we sometimes sample wastewater coming out of prisons and sometimes cities to see what drugs are being taken and things like that. But that's a more niche, let's say, method. So these are really the data that people use. And this really limits the sort of research we can do. Because you can imagine, if you know anything about these two data sources, there's a lot of questions that you might want to ask that you cannot ask using these two methods or that are very difficult to ask. So along comes social media, or Web data in general. This slide was made last night, so there's a couple new social media companies that have been included, as you can imagine. I don't know if Snapchat is on here. That's the new thing, right? So social media has a tremendous amount of information in it, and here I really mean Web data in general. People talk about politics, sports, entertainment, what they do for a living. They talk about what they ate for breakfast. And critically for our talk today they talk about health. And so that means here that we have an opportunity to look at social media, which really is a reflection of the ongoing lives of people, right?
People kind of tweet or write Facebook posts or do searches about the sorts of things they do on a daily basis. And because health is a part of that daily life for people, we can see part of that in this data. And so that has tremendous implications for both facets of this public health cycle. So in terms of the surveillance aspect, it means that we can do things that we already do in a better, faster, and cheaper way. And that's very valuable unto itself because there's a lot of things that we want to do that we do but they're slow to get results. They might take us a whole year to do the survey, for example. We can do those things faster. Really exciting, though, are new opportunities, things that we could never do before that we now can do using this data. And I have examples of both of those today. I don't have examples of intervention because I don't really focus on intervention. But I want to just tell you this does happen. So these are things like identifying people for communication, so you identify who you want to intervene with. Tailoring messages specifically for them. So I'll give you one example of this. In Chicago right now there is a group that is looking through Twitter for people who mention that they got food poisoning at restaurants. And when they see that, they send those people a link to the public health department's form to report restaurant food poisoning. Right? And so this is a way where they're basically realizing that people mention these events, they're not being reported to the public health department, and they're intervening with them to say, hey, can you give us that information, we'll go look at that restaurant for you. All right. So I'm really going to talk about the examples here under surveillance. And I'm going to talk about three types of data. Search logs, which you guys know about really well, and these are -- at the level I access them, we're looking just at trends, we really are only able to get coarse trends; the sort of work you guys can do here, where you have access to the logs, you can do much more fine-grained stuff. But I won't be talking about that today. Social media, which is really very good for shallow content analysis. So you can't really go very deep into a single message, because these are very short messages, but you often gain a lot of information out of them nonetheless. And then Web forums, which are really good not for doing trend-level stuff but for really doing focused deep knowledge extraction. And I'll give examples of every one of these today. Any questions so far? Anyone want to argue about something I said? Those of you watching online cannot. You can send me an e-mail and complain. But you have to wait till the end for my e-mail address. So let's start with talking about search logs. So here's a paper that we just published recently which is one example of the sorts of questions that we can now ask using this data. So how are economic health and physical health related on a population level? And what I mean by this is when there's a recession, do people have negative, presumably negative, health outcomes because of that recession. We know that the recession affects a lot of things besides jobs. For example, the divorce rate is heavily influenced by recessions. You can talk to me later about why that is. Not in the way you'd think, actually. But we want to ask here does a recession increase, for example, stress, physical pain, those sorts of things. And the difficulty here is getting the data to do this sort of study. Right?
We need a large population, we need a long span of time. We want to compare things before and during or after the recession. And we have questions about many different ailments. You know, I'm just saying health in general. There's a lot of different questions we might want to ask people about health. So we're going to get data from Google Trends. Which, if you guys know this, is basically where Google publishes trend data on their kind of most popular queries. I don't remember how many of the most popular queries. But if a query has a certain number of -- millions of -- people searching it, they will post trends. And this is mostly for doing things like realizing that Justin Bieber is really popular. Again, I thought he had gone away. But that's kind of the idea behind Google Trends. But we're going to use it for health. So what we did is we looked at 343 queries that we identified using a couple of seed queries; then Google kind of recommends additional queries for you to look at, and then we went through them by hand and picked out the ones that actually are reflective of health. And we looked at the top 100 of those queries that increased during December 2008 to the end of 2011, which was the Great Recession. And we're going to factor in like overall search traffic volumes and other things. You have to get that right so you're not chasing ghosts in the data. So what were the biggest increases? So we found headache-related queries went up 41 percent; hernia-related, 37 percent. That's the best hernia picture I could find. Chest pain-related, 35 percent. And when I say related, I mean a couple different queries all searching about headache-related items. And then the single query that had the biggest spike was stomach/ulcer symptoms, which went up 228 percent during the time period. So let me show you what this actually looks like. Every one of these lines is one of those 100 queries. So you see any given query has a lot of variance to it. But if you look at the trend over time, here this gray area here is the recession. So you can see that before the recession, well, you know, it kind of goes up and down a little bit over time. But there's this really noticeable uptick during the recession here. So I'll just show you one query -- or one set of queries, headache symptoms. Actually, this might be just the query, headache symptoms. I'm not -- I don't remember offhand. And you can see that it kind of varies quite a bit. This is a linear fit to the data before the recession started. So you see a slight decrease in the slope. I think basically the way to read that is it's going to continue on. I don't think the slope really should be decreasing slightly. But you can see that the increase here is clearly very different from what was going on. And this is where the recession -- this is where we said December 2008. The recession actually starts a little bit before. Whatever. It depends how you measure the recession. So that's just an example of the sort of population-level questions we can ask just by looking at what people search for. And you guys know here the work that Eric and others are doing, asking really quite interesting, sophisticated questions on this data and getting answers that we normally would never have access to. And that's really quite exciting. Yeah. >>: So a question. So there's actually a pretty big increase from 2008 to 2009. >> Mark Dredze: Oh, you mean right in here? >>: Yeah, just ->> Mark Dredze: So ->>: [inaudible] little bit earlier, right?
I'm just wondering did you guys look to see if that was -- do you see that type of thing and might it be predictive? >> Mark Dredze: So that's a good question. I don't know. So, first of all, when the recession starts is a little bit difficult to pin down. So I don't remember the dates of like Lehman Brothers collapse and all that, but I remember September 2008 it became apparent there was something going on. We just probably -- let's see, June, September -- it's like right around here. I don't know exactly why we used December as the cutoff. It might be that we're basing it on some other like official recession criteria. I don't really remember. But even if you just know kind of the history, you can see that something has changed around here from what happened before. And then any one query or even a small set of queries is very hard to generalize from because there's a lot of things that can be going on. And so if you look at the overall trend in this line here, that's when it really convinces us something's going on. We have statistical significance numbers and all that in the paper and such that you can look at. I'm giving you just kind of the general plots here. Okay. Yeah, you have to -- I mean, with like anything, you have to be careful not to overfit too much. So we try and make only some really simple, general statements about the data. All right. So that's all I wanted to show you on search logs. Let's talk about social media. I'll speak about this for a little while longer. So I'm really talking about Twitter data here, although we're looking at a bunch of other social media sites as well. But Twitter is really the easiest to work with. We don't have firehose access, but even just using the 1 percent API you can get a lot of good data. So the key thing here is we're getting -- I'm sorry, no, we're not getting this. There are 500 million messages a day, so that's a lot of data to work with. Okay. And I assume everyone here knows what Twitter is. So the reason we're using it is not just because of the size of the data set, it's because of how health really does show up in this dataset. So the first tweet here, nothing like waiting in line to buy cigarettes behind a guy in a business suit buying gasoline with $10 in dimes. So this person is not saying I am a smoker, which is what I actually care about, but they're saying something that actually indicates that they do smoke or they're probably a smoker. Obviously there's a million reasons why -- well, there's not a million reasons. There's a couple reasons why you could be buying cigarettes, but it's not a high entropy distribution over those reasons, so this person's probably a smoker. And the point is we see a lot of tweets like this that are indicative of health and that we can glean health information from, even though the person is not directly trying to report health to us. And actually we're starting some new work on tobacco use, and so these tweets are exactly the sort of thing we're going to be looking for. So let me just give you some examples of the breadth of things that we've done. So we've looked at medication use, how people -- for example, seeing that people use Benadryl to treat insomnia in addition to allergies, looking at patient safety issues, so these are people reporting that their doctor has made a mistake, either surgical error or prescription error. Mental health, this is some new work on posttraumatic stress disorder. So there's really a huge diversity of things you can do with this data.
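Just to make the data-collection side concrete: a minimal sketch of the kind of keyword pre-filtering a pipeline like this might start with, assuming tweets from the 1 percent sample stream have already been saved to a JSON-lines file, could look like the following. The file name, fields, and keyword list are illustrative assumptions, not the actual system.

```python
import json

# Illustrative seed keywords; a real pipeline would use a much larger curated
# list and then hand matching tweets to statistical classifiers.
HEALTH_KEYWORDS = {"flu", "fever", "cough", "headache", "cigarettes",
                   "smoking", "benadryl", "insomnia", "allergy"}

def looks_health_related(text):
    """Cheap first pass: keep a tweet if any seed keyword appears in it."""
    tokens = set(text.lower().split())
    return bool(tokens & HEALTH_KEYWORDS)

def health_tweets(path):
    """Yield tweets from a JSON-lines dump that pass the keyword filter."""
    with open(path) as stream:
        for line in stream:
            tweet = json.loads(line)
            if looks_health_related(tweet.get("text", "")):
                yield tweet

if __name__ == "__main__":
    # "sample_stream.jsonl" is a hypothetical dump of the 1 percent sample.
    for tweet in health_tweets("sample_stream.jsonl"):
        print(tweet["text"])
```

Keyword matching alone over-collects, which is exactly the problem the classifiers described later in the talk are meant to fix.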
I'm specifically going to talk about disease surveillance today, which is the work that's gotten the most attention in the space of Twitter health stuff. So I'm going to tell you a little bit about what we're doing with disease surveillance. So last year many of you remember was a really unique year when it comes to the flu. Not unique once in a hundred years, but unique once in a couple of years, once in a decade, where there really was a very large flu outbreak in the United States. It reached epidemic levels. And you might have known this not just because you were sick or your friends were sick but because there was a lot of coverage of this event in the media. This is a New York Times article from the beginning of 2013 where they're basically saying, you know, this is a really big flu year. So this received a lot of attention. The reason it receives a lot of attention is not just because flu is an inconvenience, but because flu is very dangerous. Most people who get the flu recover and they miss a couple days of work and they're fine, but so many people get the flu that there is a percentage of them that have serious health consequences from it, including death. And the flu kills a lot of people in the United States every year. I don't -- I don't have the statistics offhand, but it predominantly affects the elderly and the young. Okay. One of the reasons swine flu was so big is because that distribution over who it affects was skewed, and it affected what we would normally consider healthy people that aren't affected by the flu to that degree. Anyway, so knowing what's going on with the flu is actually a really serious health concern in the United States. Because of that, the CDC invests a lot of time and effort into this problem, and they have a really good influenza surveillance network. It's called FluView, at least that's what they call it when they post the information online. This is a nationwide surveillance network. It has 2,700 outpatient centers, which are hospitals as well as doctors' offices. They're reporting ILI, which means influenza-like illness. What that means is you go into the doctor and the doctor says, well, you have a fever, you have the chills; I bet you have the flu. Go home and get some rest. Right? Or, here, take some flu medicine. But there's not necessarily a confirmation that for sure you have the flu. And that's what you expect, right? The doctor is not going to order a lab test every time you go in with the flu. Okay. So that's what gets reported. So I mentioned that in detail just so you understand that the numbers that we're going to talk about as a gold standard here from the CDC are by no means a gold standard. Right? Not everyone there has the flu. And 2,700 clinics is a lot, but it doesn't give you a perfect reading into what's going on in the United States. All right. So while this does give you very good data, the major cons are it's slow. It takes them about two weeks to collect this information and to publish it. And so the flu rates that get published every Friday are two weeks old. And they also get updated over time. So they revise those estimates as more and more reports come in, and we've recently realized they can revise them quite a bit. And there's also varying levels of geographic granularity. So they're looking at a national level. They also look by region. There's ten regions in the United States they divide up into. But they don't look at the city level because they just don't have enough information to look at that.
So Twitter is a very attractive source for providing new data for influenza surveillance. Because we can do things real time. Right? I can analyze tweets in real time and tell you what the flu rate is today. And also it has the potential of geographic specificity. So the questions are: Can we do this accurately, that will be our first question; can we do it with geographic specificity; and can we really do it in real time. And what I mean by this is there's a lot of people who will take these -- take a dataset and take a method for doing flu surveillance and show you that for last year and two years ago they could do a really good job. And so for the results I'm going to present you today on the 2012-'13 flu season, we actually built our predictive system before the season started and then ran it on the season and then published the result at the end. If you don't believe me, the NAACL submission deadline, where we published it, was I think December 2012. So that's when we built the system, and it was before the flu season was really in gear. Okay. All right. So the first thing is can we find flu on Twitter. And if you just look for the keyword flu, you can find a lot of examples of it, but they're probably not what you're looking for. So here are some of my favorites. The Ray Lewis flu. Does anyone know what the Ray Lewis flu is? Man, you lived in Baltimore and you don't know the Ray Lewis flu? So does everyone know who Ray Lewis is? Okay. Ray Lewis -- I know, you guys are Seahawks fans? >>: [inaudible]. >> Mark Dredze: You know? Okay. That's right. Congratulations, by the way. Okay. Did you know who he was before the Super Bowl was the question anyway. So the Ray Lewis -- so basically Ray Lewis was the -- a line -- he was a linebacker for the Ravens, and he was particularly ferocious, and players would get physically ill the week before playing him in anticipation of what he would do. It's called the Ray Lewis flu. Okay. Anyway, so you can, whatever, read the article. Does anyone know what swag flu is? >>: We call it the Sherman syndrome here. [inaudible] swag? >> Mark Dredze: Swag flu. Does anyone know what the swag flu is? So you should look it up in Urban Dictionary. It's really quite entertaining. The swag flu -- oh, I have it written down here. A contagious virus that spreads game, confidence, and swagger among a population of individuals. So while this might be contagious, this is not really what we're thinking of when we talk about flu. But if you actually just find tweets with the word flu, you will find all of these as well as others that I'm not pointing out. Right? And that's not really what we want. So this is not just a cute problem, it really is a problem. So here's a plot of flu-related keywords -- I'm sorry, this is not actually keywords. It doesn't matter where we get those. These are flu-related tweets for the summer of 2009. In the summer there should not be much flu. But you see there's a pretty significant increase right here in early June. Does anyone remember what was going on in the world of health in the summer of 2009? >>: Mad cow? >> Mark Dredze: Mad cow? No. I mean, that's always happening to some degree, but... so don't feel too bad. I gave this talk to a public health department -- not avian flu. You're very close. I gave this talk to a public health department; no one there knew either, so don't feel so bad. >>: [inaudible]. >> Mark Dredze: It was swine flu.
Swine flu really broke -- spring 2009 was when it came out, and then everyone was getting ready for the fall of 2009 when it was going to be big. So the summer of 2009, that's what was going on. And we have a blip here. And actually what that blip is, is that the World Health Organization announced at the beginning of June, June 11th, that swine flu was going to be a pandemic. So our system identifies an increase in flu when there really is no increase in flu there. All that's going on is that more people on Twitter are talking about flu because this press release came out. So it's not just that you get cute examples; your system is really misled by things like this if you don't do a good job of identifying exactly the tweets you want. So what we do is we're going to use some statistical classifiers and some NLP features and [inaudible] speech tag and all this good stuff, and we're going to have a three-stage system here. The first stage is we identify tweets that are health related, right? So we have basically a set of health keywords that we filter on, and then even within those keywords we have a classifier that says is this related to health or not. We then take the health-related tweets and say is this about flu or not, and specifically influenza. And the third thing we do is we say is this tweet showing an awareness of the flu or is it actually showing that this person is infected with the flu. So I'll give you examples of that. So many Americans seem to have bad flu right now. I'm worried this trend will reach New Zealand in winter. I might need to step up my reclusiveness. So this person doesn't have the flu. They are expressing awareness of the flu. That's very different from my flu, cough and fever is getting worse. I'm worried. This person has a flu infection. That's what this last stage distinguishes between, and that turns out to be really important. So we're going to use these classifiers, and we're going to find lots of examples of these tweets. The next question is how do we identify geographically where this is happening. So Twitter gives us geo tags, which is great. And we'd like to use those in order to geo tag the data we have. There is now -- in addition to our work, there's one other paper that I know of that is doing nonnational measurements from Twitter data of the flu, right, and it came out the same time as ours. So before that the real focus was on national, and remember when I told you about the CDC data, one of the weaknesses there is looking at local data. So we really want to use Twitter to try and look at local data, not just national data. So can we use Twitter to track these local trends? So is it accurate enough, and do we have enough data -- is it accurate enough for these finer-grained locations, and is there enough data? So only -- this number is actually a little higher now -- in the U.S. about 3 percent of tweets are geocoded. But that's a tiny fraction of what we actually want to use. So what we did is we looked at profile information, and it turns out that people can specify strings that are indicative of location. So this person Allison says they're in New York, this person Ashley says they're in Florida. Here we have someone who says they're in Arizona. So we can extract some information about location just by looking at profile information even if the person isn't geo tagging their tweets. We also have more challenging cases like New York, which our system can do. There are many variations of New York with E's and O's in all sorts of places.
Does anyone know where that is? >>: New Jersey. >> Mark Dredze: Right, very good. Our system doesn't know that. You could imagine doing it with a phonetic model of language and such, but that's a little too much. So we're going to pass on that one. And you can see the influence Justin Bieber has on Twitter, if you weren't aware. So we built a very simple system called Carmen, where Carmen takes those strings and resolves them to one of about 4,000 locations that it knows about. And we have a whole paper on this if you're curious, but the one number you have to know is we go from a dataset with about 1 percent geolocated tweets to 22 percent geolocated tweets. And you can use our code. It's available in Python and Java. It's online. You can extend it, whatever you want. But it's available. So we use this to get a lot more data than the geo tag data that we're provided with. So the last question is, does this actually work? So let me show you some historical data first. So this is 2009 and 2011. So 2009 was a really easy year to predict. Because of the way swine flu happened, there was a huge uptick and then a huge drop. And basically as long as you predict the flu goes up and then down, you're going to do a very good job of predicting 2009. And so a keyword-based system, which is just these are using keywords -- so Health and Human Services ran a competition a couple years ago -- but they're things like flu, influenza, things like that. This is our flu classifier, which is just is this about flu, not is it about infection. And this is the Google Flu Trends data. If you know Google Flu Trends, great. If you don't, they're using queries in order to predict the flu rate, and we can talk about that later, how they do it. So these look really similar. But in 2011 it was a very mild season, so it wasn't kind of this huge increase and decrease. And therefore a lot of systems really struggled. So here is how our system did where we're actually just looking at infection tweets nationally. You can see it does a little bit better in 2009, but who cares. In 2011 it cuts the error in half between the best Twitter method and what Google Flu Trends has. So really substantial improvements. But how did we do for 2012-2013? So this is where we built the system and then we ran it on data as it was happening. So the black line here is the CDC data. The blue line here is the infection data. That's our system. And then the dotted line is just our influenza tweets, not filtered down by infection. So you can see, basically -- I mean, well, it's quite obvious -- the solid blue line matches the black line much better. The dashed line really has two points that seem to stand out as wrong. One is this point here that's fairly obvious. So that's the early January media coverage I mentioned. And then this point here, which is actually when the CDC made an announcement that there was a flu epidemic, so this is the CDC announced it, and then a month later CNN said, oh, wow, we should do a story on it, and then the New York Times and everyone did a story. So these two increases are really just people talking about flu because it's in the news, but they're not actually saying that they're infected. So I don't have a great sense of other systems. I can tell you that at about this point in the season a couple other people published numbers of how well they were doing, and they were like .66. Our number is .93. So we do really, really well.
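For reference, the correlation numbers being quoted here are the usual week-by-week comparison between a system's estimated flu rate and the CDC's reported ILI rate. A minimal sketch of that computation, with made-up weekly numbers rather than the actual data, might look like this:

```python
import numpy as np

# Hypothetical weekly series: estimated flu rate vs. CDC-reported ILI (%).
system_estimate = np.array([1.2, 1.5, 2.1, 3.0, 4.2, 4.8, 4.1, 3.2, 2.4])
cdc_ili         = np.array([1.1, 1.4, 2.0, 3.2, 4.5, 5.0, 4.3, 3.0, 2.2])

# Pearson correlation between the two weekly series (the .66 / .93 style numbers).
r = np.corrcoef(system_estimate, cdc_ili)[0, 1]
print(f"weekly Pearson correlation: {r:.2f}")

# The simpler check described a little later: on weeks with a big
# (above-average) change in the CDC rate, does the system move the same way?
sys_diff = np.diff(system_estimate)
cdc_diff = np.diff(cdc_ili)
big_weeks = np.abs(cdc_diff) > np.abs(cdc_diff).mean()
direction_accuracy = np.mean(np.sign(sys_diff[big_weeks]) == np.sign(cdc_diff[big_weeks]))
print(f"direction accuracy on big-change weeks: {direction_accuracy:.0%}")
```

This is just the evaluation arithmetic; the published analysis also ran the autocorrelation, cross-correlation, and shifting checks mentioned in a moment.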
Other people around here were doing a little bit worse. I don't remember how Google did, but Google actually suffered a little bit of this problem. And there was a really good article in I think Science News about Google Flu Trends really overestimating. And Google actually published a paper recently where they said why that was a problem and how they were fixing it. So we're actually doing very similar to how Google is doing on this task. >>: Question. >> Mark Dredze: Yes. >>: The correlations are for temporally concurrent time periods? >> Mark Dredze: Yeah. >>: [inaudible] this is every week? >> Mark Dredze: So we're looking at it week by week. So each data point is one week. >>: Ah. Okay. >> Mark Dredze: So I'll just say one other thing because you said -- you said something that made me think of this. So we then took these numbers and we chose not to believe them and tried to do all the statistical analyses we could to show this is not actually a significant improvement. So we looked at autocorrelations, cross-correlations, shifting things, the first differential of the -- I don't remember what they're called. We had some statistician friends working with us. And those are all published in our [inaudible] paper, and after doing all of that we still are doing a really good job of predicting the flu. And then the other number I'll just throw out there is if we just do a really simple system which says look at the weeks where there was a big difference, like an above-average difference in the flu rate compared to the previous week that came before it, and ask our system is this difference going up or down, our system got a hundred percent. Right? So that's what I think is a more useful number. When someone looks at this, they're not going to care necessarily about these little points here about exactly what it was. What they want to know is, is this week worse or better. And on that question we did as well as you can. Okay. So the next thing we did was we want to see if this was going to be useful at a local level. So we looked at New York City. The reason we looked at New York City is they're one of the few, if not the only, city public health departments that publish data online. There are other cities that have this data, but it's either not nearly as good as New York City's or they don't publish it online. Apparently New York had a mayor that was really into public health. Anyone follow New York politics? So he invested a lot of resources in that city, and they have a really great public health infrastructure as a result. So they have numbers like this. So they actually published some of the numbers online ->>: He was also invested in Johns Hopkins. >> Mark Dredze: He also gave us over a billion dollars [inaudible]. He's a good guy. I like him. You can put that on the Web. So they do not publish the numbers we actually care about, though, which is counts and not percentages. The difference is not important here. But what this allowed us to do is a blind study. So rather than downloading the data and running a correlation, which you might not believe, we sent them our predictions and we asked them to do the correlation for us based on their data. So if you don't believe me about the first thing, you might think I'm lying about that too, but you can call them and they'll verify it. So this is what they told us about how we were doing: our infection curve was .88, our keywords were .72, and comparing that to the national curve, it's -- it's barely statistically significantly different.
So that means we're almost doing as well just on New York City as we would on the national trends. And that's very encouraging because it means that if we can actually predict things in a local way, then we can use the system to help enable local decisions about how to respond to influenza. And really a lot of these decisions are made on a local basis. Knowing the national rate is helpful, but if you need to close a school district or run a vaccination campaign, you want to know that at a local level. So just the last thing, let's go back to this curve. So, remember, this curve now I can tell you is just influenza-identified tweets using a classifier. This curve is infection tweets. And you can see the blips disappear. And in the summer the infection curve doesn't notice any difference. There's no change in the infection rate. So I'll show you this. This is one of the first times I'm talking about this. We really want to give public health researchers as much of the data I've shown you here as possible to enable them to make decisions. So we're building a Web site that will do exactly that. And this is not available yet. If you're interested in it, send me an e-mail, and I can share it with you when we go into our private beta. But we're really looking to this sort of platform to enable public health researchers to get direct access to our data, not as a published paper the year after, but in real time to help them make these decisions. Okay. So that's what I wanted to say about social media. The last thing I wanted to talk about today is Web forums. And this is an area where a couple of people in the field are working, but not nearly as many as on Twitter. And I really think there's a tremendous value here that people have been overlooking. So we've done a number of things here. We've looked at quality of care based on what people think of their doctors, looking at doctor ratings, for example. We're looking at prescription drug use, which is something you guys have done as well. But what I really want to talk about today is something that I think is the most exciting. We're just talking about illicit drug use. And hopefully the money next to the pills gives you the illicit feeling and not just that this is someone's prescription. Well, it might be someone's prescription, but not the person taking it necessarily. All right. So ->>: Just tells us it's the U.S. as opposed to Canada. >> Mark Dredze: It tells you it's the U.S. Right. Yes. I know most about the illicit drug market of the United States. That made this sound weird. Okay. I think you all know what I'm trying to say. All right. Very good. So let me first tell you the dataset that we're using. So we're using a site called Drugs-Forum. If you've never heard of this site, it's the sort of thing that you really can't believe exists. So I'll tell you what they say they are. Drugs-Forum is an information hub of high standards and a platform where people can freely discuss recreational drugs in a mature, intelligent manner. Their words; not mine. You might disagree on the mature, intelligent manner of the site. Drugs-Forum offers a wealth of quality information and discussion of drug-related politics in addition to assistance for members struggling with addiction. So this last part may be true, but this part, freely discuss recreational drugs, really dominates the site. That's why people go there. They don't go there to talk about policy; they go there because they're talking about how to get high. So we have 100,000 messages, 20,000 users.
We have self-reported information on those users. So the users of the site skew male; that is not because recreational drug users skew male to that degree, it's just because of who's using this Web site. So we know that's not true from other data. 50 percent of these people say they live in the United States. And then this is the age breakdown of the younger groups. It goes up higher. I do not believe this number at all because it says that 58 percent of the site is between 20 and 29. You're not allowed to use this Web site if you're under the age of 18, and I do not believe that means that those under the age of 18 choose not to use it; I think they just lie about their age and they will probably just select this box. So I do not trust these numbers. These numbers I kind of trust a little bit more. All right. So why do we want to use this data? So we're really interested in Web-based drug research. And to help you understand why, I need to tell you a little bit about what's going on in the world of illicit drugs. So if you don't know anything about this, you might assume that the world of drugs is cocaine and heroin and LSD and meth. Anyone else want to throw one out there? Always interested to see who volunteers which drug. Okay. Let's say those. We're all comfortable with those. Everyone's heard of those. Those are the big ones. Right? And that's when you hear about drug addiction. That's what we often hear about. What's going on now, though, is that there are new synthetic drugs that are coming to market at an ever-increasing pace. In 2011 [inaudible] recorded 52 new drugs in the year. So it's an average of one a week. I don't know if they were spaced out like that. And what's going on is that people find some drug, right, they tinker with it in a lab, they have a new chemical compound, and they produce a new drug. And they do this because they're experimenting to try and come up with the next big thing. They're also doing this in response to policy. So once we make one drug illegal, if you go back to the lab and you can tinker with it so you have a different drug, right, you can then market that again as something else. Because these are new drugs, they're not illegal when they come to market. And so you could often legally purchase these things. And they often have names that try and mask what's supposed to be going on so that people don't realize what they're for. So they're often called incense; spice is another name. Has anyone heard of bath salts? So bath salts are not bath -- like they're not for baths. Okay? They are called that because of how they look, but it's completely unrelated. It's another drug. Does anyone know why bath salts are really popular all of a sudden, in the past year or so? You nodded. Do you know why? >>: Yeah, I'm assuming it's really cheap and easy to manufacture. >> Mark Dredze: Cheap -- those things are true. But that's not why. A lot of people now have heard of it that didn't before. If you go look at your logs, you'll see a big spike in bath salt queries. So about a year ago there was an article about the man in Florida who turned into a cannibal because he took bath salts. Does anyone remember this? >>: Yes. >> Mark Dredze: Yeah. Oh, yeah. So that's not actually true. He didn't take bath salts. It was misreporting. But everyone heard about bath salts after that.
And so bath salts is the example of something that about a year ago really got the attention of a lot of people in the drug enforcement -- and then now I believe they're illegal -- but the drug addiction communities as this big new thing, even though we have posts on it that go back five, six years. So what's going on is you have these new things being introduced, and they largely fly under the radar because they're new and not a lot of people know about them. But the drug users certainly know about them. And this poses a really big problem for doctors. So if someone shows up in the ER and they've overdosed, right, you want to know what they took and how much. And if they can't tell you or you never heard of the drug before and you don't know how much of that drug is a normal amount, then it's very difficult to do anything. The same is true of people who work with addiction, right, they work with teenagers who are fighting addiction, and they never heard of the drug before, it's very hard to try and help that person. So there's a huge need for information and really not a good place for this information to come from. So the sort of questions we might have about a drug like Salvia, and I'll get to Salvia in a minute, how does this drug vary by demographic group, in other words, who's taking it, who's using it, what are the effects, what are the dosages of Salvia. Has anyone heard of Salvia? It's a little more popular of a drug. Okay. Why -- do you guys know why Salvia is popular? Not cannibalism. >>: Because it's legal, right? >> Mark Dredze: It's not -- I don't think it's legal anymore. But it was for a while, you're right. But what made it really popular? Do you know why? So if you go on YouTube and you search for videos, there are a lot of videos of high schoolers taking Salvia and then recording themselves with webcams and uploading it to YouTube. It was like a whole big thing. But this got started ->>: Hanging around with the wrong crowd. >> Mark Dredze: You got to look at YouTube. YouTube is the place. I'm not exaggerating. YouTube has tons of drug information. There are people who study YouTube for that reason. So Miley Cyrus, there's a video of her on a birthday party ->>: Yes. >> Mark Dredze: You know this? >>: We're hearing about this, the controversy with her and her drug use. >> Mark Dredze: Yes. So there's a picture of her -- not a picture. There's a video of her on YouTube where she is smoking Salvia, and that was the big revelation to a lot of high school kids of, oh, there's this drug called Salvia, and actually really increased the usage. That's why it became very popular, Miley Cyrus. I'm not going to play the video for you guys. You guys will have to go see it for yourself. TMZ. It's from TMZ. That's the place to -- anyway. >>: [inaudible] my son [inaudible]. >> Mark Dredze: But, by the way, it is interesting to note, the second video here, it recommends Justin Bieber smoking weed. So I point this out not just because it's funny but because there really is a lot of information on YouTube about this, about what's going on. It influences popularity. And people in the drug research community, public health, they don't really have any of this on their radar. So what we wanted to do is take all of these drug messages from these drug forums and kind of summarize them in some way and present them in some way to people who actually need to read them to learn about what's going on so that they don't have to sit there and read through every one. 
And because we know NLP and machine learning, we thought we could do something interesting. So we used a model called factorial LDA, which I'm just going to explain very briefly here. It's a topic model, so if you're familiar with topic models like latent Dirichlet allocation, it's the same idea. So word tokens are associated not with a single latent topic, but instead with a set of latent variables. All right? And that set of latent variables allows us to jointly model multiple things at once. For example, we might be interested in modeling topic and perspective or sentiment, and modeling these things all at the same time instead of a topic model where you do these things individually as different topics. So that means instead of a distribution over topics, we have a distribution over what we call tuples, where each tuple has its own word distribution, just like each topic would. Ignore the last line, because I don't have time to talk about that. So we looked at a three-factored model for this dataset where the drugs -- we had 24 drugs -- you see, I could only name four offhand, but there's 24 in this dataset. So like tobacco, amphetamines, alcohol showed up as well. So 24 drugs. That's one factor of the model. The second factor is the way it's delivered, and we just looked at injection, oral, smoking, and snorting. There are a lot of other ways to get drugs in your system that we didn't look at here. And then we looked at five aspects of usage: the chemistry of the drug; the culture around it -- and you see really interesting differences in the culture around drugs; looking at alcohol versus heroin, for example, you see really different words showing up -- the effects of that drug on a person; the health implications of using that drug; and information on usage, like dose and preparation. So what we're going to do is look at tuples of these things, like cannabis, oral, culture, for example. So the way factorial LDA does this, I mentioned it captures these tuples, here are some examples of what it can learn for two tuples. Cocaine, snorting, health and cocaine, snorting, usage. And you can see that for health you have things like nose, blood, things like that. So nosebleeds. Whereas for usage, coke, lines, cut, right? And even if you don't know anything about cocaine, this should make sense to you. And critically what the model is doing is it's saying this word list here and this word list here, they should be similar in that they're both talking about cocaine and snorting. The difference should only be in the influence of health versus usage. The way we do that, right, is we don't want to actually learn every possible combination of distribution here, because that would be too many to learn independently. So we need to tie these distributions together in such a way that it encourages these two things to look as similar as possible except for that one difference. And if you looked at a much longer list of the top words, you would see that they look much more similar. We're just looking at the top ones here. So the way we do that is the system learns for each independent factor a list of words that it thinks are useful. So here are general words about cannabis, general words about oral use, and general words about chemistry, where the oral use is across all drugs in all situations. These are all chemistry words. Right? These are learned by our model.
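A toy sketch of the combination step, with made-up words and weights -- a simplification for illustration, not the actual factorial LDA implementation: each factor contributes a learned weight for every word, the weights for a particular tuple are summed, and the result is normalized into the word distribution that serves as that tuple's prior.

```python
import numpy as np

vocab = ["cannabis", "gram", "brownies", "butter", "oil", "smoke",
         "thc", "extraction", "dose", "water"]

# Toy per-factor log-weights over the vocabulary (learned in the real model).
factor_log_weights = {
    ("drug", "cannabis"):    np.array([2.0, 0.5, 0.8, 0.6, 0.7, 1.0, 1.5, 0.4, 0.3, 0.1]),
    ("route", "oral"):       np.array([0.2, 0.1, 1.5, 1.2, 0.9, -1.0, 0.3, 0.5, 0.8, 0.6]),
    ("aspect", "chemistry"): np.array([0.1, 0.2, 0.0, 0.8, 1.0, 0.0, 1.8, 2.0, 0.4, 0.9]),
}

def tuple_prior(*factors):
    """Log-linear combination: sum the factors' log-weights, then normalize."""
    score = sum(factor_log_weights[f] for f in factors)
    expw = np.exp(score - score.max())   # numerically stable softmax
    return expw / expw.sum()

prior = tuple_prior(("drug", "cannabis"), ("route", "oral"), ("aspect", "chemistry"))
for word, p in sorted(zip(vocab, prior), key=lambda pair: -pair[1])[:5]:
    print(f"{word:12s} {p:.3f}")

# In the full model this combined distribution is not the tuple's final word
# distribution; it acts as a prior, and the tuple's actual distribution is
# sampled so the posterior can deviate to capture combination-specific nuances.
```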
And what the model is going to do is when it is interested in generating the list that corresponds to cannabis, oral, and chemistry, it's going to combine these lists together to give us a distribution that it thinks best captures all of these, right, using this log-linear model. So this list here is what the model thinks cannabis, oral, and chemistry should use, this distribution. That's actually too strong a constraint for our model because in actuality there might be some combination of these words in practice that we can't quite model, and so we want to allow it to deviate a little bit. And so what we do is we actually sample a distribution using this as a prior, and the posterior is going to be a little bit different, right? And the posterior is going to be capturing the nuances of the combination of cannabis, oral, and chemistry in the data. So take a look at this word list and think about cannabis, oral, and chemistry. What do you think these messages all talk about? >>: Baking products ->> Mark Dredze: Baking products ->>: Edibles. >> Mark Dredze: Edibles [inaudible] right? So sure enough if we look through the data and we say what are the messages that are talking about these three things, and we've learned these word lists from the data, we get all these messages about how to make pot brownies. All right. So let's go back to our example questions. We don't really actually care about pot brownies so much. So for something like Salvia, we want to know how many people are using Salvia, what are the effects of it. And so we're going to use our tool of f-LDA as an extractive summarization system. So what we're going to do is we're going to look at all sentences in our corpus and we're going to pull out sentences that are the best examples of a specific tuple. So for the tuple cannabis, oral, usage, we're going to look for the sentence where, if we look at all the tokens of that sentence and the probability of their assignment to the latent tuples, we get the sentence that is the best example of that tuple. So that's what I'm going to show you here. So what we looked at, for example, is how people are using Salvia. So Salvia, smoking, usage and Salvia, oral, usage. So these two sentences are best reflective of these two tuples, and they only differ in that this one will be smoking and this one will be oral. And here's what we get. The best way is to use a torch lighter, bong or pipe, bong recommended -- not my recommendation, that's the site -- and hold in each hit 20 to 40 seconds. So it's clear that this one is about smoking Salvia. Here, this is a little technical, but this is about the dosage of orally ingesting Salvia. >>: Yet don't use square brackets on that. >> Mark Dredze: Right. That's -- that's -- that's verbatim from the text. I did not put in recommended. What are the effects of Salvia? So here the only difference is in smoking and oral. And so you can see that this one, for example, is when chewed, the first effects are felt after about 15 minutes. Blah, blah, blah. Whereas this one is talking more about, you know, he took one large hit -- so again it's smoking -- and then held it in, laid back, blah, blah, blah, orangish brownish light. Okay. So actually I'll just point out one interesting thing about this data. It says he then took one large hit. So who's he? Right? Who are they talking about? So because these drugs are often legal, but not always, the site forbids discussions about you using drugs. You're not allowed to say that you use drugs or where you got them.
So people have creative ways of expressing this information. So they're always talking about their friend who uses drugs. So this person is talking about he, meaning his friend, who took this drug, and then a very detailed description of what the friend had. But you have other things like people write SWIM, which stands for someone who isn't me, so SWIM took a shot of whatever last night. And then someone will reply SWIY, which means someone who isn't you. We also have much more creative things, like someone will write my pet rabbit Harry, parentheses, 6'2" male, 250 pounds, went to a club with his girlfriend last night and did the following things. So, yeah, I think it's an interesting artifact of the data. So what we did is we showed these -- we did two things. One is we showed these to colleagues of ours who are in the medical school and work on these, and they said these look awesome, good job, and they gave us a pat on the back. But we can't publish based on that. So what we then did is we took these snippets and we took technical reports that were written about these drugs, so we have a technical report that was written about Salvia, and we excerpted the paragraph that talks about how to use Salvia if you smoke it. And we showed that to a user, and then we showed them a bunch of these sentences, and we said which of these sentences is most helpful in writing this tech report, this paragraph. And the sentences came from our model I discussed plus a couple baseline models. And people picked the sentences from our model more often than anything else. And they were saying if I had to write this tech report, which is exactly what these people are doing in the school of medicine, among other things, our examples are going to be the most helpful for writing those reports. So we're really excited because we really think this means that this is a great tool for mining large amounts of data and really finding the information that these public health experts and our friends in the medical school really want to know. So those are the examples I wanted to give you today from these three different areas. There's a lot of people who went into making all this happen, and so I wanted to thank them here. If you guys are interested in other examples that I've kind of referenced but didn't go into the details of today, we have a Web site, socialmediaandhealthresearch -- socialmediahealthresearch.org, which is the group at Johns Hopkins working on this, as well as you can look at my own Web site and e-mail me. And specifically if you want code and data, I'm always happy to share. Thank you. I'm happy to take questions. [applause] >> Eric Horvitz: Any questions or comments? >> Mark Dredze: I don't know anything about specific drugs, so -- and I'm being recorded, so don't ask me those sorts of questions. Yes. >>: What are your thoughts on -- one thing that comes up a lot, just thinking about the world of Twitter and corresponding CDC data, is, okay, so they correlate pretty highly but not perfectly, so which one's right or more accurate along some dimension? So that's kind of a perpetual problem. >> Mark Dredze: Yes. >>: What -- I don't know, thoughts on that? >> Mark Dredze: So obviously there's a lot of things we could track, so let's talk about influenza as an example, because that's what I talked about today. So, first of all, there's also the question of good enough. Right? So how much information do you need in order to make decisions.
And it really depends on what decision you're trying to make. So there are a lot of things that people can do in response to influenza. So, for example, something we're looking at right now is hospitals: when they know there's going to be a big outbreak, more people come to the ER, so they need more doctors on call, they need more staff there and they need more beds. So that's something that's not that hard to deal with if you know it's coming. So even if you have a fuzzy signal, it might be enough to make those kinds of decisions, whereas if you don't and you're caught off guard, it can be a real problem because your ER gets overwhelmed. You also don't want a lot of sick people sitting in your ER all the time. ERs are usually not healthy places to be. Whereas school closings, which happen -- so, for example, during the swine flu, a lot of schools in New York were closed, other places as well. Because when schools are open, parents send their kids, right? Like if you have kids, you kind of pray the kid's better and you send them to school, even though the schools say 24 hours without fever before you can send the kid back, right? But you're like, oh, we're rounding up to get to 24 hours sometimes. So they will close schools because schools are a major spot of infection. So to do that, you might want to have much better information. Usually there they actually want to have diagnosed cases in the school. There's a lot of things in between. There's vaccine campaigns, there's running advertisements. So I see ads all the time in Baltimore that say like stay home if you feel sick, which makes a big difference to other people getting sick. So I think we're certainly at the level of accuracy that we can do many of those things, but certainly not all of them. So that's number one. So that's in terms of good enough, I would say, like it's doing well. In terms of the gold standard, that's really a problem here: we don't have a gold standard. You have this problem all over machine learning, right? So for Web search, right, what is the right page to show as the number one result? And you often just don't know. Sometimes you do navigational queries, you kind of [inaudible]. But oftentimes you just don't know what the right answer is. And so you don't really have a good gold standard to evaluate your answer. All you can say is are people happier, are people -- you know, however we measure happiness, through click-through rates, by advertising dollars, through user studies, are people happier with what we're doing. And I think the same thing is true here. We don't really have a great gold standard. I mean, there's the CDC data, but it's not perfect. So we have to look at a lot of different things. We have to look for is this information useful, is it being delivered in a more timely manner, can we do things with it. So one of the things the CDC is doing this year is they're running a competition to see who can predict influenza -- predict meaning what's happening next week, not what's going on right now. So everyone was really excited about this, but we all said, but we don't do prediction, we do surveillance, and the CDC says, well, do prediction. So our theory is that you can do a better job of prediction with this data, even though it might not be as accurate, because it's more timely. So even if there's not a gold standard to evaluate against, there's still many ways that we can see that the data is helping us to do a variety of things. What else? I give very long answers.
What else? I give very long answers. I can try and cut them down if you want. Okay. So thank you -- yeah, go for it. >>: So obviously you're looking at Twitter data -- >> Mark Dredze: Yes. >>: -- and you can track back to the [inaudible]. Has your IRB said anything about the use of that data? Are they just considering it secondary data that's public? >> Mark Dredze: Yes. >>: Are there any issues that they've -- that you've thought about or that they are thinking about? >> Mark Dredze: Okay. So we could easily do an hour talk on this topic, and I'm not saying that as a joke. I have colleagues in the bioethics department at Hopkins who specialize in social media, and they do give hour talks on the topic. I will try and give you a short answer. So -- >> Eric Horvitz: We have until 3:00, and it's a very good topic. >> Mark Dredze: Oh, really. Okay. Well, let me give you a short answer, and you can ask follow-up questions if you want. >>: This is an important topic for us here. >> Mark Dredze: Right. So let me just make a couple of general comments. One is that some people say we need to understand the difference between privacy and perceived privacy. That's a big one. Even though -- and you guys know all about this -- people have signed agreements, when they use these services, about how their data can be used, it doesn't mean they won't get really furious if you actually use the data in a particular way. Right? I mean, if you remember when Gmail came out, people read the user agreement and got really furious about what it said Google was doing, even though Google wasn't doing any of those things. So that gap between perception and reality, between perceived privacy and real privacy, is something to keep in mind. Then there is aggregation, right? Are we looking at individuals or are we aggregating over populations? That makes a big difference to a lot of people. When it comes to the IRB, there is an exemption for publicly available data, right? There are many exemptions, but that's the one that really is key for us. And every piece of data we're using is publicly available, free to download from the Internet; anyone has access to it. And that is enough of an exemption that the IRB has given us a blanket exemption for the work. If we were to do things like look at an individual user or publish things that we've inferred about a user, I would definitely want to go back and talk to the IRB about that, right? That's much closer to the things the IRB is very concerned about. >>: But what about even general characteristics, like people who smoke Salvia, or however it's pronounced, are likely to be between the ages of 18 and 24 and live in cities? >> Mark Dredze: Right. So my understanding right now is that -- >>: [inaudible] you're not looking at -- >> Mark Dredze: Exactly. So this is -- you know, it's the same question as, what is public health? One person is not public health; a million people is public health; and the dividing line is somewhere in between. Right? It's the same thing with aggregation. Looking at the whole U.S. population, I think everyone can agree that's aggregate. But as you overlay these different demographics on top of each other, which is something we're definitely thinking about doing, you start getting down to the level of one person, right? And as you get closer and closer to one person, or a small group of people, then you have to ask how meaningful the aggregation really is. And that's actually something that, when we start to get to that point, will be another conversation we need to have with the IRB about those social issues.
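One common safeguard when overlaying demographics like this, sketched below purely as an illustration and not as anything described in the talk, is to release only aggregate cells above a minimum size. The file name, columns, and threshold here are all assumptions.

```python
# Minimal sketch (hypothetical names): aggregate inferred flu mentions by
# region and age group, and suppress any cell smaller than a minimum size
# so the "aggregate" never approaches identifying an individual.
import pandas as pd

MIN_CELL_SIZE = 50  # hypothetical suppression threshold

# Assume one row per user with inferred attributes; flu_mention is 0/1.
users = pd.read_csv("inferred_users.csv")  # columns: region, age_group, flu_mention

cells = (
    users.groupby(["region", "age_group"])
         .agg(n_users=("flu_mention", "size"),
              flu_rate=("flu_mention", "mean"))
         .reset_index()
)

# Suppress small cells rather than publishing near-individual aggregates.
released = cells[cells["n_users"] >= MIN_CELL_SIZE]
print(released)
```

The threshold is a policy choice rather than a technical one; the point is only that aggregation stops being meaningful as a cell shrinks toward a single person, which is exactly when the IRB conversation mentioned above becomes necessary.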
>>: Right. And I think doing that segmentation is really important, because if, say, there's a really big health problem with people doing whatever with bath salts, and we know it's mostly a senior-in-high-school thing, then you can tailor your message and tailor how you target people for interventions. >> Mark Dredze: Yeah. >>: Right? >> Mark Dredze: When we get to interventions, those for sure need IRB approval. Right. I should say that's very important. Just looking at public health data, looking at public data, is fine, but the moment you actually try and contact a user, then you absolutely need to go through the IRB. Yeah. Sorry. >>: I'm just saying, I guess, even just developing interventions at a high level, without actually contacting anyone. >> Mark Dredze: Yeah. So I will say that nothing you are saying is wrong. There are different levels of comfort that different IRBs will have. I can tell you about the experience I have with the Johns Hopkins Homewood IRB, which is basically the specific IRB that covers me in the School of Engineering. There are many other IRBs at Hopkins, and I know some of them have somewhat different perspectives on this. I think it happens that all of them agree on the points I've made, but as we get to those different levels, different IRBs will interpret them differently. So Mechanical Turk, for example, which I'm sure you know about, was something that was very challenging for universities, and not just from the IRB standpoint but from a hiring perspective. Right? Because when we submit a receipt and they ask what it's for -- well, we're paying people for work. Right? And it's like, you're paying people to work? Are these people on payroll? Who are they? And we're like, oh, no, no, but we're paying them a dollar an hour. Right? And we're not paying taxes on it. Right? You might know about some of these challenges. Different universities had very different reactions to this, and it actually took a couple of years for universities to settle on what was reasonable here. Some universities basically said you can't use Mechanical Turk, because you can't pay people that little to do this work. Hopkins is fine with the payment side; the IRB side they're a little more careful about. But it depends what you're asking workers to do -- it's very nuanced. If you're studying what they do or how they do it, that is studying a population and requires IRB approval. If you're having them do things where you don't care about how they do it or what they're doing, but you just need something labeled, then that's not human subjects research; you don't need IRB approval. And that's a difficult line to walk. I've had students do things where the IRB said, no, that's definitely a human study, and the student's like, what are you talking about? So I think the same thing is happening with Twitter data: IRBs are trying to wrap their heads around this and figure it out. The same thing is also happening with clinical data, by the way -- I work with clinical data as well.
The sorts of things I want to do with clinical data, when I talk to IRB people, it blows their minds that I'm even asking, not because the things I'm doing are unethical -- you know, I'm not doing unethical surgeries or anything like that -- but just because the amount of data I want on populations is way more than they're normally comfortable with. And so I think IRBs are trying to figure out what to do, because there is so much value in that. Different IRBs have adopted different policies and different security protocols and all of this. I think right now it's a big mess and people are trying to figure it out, but I think there is a tendency towards finding ways to allow this sort of research. So that was a much more expansive answer, but that's my not-one-hour talk on the topic. Anything else? All right. So thank you, guys, very much. If you are watching online, e-mail me questions. And if you were too embarrassed to be recorded asking me a question, you can come over and ask me; I'll take it off the mic. Thank you very much. [applause]