>> Eric Horvitz: So it's an honor today to have Mark Dredze with us who will be talking
about opportunities from social media data for public health.
Mark is a research professor in computer science at Johns Hopkins University and a
research scientist at the Human Language Technology Center of Excellence. He's also
working with the Center for Language and Speech Processing and the Center for
Population Health Information Technology.
His research is in natural language processing and machine learning, and he's done work
in graphical models; semi-supervised learning; information extraction; large-scale
learning, which I guess is different than regular machine learning; and speech processing.
And we're very excited about the work he's been doing in social media and public health,
probably the closest collaborator in terms of shared vision with some of the work we're
doing in this space at Microsoft Research here.
As some background, Mark did his Ph.D. work at the University of Pennsylvania and has
been at Johns Hopkins ever since. And we were also fortunate to have one of his
students with us last summer, Michael Paul, so it's -- hope to keep the collaborations and
sharing of ideas and maybe even great minds going here.
>> Mark Dredze: Thank you very much. So I was last here 11 years ago when I was an
intern, so thanks for having me back. Hopefully I will not take that long to come back
again.
I was in the base class libraries team of the .NET Framework. You guys know what
.NET Framework is, right? Okay. When I got here, I had no idea, and like -- because it
was still really new and they had to explain like .NET on the Internet is a different thing
and stuff. Anyway. So I was in that team. But we were actually not working at all on
that stuff. We were the FxCop team. Does anyone -- is that still around? Does anyone
know what that is? It's a static code analysis tool. Basically you would give it your code,
it would run it through and say like, you know, you should be using StringBuilder here
instead of string concatenation and this is -- this is -- I don't even know what else it did.
Did a bunch of things like that.
>>: It's tough because you came out of MSR [inaudible] PPRC.
>> Mark Dredze: Right. So my job was to be that bridge --
>>: Cool.
>> Mark Dredze: -- when I was here. I don't know if it still exists. Anyway, so that's
what I did. I'm now doing different things, as you can see.
So what I -- not that I didn't love what I did here, it's just I've done other things. So I'm
talking about social media, but the definition of social media on this first slide is way
more broad than it probably should be, and I'm going to actually talk about what I would
call Web data in general.
So let's start by talking about public health, which is where I'm especially focused. So
public health is the prevention of disease, prolonging of life, and promotion of health in
general. And for those of you who are seeing public health for the first time, these are the
sorts of things that public health works on -- disease surveillance, study how people
self-medicate illness, vaccinations, drug use -- and here I mean illicit drug use, or
recreational drug use -- tobacco use, educating people about health issues.
These are all areas I actually work on within public health, but I picked them to show you
kind of the breadth of things that public health is focused on.
And in public health, if you've ever taken a class in it, you'll see that there's a very
complicated nine- or ten-step cycle that I've summarized here in a two-step cycle just
because that's really the level we need to care about today.
You have population, which is everyone here, right? And then you have doctor, which I
guess is Eric, maybe no one else. But also public health professionals, and not everyone
has an M.D. in this field, there's Ph.D.s too, so you all can be included.
And basically these two groups interact in the following way. There's surveillance, and
surveillance just means information about the population going to these people.
Surveillance sometimes has maybe negative implications. Here all I really mean is that
we're looking at what information we can get out of the population to study the health of
that population.
Then the doctors kind of think about it for a while, and then they develop interventions
which are things that they can do to promote health, reduce disease in the population, and
then they survey those interventions and repeat. And that's how public health works at a
very high level.
So I said before that public health is really about improving health and quality of life and
population, disease, all these things. In order to do this effectively, you need data. You
need data on the population. And that's really a big, big challenge of public health: how
do we get this data?
So traditionally this data comes from two sources, surveys and clinical visits. So surveys
are we either go door to door or we call you on the phone and we say, you know, have
you seen a doctor lately, do you have a primary care physician, do you suffer from
asthma, are you a smoker. We actually do these things. These are CDC-funded studies,
as well as other institutions. That's one way we get information.
The other is clinical visits. So we go to doctors and we say how many of your patients
are smokers, how many of your patients had this disease. There's certain illnesses that
are mandatory reporting illnesses. If you show up with certain very rare illnesses at a
hospital, that hospital has to report back to the state health agency that there was an
outbreak of this illness.
So that's the normal place we get data from. There's some less-known nontraditional
mechanisms. For example, we sometimes sample wastewater coming out of prisons and
sometimes cities to see what drugs are being taken and things like that. But that's a more
niche, let's say, method.
So these are really the data that people use. And this really limits the sort of
research we can do. Because you can imagine, if you know anything about these two
data sources, there's a lot of questions that you might want to ask that you cannot ask
using these two methods or that's very difficult to ask.
So along comes social media, or Web data in general. This slide was not made last night, so
there's a couple of new social media companies that haven't been included, as you can imagine. I
don't know if Snapchat is on here. That's the new thing, right?
So social media has a tremendous amount of information in it, and here I really mean Web
data in general. People talk about politics, sports, entertainment, what they do for a
living. They talk about what they ate for breakfast. And critically for our talk today they
talk about health.
And so that means here that we have an opportunity to look at social media which really
is a reflection of the ongoing lives of people, right? People kind of tweet or write
Facebook or do searches about the sort of things they do on a daily basis. And because
health is a part of that daily-basis life for people, we can see part of that in this data. And
so that has tremendous implications for both facets of this public health cycle.
So in terms of the surveillance aspect, it means that we can do things that we already do
in a better, faster, and cheaper way. And that's very valuable unto itself because there's a
lot of things that we want to do that we do but they're slow to get results. They might
take us a whole year to do the survey, for example. We can do those things faster.
Really exciting, though, are new opportunities, things that we could never do before that
we now can do using this data. And I have examples of both of that today.
I don't have examples of intervention because I don't really focus on intervention. But I
want to just tell you this does happen. So these are things like identifying people for
communication, so you identify who you want to intervene with. Tailoring messages
specifically for them. So I'll give you one example of this.
In Chicago right now there is a group that is looking through Twitter for people who
mention that they got food poisoning at restaurants. And when they see that, they send
those people a link to the public health department's form to report restaurant food
poisonings. Right? And so this is a way where they're basically realizing that people
mention these events, they're not being reported to the public health department and
they're intervening with them to say, hey, can you give us that information, we'll go look
at that restaurant for you.
All right. So I'm really going to talk about the examples here under surveillance. And
I'm going to talk about three types of data. Search logs, which you guys know about
really well. These are, from the level at which I access them, looking just at trends --
we're really only able to get coarse trends. The sort of work you guys can do here, where you
have access to the logs, you can do much more fine-grained stuff. But I won't be talking
about that today.
Social media, which is really very good for shallow content analysis. So you can't really
go very deep into a single message, because these are very short messages, but you often
gain a lot of information out of them nonetheless.
And then Web forums, which are really good not for doing trend-level stuff but for really
doing focused deep knowledge extraction. And I'll give examples of every one of these
today.
Any questions so far? Anyone want to argue about something I said? Those of you
watching online cannot. You can send me an e-mail and complain. But you have to wait
till the end for my e-mail address.
So let's start with talking about search logs. So here's a paper that we just published
recently which is one example of the sorts of questions that we can now ask using this
data. So how are economic health and physical health related on a population level?
And what I mean by this is when there's a recession, do people have negative,
presumably negative, health outcomes because of that recession. We know that the
recession affects a lot of things besides jobs. For example, the divorce rate is heavily
influenced by recessions. You can talk to me later about why that is. Not in the way
you'd think, actually. But we want to ask here does a recession increase, for example,
stress, physical pain, those sorts of things.
And the difficulty here is getting the data to do this sort of study. Right? We need a
large population, we need a long span of time. We want to compare things before and
during or after recession. And we have questions about many different ailments. You
know, I'm just saying health in general. There's a lot of different questions we might
want to ask people about health.
So we're going to get data from Google Trends. Which, if you guys know this, is
basically Google publishes trend data on their kind of most popular queries. I don't
remember how many of the most popular queries. But if a query has a certain number of
million people searching it, they will post trends. And this is mostly for doing things like
realizing that Justin Bieber is really popular. Again, I thought he had gone away. But
that's kind of the idea behind Google Trends. But we're going to use it for health.
So what we did is we looked at 343 queries that we identified using a couple of seed
queries -- Google then kind of recommends additional queries for you to look at -- and then
we went through them by hand and picked out the ones that are actually reflective of health.
And we looked at the top 100 of those queries that increased during December 2008 to
the end of 2011, which was the great recession. And we're going to factor in like overall
search traffic volumes and other things. You have to get that right so you're not chasing
ghosts in the data.
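To make that concrete, here is a minimal Python sketch of the kind of check described above: normalize a health query's weekly Trends series by overall search volume, fit a linear trend to the pre-recession weeks, extrapolate it through the recession, and measure the percent excess. The names, the normalization step, and the simple linear fit are illustrative assumptions, not the paper's exact method.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: weekly relative search volume for one health query and
# for overall search traffic, both indexed by week (pandas DatetimeIndex).
def recession_excess(query_volume: pd.Series,
                     total_volume: pd.Series,
                     recession_start="2008-12-01",
                     recession_end="2011-12-31") -> float:
    # Normalize by overall traffic so general growth in search usage
    # doesn't masquerade as growth in health concern.
    rate = (query_volume / total_volume).dropna()

    pre = rate[rate.index < pd.Timestamp(recession_start)]
    during = rate[(rate.index >= pd.Timestamp(recession_start)) &
                  (rate.index <= pd.Timestamp(recession_end))]

    # Fit a linear trend to the pre-recession weeks and extrapolate it
    # through the recession period as the "expected" baseline.
    x_pre = np.arange(len(pre))
    slope, intercept = np.polyfit(x_pre, pre.values, 1)
    x_during = np.arange(len(pre), len(pre) + len(during))
    expected = slope * x_during + intercept

    # Percent excess over the extrapolated baseline (e.g. +41% for headaches).
    return 100.0 * (during.values.mean() - expected.mean()) / expected.mean()
```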
So what were the biggest increases? So we found headache-related queries went up 41
percent; hernia related, 37. That's the best hernia picture I could find. Chest pain-related,
35 percent. And when I say related, I mean a couple different queries all searching about
headache related items.
And then the single query that had the biggest spike was stomach/ulcer symptoms, which
went up 228 percent during the time period.
So let me show you what this actually looks like. Every one of these lines is one of those
100 queries. So you see any given query has a lot of variance to it. But if you look at the
trend over time, here this gray area here is the recession. So you can see that before the
recession, well, you know, it kind of goes up and down a little bit over time. But there's
this really noticeable uptick during the recession here. So I'll just show you one query --
or one set of queries, headache symptoms. Actually, this might be just the query,
headache symptoms. I'm not -- I don't remember offhand.
And you can see that it kind of varies quite a bit. This is a linear fit to the data before the
recession started. So you see a slight decrease in the slope. I think basically the way to
read that is it's going to continue on. I don't think the slope really should be decreasing
slightly.
But you can see that the increase here is clearly very different from what was going on.
And this is where the recession -- this is where we said December 2008. The recession
actually starts a little bit before. Whatever. It depends how you measure the recession.
So that's just an example of the sort of population level questions we can ask just by
looking at what people search for. And you guys know the work here that Eric and others
are doing, asking really quite interesting, sophisticated questions on this
data and getting an answer that we normally would never have access to. And that's
really quite exciting. Yeah.
>>: So a question. So there's actually a pretty big increase from 2008 to 2009.
>> Mark Dredze: Oh, you mean right in here?
>>: Yeah, just --
>> Mark Dredze: So --
>>: [inaudible] little bit earlier, right? I'm just wondering did you guys look to see if that
was -- do you see that type of thing and might it be predictive?
>> Mark Dredze: So that's a good question. I don't know. So, first of all, when the
recession starts is a little bit difficult. So I don't remember the dates of like Lehman
Brothers collapse and all that, but I remember September 2008 it became apparent there
was something going on. We just probably -- let's see, June, September -- it's like right
around here. I don't know exactly why we use December as the cutoff. It might be that
we're basing it on some other like official recession criteria. I don't really remember.
But even if you just know kind of the history, you can see that something has changed
around here from what happened before. And then any one query or even a small set of
queries is very hard to generalize from because there's a lot of things that can be going
on.
And so if you look at the overall trend in this line here, that's when it really convinces us
something's going on. We have statistical significance numbers and all that in the paper
and such that you can look at. I'm giving you just kind of the general plots here.
Okay. Yeah, you have to -- I mean, with like anything, you have to be careful not to
overfit too much. So we try to stick to some really simple, general statements we can
make about the data.
All right. So that's all I wanted to show you on search logs. Let's talk about social
media. I'll speak about this for a little while longer.
So I'm really talking about Twitter data here, although we're looking at a bunch of other
social media sites as well. But Twitter is really the easiest to work with. We don't have
firehose access, but even just using the 1 percent API you can get a lot of good data.
So the key thing here is we're getting -- I'm sorry, no, we're not getting this. There are
500 million messages a day, so that's a lot of data to work with. Okay. And I assume
everyone here knows what Twitter is.
So the reason we're using it is not just because of the size of the data set, it's because of how
health really does show up in this dataset.
So the first tweet here, nothing like waiting in line to buy cigarettes behind a guy in a
business suit buying gasoline with $10 in dimes.
So this person is not saying I am a smoker, which is what I actually care about, but
they're saying something that actually indicates that they do smoke or they're probably a
smoker. Obviously there's a million reasons why -- well, there's not a million reasons.
There's a couple reasons why you could be buying cigarettes, but it's not a high entropy
distribution over those reasons, so this person's probably a smoker.
And the point is we see a lot of tweets like this that are indicative of health and we can
glean health information from, even though the person is not directly trying to report
health to us. And actually we're starting some new work on tobacco use, and so these
tweets are exactly the sort of thing we're going to be looking for.
So let me just give you some examples of the breadth of things that we've done. So we've
looked at medication use, how people -- for example, seeing that people use Benadryl to
treat insomnia in addition to allergies, looking at patient safety issues, so these are people
reporting that their doctor has made a mistake, either surgical error or prescription error.
Mental health, this is some new work on posttraumatic stress disorder. So there's really
a huge diversity of things you can do with this data.
I'm specifically going to talk about disease surveillance today, which is the work that's
gotten the most attention in the space of Twitter health stuff. So I'm going to tell you a
little bit about what we're doing with disease surveillance.
So last year many of you remember was a really unique year when it comes to the flu.
Not unique once in a hundred years, but unique once in a couple of years, once in a
decade, where there really was a very large flu outbreak in the United States. It reached
epidemic levels. And you might have known this not just because you were sick or your
friends were sick but because there was a lot of coverage of this event in the media. This
is a New York Times article from the beginning of 2013 where they're basically saying,
you know, this is a really big flu year.
So this received a lot of attention. The reason it receives a lot of attention is not just
because flu is an inconvenience, but because flu is very dangerous. Most people who get the flu
recover and they miss a couple days of work and they're fine, but so many people get the
flu that there is a percentage of them who have serious health implications, up to and
including death. And the flu kills a lot of people in the United States every year. I
don't -- I don't have the statistics offhand, but it predominantly affects the elderly and the
young. Okay.
One of the reasons swine flu was so big is because that distribution over who it
affects was skewed, and it affected what we would normally consider healthy people
who aren't usually affected by the flu to that degree.
Anyway, so knowing what's going on with the flu is actually a really serious health concern
in the United States. Because of that, the CDC invests a lot of time and effort into this
problem, and they have a really good influenza surveillance network. It's called FluView,
at least that's what they call it when they post the information online.
This is a nationwide surveillance network. It has 2700 outpatient centers, which is
hospitals as well as doctors' offices. They're reporting ILI, which means influenza-like
illness. What that means is you go into the doctor and the doctor says, well, you have a
fever, you have the chills; I bet you have the flu. Go home and get some rest. Right?
Or, here, take some flu medicine. But there's no necessarily -- there's not necessarily a
confirmation that for sure you have the flu. And that's what you expect, right? The
doctor is not going to order a lab test every time you go in with the flu. Okay. So that's
what gets reported.
So I mentioned that in detail just so you understand that the numbers that we're going to
talk about as a gold standard here from the CDC are by no means a gold standard. Right?
Not everyone there has the flu. And 2700 clinics is a lot, but it doesn't give you a perfect
reading into what's going on in the United States. All right.
So while this does give you very good data, the major cons are it's slow. It takes them
about two weeks to collect this information and to publish it. And so the flu rates that get
published every Friday are two weeks old. And they also get updated over time. So they
revise those estimates as more and more reports come in, and we've recently realized they
can revise them quite a bit.
And there's also varying levels of geographic granularity. So they're looking at a national
level. They also look by region. There's ten regions in the United States they divide up
into. But they don't look at the city level because they just don't have enough information
to look at that.
So Twitter is a very attractive source for providing new data for influenza surveillance.
Because we can do things in real time. Right? I can analyze tweets in real time and tell you
what the flu rate is today. And also it has the potential of geographic specificity.
So the questions are: Can we do this accurately, that will be our first question; can we do
it with geographic specificity; and can we really do it in real time. And what I mean by
this is there's a lot of people who will take these -- take a dataset and take a method for
doing flu surveillance and show you that for last year and two years ago they could do a
really good job.
And so the results I'm going to present you today on the 2012-'13 flu season, we actually
built the season -- we built our predictive system before the season started and then ran it
on the season and then published the result at the end.
If you don't believe me, the NAACL submission deadline where we published it was in I
think December 2012. So that's when we built the system, and it was before the flu
season was really in gear. Okay.
All right. So the first thing is can we find flu on Twitter. And if you just look for the
keyword flu, you can find a lot of examples of it, but they're probably not what you're
looking for.
So here are some of my favorites. The Ray Lewis flu. Does anyone know what the Ray
Lewis flu is? A man who lived in Baltimore, and you don't know the Ray Lewis -- the Ray
Lewis flu is -- so does everyone know who Ray Lewis is? Okay. Ray Lewis -- I
know, you guys are Seahawks fans?
>>: [inaudible].
>> Mark Dredze: You know? Okay. That's right. Congratulations, by the way. Okay.
Did you know who he was before the Super Bowl was the question anyway.
So the Ray Lewis -- so basically Ray Lewis was the -- a line -- he was a linebacker for
the Ravens, and he was particularly ferocious and players would get physically ill the
week before playing him in anticipation of what he would do. It's called the Ray Lewis
flu. Okay. Anyway, so you can, whatever, read the article. Does anyone know what
swag flu is?
>>: We call it the Sherman syndrome here. [inaudible] swag?
>> Mark Dredze: Swag flu. Does anyone know what the swag flu is? So you should
look it up in Urban Dictionary. It's really quite entertaining. The swag flu -- oh, I have it
written down here. A contagious virus that spreads game, confidence, and swagger
among a population of individuals.
So while this might be contagious, this is not really what we're thinking of when we
talked about flu. But if you actually just find tweets with the word flu, you will find all
of these as well as others that I'm not pointing out. Right?
And that's not really what we want. So this is not just a cute problem, it really is a
problem. So here's a plot of flu-related keywords -- I'm sorry, this is not actually
keywords. It doesn't matter where we get those. These are flu-related tweets for the
summer of 2009. In the summer there should not be much flu. But you see there's a
pretty significant increase right here early June. Does anyone remember what was going
on in the world of health in the summer 2009?
>>: Mad cow?
>> Mark Dredze: Mad cow? No. I mean, that's always happening to some degree, but...
so don't feel too bad. I gave this talk to a public health department -- not avian flu.
You're very close. I gave this talk to a public health department; no one there knew
either, so don't feel so bad.
>>: [inaudible].
>> Mark Dredze: It was swine flu. Swine flu really broke -- spring 2009 was when it
came out, and then everyone was getting ready for the fall of 2009 when it was going to
be big. So the summer of 2009, that's what was going on. And we have a blip here. And
actually what that blip is is the World Health Organization announced at the beginning of
June, June 11th, that swine flu was going to be a pandemic. So our system identifies an
increase in flu when there really is no increase in flu there. All that's going on is that
more people on Twitter are talking about flu because this press release came out.
So it's not just that you get cute examples; your system is really misled by things if you
don't do a good job of identifying exactly the tweets you want.
So what we do is we're going to use some statistical classifiers and some NLP features
and [inaudible] speech tag and all this good stuff and we're going to have a three-stage
system here.
The first stage is we identify tweets that are health related, right? So we have basically a
set of health keywords that we filter on, and then even within those keywords we have a
classifier that says is this related to health or not.
We then take the health-related tweets and say is this about flu or not, and specifically
influenza.
And the third thing we do is we say is this tweet showing an awareness of the flu or is it
actually showing that this person is infected with the flu. So I'll give you examples of
that.
So many Americans seem to have bad flu right now. I'm worried this trend will reach
New Zealand in winter. I might need to step up my reclusiveness.
So this person doesn't have the flu. They are expressing awareness of the flu. That's very
different from my flu, cough and fever is getting worse. I'm worried. This person has a
flu infection. That's what this last stage distinguishes, and that turns out to
be really important.
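To make the pipeline concrete, here is a minimal sketch of a three-stage cascade like the one described: a cheap keyword filter, then a health classifier, then a flu classifier, then an infection-versus-awareness classifier. The bag-of-words logistic regression models and the keyword list are stand-ins I'm assuming for illustration, not the actual features or classifiers from the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative seed keywords for the cheap first-pass filter; the real
# system's keyword list and NLP features are much richer than this.
HEALTH_KEYWORDS = {"flu", "fever", "cough", "sick", "headache", "doctor"}

def keyword_filter(tweet: str) -> bool:
    return bool(set(tweet.lower().split()) & HEALTH_KEYWORDS)

def train_stage(texts, labels):
    # One binary classifier per stage: health-related?, flu-related?,
    # infection (vs. mere awareness)?
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                          LogisticRegression(max_iter=1000))
    model.fit(texts, labels)
    return model

def classify_tweet(tweet, health_clf, flu_clf, infection_clf):
    """Return 'infection', 'awareness', or None, applying the cascade."""
    if not keyword_filter(tweet):
        return None
    if health_clf.predict([tweet])[0] != 1:
        return None
    if flu_clf.predict([tweet])[0] != 1:
        return None
    return "infection" if infection_clf.predict([tweet])[0] == 1 else "awareness"
```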
So we're going to use these classifiers, and we're going to find lots of examples of these
tweets. The next question is how do we identify geographically where this is happening.
So Twitter gives us geo tags, which is great. And we'd like to use those in order to geo
tag the data we have. There is now -- in addition to our work, there's one other paper that
I know of that is doing nonnational measurements from Twitter data of the flu, right, and
it came out the same time as ours.
So before that the real focus was on national, and remember when I told you about the
CDC data is one of the weaknesses there is looking at local data. So we really want to
use Twitter to try and look at local data, not just national data.
So can we use Twitter to track these local trends? So is it accurate enough, you know, for
these finer-grained locations, and is there enough data?
So -- this number is actually a little higher now -- in the U.S. it's about 3 percent of
tweets that are geocoded. But that's a tiny fraction of what we actually want to use.
So what we did is we looked at profile information, and it turns out that people can
specify strings that are indicative of location. So this person Allison says they're in New
York, this person Ashley says they're in Florida. Here we have someone who says they're in
Arizona. So we can extract some information about location just by looking at profile
information even if the person isn't geo tagging their tweets.
We also have more challenging cases like New York, which our system can do. There's
many variations of New York with Es and Os in all sorts of places. Does anyone know
where that is?
>>: New Jersey.
>> Mark Dredze: Right, very good. Our system doesn't know that. You could imagine
doing it with a phonetic model of language and such, but that's a little too much. So
we're going to pass up that one. And you can see the influence Justin Bieber has on
Twitter, if you weren't aware.
So we built a very simple system called Carmen, where Carmen takes those strings and
resolves them to one of about 4,000 locations that it knows about. And just we have a
whole paper on this if you're curious, but the one number you have to know is we go
from about a dataset with 1 percent geolocated tweets to 22 percent geolocated tweets.
And you can use our code. It's available in Python and Java. It's online. You can extend
it, whatever you want. But it's available. So we use this to get a lot more data than the
geo tag data that we're provided with.
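The actual Carmen code is the thing to use (it's available in Python and Java), but here is a toy sketch of the underlying idea: prefer the tweet's geotag when present, and otherwise try to resolve the free-text profile location against a table of known location strings. The alias table and field names here are illustrative assumptions, not Carmen's real interface.

```python
import re

# Toy alias table mapping free-text profile strings to (city, state, country);
# the real Carmen system resolves against roughly 4,000 known locations.
LOCATION_ALIASES = {
    "new york": ("New York", "NY", "US"),
    "nyc": ("New York", "NY", "US"),
    "florida": (None, "FL", "US"),
    "arizona": (None, "AZ", "US"),
}

def resolve_profile_location(profile_location):
    """Return a (city, state, country) tuple, or None if unresolved."""
    if not profile_location:
        return None
    text = re.sub(r"[^a-z,\s]", " ", profile_location.lower())
    # Try the whole string, then each comma-separated piece
    # ("Brooklyn, New York" -> whole string, then "brooklyn", then "new york").
    candidates = [text.strip()] + [piece.strip() for piece in text.split(",")]
    for candidate in candidates:
        if candidate in LOCATION_ALIASES:
            return LOCATION_ALIASES[candidate]
    return None

def locate_tweet(tweet):
    # Prefer the explicit geotag (only a few percent of tweets have one),
    # then fall back to the user's free-text profile location.
    if tweet.get("coordinates"):
        return tweet["coordinates"]
    return resolve_profile_location(tweet.get("user", {}).get("location", ""))
```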
So the last question is can -- does this actually work. So let me show you some historical
data first. So this is 2009, 2011. So 2009 was a really easy year to predict. Because of
the way swine flu happened, there was a huge uptick and then a huge drop. And
basically as long as you predict the flu goes up and then down, you're going to do a very
good job of predicting 2009.
And so a keyword-based system -- this is just using keywords; Health and
Human Services ran a competition a couple years ago -- and they're things like flu,
influenza, things like that. This is our flu classifier, which is just is this about flu, not is it
about infection. And this is the Google Flu Trends data. If you know Google Flu Trends,
great. If you don't, they're using queries in order to predict the flu rate, and we can talk
about that later, how they do it.
So these look really similar. But in 2011 it was a very mild season, so it wasn't kind of
this huge increase and decrease. And therefore a lot of systems really struggled.
So here is how our system did where we're actually just looking at infection tweets
nationally. You could see it does a little bit better in 2009, but who cares. In 2011 it cuts
the error in half between the best Twitter method and what Google Flu Trends has.
So really substantial improvements, but harder to do for 2012/2013. So this is where we built the
system and then we ran it on data as it was happening. So the black line here is the CDC
data. The blue line here is the infection data. That's our system. And then the dotted line
is just our influenza tweets, not filtered down by infection.
So you can see, basically -- I mean, well, it's quite obvious -- the solid blue line tracks
the black line much better. The dashed line really has two points that seem to stand out as
wrong. One is this point here that's fairly obvious. So that's the early January media
coverage I mentioned. And then this point here, which is actually when the CDC made
an announcement that there was a flu epidemic, so this is the CDC announced it, and then
a month later CNN said, oh, wow, we should do a story on it, and then New York Times
and everyone did a story.
So this is -- these two increases are really just people talking about flu because it's in the
news, but they're not actually saying -- they're not actually saying that they're infected.
So I don't have a great sense of other systems. I can tell you that about this point in the
season a couple other people published numbers of how well they were doing, and they
were like .66. Oh, our number is .93. So we do really, really well. Other people around
here were doing a little bit worse.
I don't remember how Google did, but Google actually suffered a little bit of this
problem. And there was a really good article in I think Science News about Google Flu
Trends really overestimating. And Google actually published a paper recently where
they said why that was a problem and how they were fixing it.
So we're actually doing very similar to how Google is doing on this task.
>>: Question.
>> Mark Dredze: Yes.
>>: The correlations are for temporally concurrent time periods?
>> Mark Dredze: Yeah.
>>: [inaudible] this is every week?
>> Mark Dredze: So we're looking at a week by week. So each data point is one week.
>>: Ah. Okay.
>> Mark Dredze: So I'll just say one other thing because you said -- you said something
that made me think of this. So we then took these numbers and we chose not to believe
them and tried to do all the statistical analyses we could to show this is not actually a
significant improvement.
So we looked at autocorrelations, cross correlations, shifting things, the first differences
of the -- I don't remember what they're called. We had some statistician friends working
with us. And those are all published in our [inaudible] paper, and after doing all of that
we still are doing a really good job of predicting the flu.
And then the other number I'll just throw out there is if we just do a really simple system
which says look at the weeks where there was big difference, like an above average
difference in the flu rate, compared to the previous week that came before it, and ask our
system is this difference going up or down, our system got a hundred percent. Right?
So that's what I think is a more useful number. When someone looks at this, they're not
going to care necessarily about these little points here about exactly what it was. What
they want to know is is this week worse or better. And on that question we did as well as
you can.
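Here is a small sketch of the two evaluations just described, assuming the Twitter-derived series and the CDC ILI series have already been aligned week by week: a Pearson correlation over concurrent weeks, and agreement on the direction of change for weeks with an above-average change in the CDC rate. The thresholding details are my assumptions, not the exact definitions from the paper.

```python
import numpy as np
from scipy.stats import pearsonr

def evaluate_weekly(predicted, cdc_ili):
    """predicted and cdc_ili: aligned weekly rates for the same season."""
    predicted = np.asarray(predicted, dtype=float)
    cdc_ili = np.asarray(cdc_ili, dtype=float)

    # Pearson correlation over concurrent weeks (the .93-style numbers).
    r, _ = pearsonr(predicted, cdc_ili)

    # Direction-of-change agreement: for weeks whose week-over-week change in
    # the CDC rate is above average in magnitude, did our series move the
    # same way? (This is an assumed, simplified definition.)
    cdc_diff = np.diff(cdc_ili)
    pred_diff = np.diff(predicted)
    big = np.abs(cdc_diff) > np.abs(cdc_diff).mean()
    agreement = np.mean(np.sign(pred_diff[big]) == np.sign(cdc_diff[big]))

    return r, agreement
```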
Okay. So the next thing we did was we want to see if this was going to be useful at a
local level. So we looked at New York City. The reason we looked at New York City is
they're one of the few, if not the only, public health departments for cities that publish data online.
There are other cities that have this data, but it's either not nearly as good as New York
City or they don't publish online.
Apparently New York had a mayor that was really into public health. Anyone follow
New York politics? So he invested a lot of resources in that city, and they have a really
great public health infrastructure as a result. So they have numbers like this.
So they actually published some of the numbers online --
>>: They were also invested in Johns Hopkins.
>> Mark Dredze: He also gave us over a billion dollars [inaudible]. He's a good guy. I
like him. You can put that on the Web.
So they do not publish the numbers we actually care about, though, which is counts
and not percentages. The difference is not important here. But what this allowed us to
do is a blind study.
So rather than downloading the data and running a correlation, which you might not
believe us, we sent them our predictions and we asked them to do the correlation for us
based on their data. So if you don't believe me about the first thing, you might think I'm
lying about that too, but you can call them and they'll verify it.
So this is what they told us about how we were doing. Our infection curve was .88, our keywords
are .72. Comparing that to the national curve, it's -- it's barely statistically significantly
different. So that means we're almost doing as well just on New York City as we would
on the national trends.
And that's very encouraging because it means that if we can actually predict things in a
local way, then we can use the system to help enable local decisions about how to
respond to influenza. And really a lot of these decisions are made on a local basis.
Knowing the national rate is helpful, but if you need to close a school district or run a
vaccination campaign, you want to know that at a local level.
So just the last thing, let's go back to this curve. So, remember, this curve now I can tell
you is just influenza-identified tweets using a classifier. This curve is infection tweets.
And you can see the blips disappear. And in the summer the infection curve doesn't
notice any difference. There's no change in the infection rate.
So I'll show you this. This is one of the first times I'm talking about this. We really want
to give public health researchers as much of the data I've shown you here as possible to
enable them to make decisions. So we're building a Web site that will do exactly that.
And this is not available yet. If you're interested in it, send me an e-mail, and I can share
it with you when we go into our private beta. But we're really looking to this sort of
platform to enable public health researchers to get direct access to our data, not as a
published paper the year after, but in real time to help them make these decisions.
Okay. So that's what I wanted to say about social media. The last thing I wanted to talk
about today is Web forums. And this is an area that a couple people in the field are
working in, but not nearly as many as Twitter. And I really think there's a tremendous
value here that people have been overlooking.
So we've done a number of things here. We've looked at quality of care based on what
people think of their doctors, looking at doctor ratings, for example.
We're looking at prescription drug use, which is something you guys have done as well.
But what I really want to talk about today is something that I think is the most exciting.
We're just talking about illicit drug use. And hopefully the money next to the pills gives
you the illicit feeling and not just that this is someone's prescription.
Well, it might be someone's prescription, but not the person taking it necessarily.
All right. So --
>>: Just tells us it's the U.S. as opposed to Canada.
>> Mark Dredze: It tells you it's the U.S. Right. Yes. I know most about the illicit drug
market of the United States. That made this sound weird. Okay. I think you all know
what I'm trying to say. All right. Very good.
So let me first tell you the dataset that we're using. So we're using a site called
Drugs-Forum. If you've never heard of this site, it's the sort of thing that you really can't
believe exists.
So I'll tell you what they say they are. Drugs-Forum is an information hub of high
standards and a platform where people can freely discuss recreational drugs in a mature,
intelligent manner. Their words; not mine. You might disagree on the mature, intelligent
manner of the site.
Drugs-Forum offers a wealth of quality information and discussion of drug-related
politics in addition to assistance for members struggling with addiction. So this last part
may be true, but this part, freely discuss recreational drugs, really dominates the site.
That's why people go there. They don't go there to talk about policy; they go there
because they're talking about how to get high.
So we have 100,000 messages, 20,000 users. We have self-reported information of those
users. So the users of the site skew male; that is not because recreational drug users
skew male to that degree, it's just because of who's using this Web site. So we know
that's not true by other data.
50 percent of these people say they live in the United States. And then this is the age
breakdown of the younger groups. It goes up higher. I do not believe this number at all
because it says that 58 percent of the site is between 20 and 29. You're not allowed to
use this Web site if you're under the age of 18, and I do not believe that means that those
under the age of 18 choose not to use it; I think they just lie about their age and they will
probably just select this box.
So I do not trust these numbers. These numbers I kind of trust a little bit more.
All right. So why do we want to use this data? So we're really interested in Web-based
drug research. And to help you understand why, I need to tell you a little bit about what's
going on in the world of illicit drugs.
So if you don't know anything about this, you might assume that the world of drugs is
cocaine and heroin and LSD and meth. Anyone else want to throw one out there?
Always interested to see who volunteers which drug. Okay. Let's say those. We're all
comfortable with those. Everyone's heard of that. Those are the big ones. Right? And
that's when you hear about drug addiction. That's what we often hear about.
What's going on now, though, is that there's new synthetic drugs that are coming to
market at an ever-increasing pace. In 2011 [inaudible] recorded 52 new drugs in the
year. So it's an average of one a week. I don't know if they were spaced out like that.
And what's going on is that people find some drug, right, they tinker with it in a lab, they
have a new chemical compound, and they produce a new drug. And they do this because
they're experimenting to try and come up with the next big thing. They're also doing this
in response to policy. So once we make one drug illegal, if you go back to the lab and
you can tinker with it so you have a different drug, right, you can then market that again
as something else.
Because these are new drugs, they're not illegal when they come to market. And so you
could often legally purchase these things. And they often have names that try and mask
what's supposed to be going on so that people don't realize what they're for. So they're
often called incense, spice is another name. Has anyone heard of bath salts? So bath
salts are not bath -- like they're not for baths. Okay? They are called that because of how
they look, but it's completely unrelated. It's another drug. Does anyone know why bath
salts are really popular all of a sudden, in the past year or so? You nodded. Do you
know why?
>>: Yeah, I'm assuming it's really cheap and easy to manufacture.
>> Mark Dredze: Cheap -- those things are true. But that's not why. A lot of people
now have heard of it that didn't before. If you go look at your logs, you'll see a big spike
in bath salt queries.
So about a year ago there was an article on the man in Florida who turned into a cannibal
because he took bath salts. Does anyone remember this?
>>: Yes.
>> Mark Dredze: Yeah. Oh, yeah. So that's not actually true. He didn't take bath salts.
It was misreporting. But everyone heard about bath salts after that. And so bath salts is
the example of something that about a year ago really got the attention of a lot of people
in the drug enforcement -- and then now I believe they're illegal -- but the drug addiction
communities as this big new thing, even though we have posts on it that go back five, six
years.
So what's going on is you have these new things being introduced, and they largely fly
under the radar because they're new and not a lot of people know about them. But the
drug users certainly know about them.
And this poses a really big problem for doctors. So if someone shows up in the ER and
they've overdosed, right, you want to know what they took and how much. And if they
can't tell you or you never heard of the drug before and you don't know how much of that
drug is a normal amount, then it's very difficult to do anything.
The same is true of people who work with addiction, right? They work with teenagers who
are fighting addiction, and if they've never heard of the drug before, it's very hard to try and
help that person.
So there's a huge need for information and really not a good place for this information to
come from.
So the sort of questions we might have about a drug like Salvia, and I'll get to Salvia in a
minute, how does this drug vary by demographic group, in other words, who's taking it,
who's using it, what are the effects, what are the dosages of Salvia. Has anyone heard of
Salvia? It's a little more popular of a drug. Okay. Why -- do you guys know why Salvia
is popular? Not cannibalism.
>>: Because it's legal, right?
>> Mark Dredze: It's not -- I don't think it's legal anymore. But it was for a while, you're
right. But what made it really popular? Do you know why? So if you go on YouTube
and you search for videos, there are a lot of videos of high schoolers taking Salvia and
then recording themselves with webcams and uploading it to YouTube. It was like a
whole big thing. But this got started --
>>: Hanging around with the wrong crowd.
>> Mark Dredze: You got to look at YouTube. YouTube is the place. I'm not
exaggerating. YouTube has tons of drug information. There are people who study
YouTube for that reason.
So Miley Cyrus, there's a video of her at a birthday party --
>>: Yes.
>> Mark Dredze: You know this?
>>: We're hearing about this, the controversy with her and her drug use.
>> Mark Dredze: Yes. So there's a picture of her -- not a picture. There's a video of her
on YouTube where she is smoking Salvia, and that was the big revelation to a lot of high
school kids of, oh, there's this drug called Salvia, and actually really increased the usage.
That's why it became very popular, Miley Cyrus. I'm not going to play the video for you
guys. You guys will have to go see it for yourself. TMZ. It's from TMZ. That's the
place to -- anyway.
>>: [inaudible] my son [inaudible].
>> Mark Dredze: But, by the way, it is interesting to note, the second video here, it
recommends Justin Bieber smoking weed. So I point this out not just because it's funny
but because there really is a lot of information on YouTube about this, about what's going
on. It influences popularity. And people in the drug research community, public health,
they don't really have any of this on their radar.
So what we wanted to do is take all of these drug messages from these drug forums and
kind of summarize them in some way and present them in some way to people who
actually need to read them to learn about what's going on so that they don't have to sit
there and read through every one.
And because we know NLP and machine learning, we thought we could do something
interesting.
So we used a model called factorial LDA, which I'm just going to explain very briefly
here. It's a topic model, so if you're familiar with topic models like latent Dirichlet
allocation, it's the same idea. So word tokens are associated not with a single latent
topic, but instead with a set of latent variables. All right?
And that set of latent variables allows us to jointly model multiple things at once. For
example, we might be interested in modeling topic and perspective or sentiment, and
modeling these things all at the same time instead of a topic model where you do these
things individually as different topics.
So that means instead of a distribution over topics, we have a distribution over what we
call tuples, where each tuple has its own word distribution, just like each topic would.
Ignore the last line, because I don't have time to talk about that.
So we looked at a three-factored model for this dataset where the drugs -- we had 24
drugs -- you see, I could only name four offhand, but there's 24 in this dataset. So like
tobacco, amphetamines, alcohol showed up as well. So 24 drugs. That's one factor of
the model. The second factor is the way it's delivered, and we just looked at injection, oral,
smoking, and snorting. There are a lot of other ways to get drugs in your system that we
didn't look at here.
And then we looked at five aspects of usage: the chemistry of the drug, the culture
around it. And you see really interesting differences in the culture around drugs.
Looking at alcohol versus heroin, for example, you see really different words showing
up. The effects of that drug on a person, the health implications of using that drug, and
information on usage, like dose and preparation.
So what we're going to do is look at tuples of these things, like cannabis, oral, culture, for
example.
So the way factorial LDA does this, I mentioned it captures these tuples, here are some
examples of what it can learn for two tuples. Cocaine, snorting, health and cocaine,
snorting, usage. And you can see that health you have things like nose, blood, things like
that. So nosebleeds. Whereas the usage, coke, lines, cut, right? And even if you don't
know anything about cocaine, this should make sense to you.
And critically what the model is doing is it's saying this word list here and this word list
here, they should be similar in that they're both talking about cocaine and snorting. The
difference should only be in the influence of health versus usage.
The way we do that, right, is we don't want to actually learn every possible combination
of distribution here, because that would be too many to learn independently. So we need
to tie these distributions together in such a way that it encourages these two things to look
at similar as possible except that one difference.
And if you looked at a much longer list of the top words, you would see that they look
much more similar. We're just looking at the top ones here.
So the way we do that is the system learns for each independent factor a list of words that
it thinks are useful. So here are general words about cannabis, general words about oral
use, and general words about chemistry, where the oral use is all drugs in all situations.
This is all chemistry words. Right? These are learned by our model.
And what the model is going to do is when it is interested in generating the list that
corresponds to cannabis, oral, and chemistry, it's going to combine these lists together to
give us a distribution that it thinks best captures all of these, right, using this log linear
model.
So this list here is what the model thinks cannabis, oral, and chemistry should use, this
distribution. That's actually too strong a constraint for our model because in actuality
there might be some combination of these words in practice that we can't quite model,
and so we want to allow it to deviate a little bit.
And so what we do is we actually sample a distribution using this as a prior, and the
posterior is going to be a little bit different, right? And the posterior is going to be
capturing the nuances of the combination of cannabis, oral, and chemistry in the data.
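Here is a toy numerical sketch of that combination step, just to show the shape of the idea: each factor (a drug, a route, an aspect) has a weight vector over the vocabulary; the weights for a tuple are summed and exponentiated to give the prior mean of that tuple's word distribution, and the tuple's actual distribution is then sampled around that prior so it can deviate for combination-specific vocabulary. The function names and the concentration parameter are assumptions, and this leaves out most of the real factorial LDA machinery.

```python
import numpy as np

def tuple_prior(factor_weights, vocab_size, background=None):
    """Combine per-factor word weight vectors (e.g. the vectors for
    'cannabis', 'oral', and 'chemistry') into the prior mean of that
    tuple's word distribution via a log-linear (additive) combination."""
    eta = np.zeros(vocab_size) if background is None else background.copy()
    for weights in factor_weights:
        eta += weights                 # add the factor weights in log space
    prior_mean = np.exp(eta)
    return prior_mean / prior_mean.sum()

def sample_tuple_distribution(prior_mean, concentration=100.0, rng=None):
    # The tuple's actual word distribution is drawn around the prior mean,
    # so it can deviate to capture combination-specific vocabulary
    # (e.g. the pot-brownie words for cannabis + oral + chemistry).
    rng = np.random.default_rng() if rng is None else rng
    return rng.dirichlet(concentration * prior_mean)
```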
So take a look at this word list and think about cannabis, oral, and chemistry. What do
you think these messages all talk about?
>>: Baking products --
>> Mark Dredze: Baking products --
>>: Edibles.
>> Mark Dredze: Edibles [inaudible] right? So sure enough if we look through the data
and we say what are the messages that are talking about these three things and we've
learned these word lists from the data, we get all these messages about how to make pot
brownies.
All right. So let's go back to our example questions. We don't really actually care about
pot brownies so much. So for something like Salvia, we want to know how many people
are using Salvia, what are the effects of it. And so we're going to use our tool of f-LDA
as an extractive summarization system.
So what we're going to do is we're going to look at all sentences in our corpus and we're
going to pull out sentences that are the best examples of a specific tuple. So of the tuple
cannabis, oral, usage, we're going to look for the sentence where, if we look at all the tokens
of that sentence and the probability of their assignment to the latent tuples, it comes out as
the best example of that tuple. So that's what I'm going to show you here.
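A minimal sketch of that selection step, under the assumption that the fitted model already gives us, for each token, a probability of assignment to each latent tuple: score a sentence by the average probability its tokens put on the target tuple, and return the top-scoring sentences. This is a simplification of the actual scoring in the paper.

```python
def score_sentence(token_tuple_probs, target_tuple):
    """token_tuple_probs: one dict per token, mapping latent tuples
    (drug, route, aspect) to assignment probabilities, as inferred by
    the fitted factorial LDA model."""
    if not token_tuple_probs:
        return 0.0
    probs = [p.get(target_tuple, 0.0) for p in token_tuple_probs]
    return sum(probs) / len(probs)     # average token-level probability

def best_sentences(corpus_sentences, target_tuple, k=3):
    """corpus_sentences: list of (sentence_text, token_tuple_probs) pairs.
    Return the k sentences most representative of target_tuple,
    e.g. ('salvia', 'smoking', 'usage')."""
    scored = [(score_sentence(probs, target_tuple), text)
              for text, probs in corpus_sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]
```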
So what we looked at, for example, is how people are using Salvia. So Salvia smoking
uses and Salvia oral usage. So these two sentences are best reflective of these two tuples,
and they only differ in that this one will be smoking and this one will be oral.
And here's what we get. The best way is to use a torch lighter, bong or pipe, bong
recommended -- not my recommendation, that's the site -- and hold in each hit 20 to 40
seconds. So it's clear that that one is smoking Salvia. Here, this is a little technical, but this
is about the dosage of orally ingesting Salvia.
>>: Yet don't use square brackets on that.
>> Mark Dredze: Right. That's -- that's -- that's verbatim from the text. I did not put in
recommended.
What are the effects of Salvia? So here the only difference is in smoking and oral. And
so you can see that this one, for example, is when chewed, the first effects are felt after
about 15 minutes. Blah, blah, blah. Whereas this one is talking more about, you know,
he took one large hit -- so again it's smoking -- and then held it in, laid back, blah, blah,
blah, orangish brownish light.
Okay. So actually I'll just point out one interesting thing about this data. It says he then
took one large hit. So who's he? Right? Who are they talking about? So because these
drugs are often legal, but not always, the site forbids discussions about you using drugs.
You're not allowed to say that you use drugs or where you got them.
So people have creative ways of expressing this information. So they're always talking
about their friend who uses drugs. So this person is talking about he, meaning his friend,
who took this drug, and then a very detailed description of what the friend had. But you
have other things like people write SWIM, which stands for someone who isn't me, so
SWIM took a shot of whatever last night. And then someone will reply SWIY, which
means someone who isn't you.
We also have much more creative things, like someone will write my pet rabbit Harry,
parentheses, 6'2" male, 250 pounds, went to a club with his girlfriend last night and did
the following things. So, yeah, I think it's an interesting artifact of the data.
So what we did is we showed these -- we did two things. One is we showed these to
colleagues of ours who are in the medical school and work on these, and they said these
look awesome, good job, and they gave us a pat on the back. But we can't publish based
on that.
So what we then did is we took these snippets and we took technical reports that were
written about these drugs, so we have a technical report that was written about Salvia,
and we excerpted the paragraph that talks about how to use Salvia if you smoke it.
And we showed that to a user, and then we showed them a bunch of these sentences, and
we said which of these sentences is most helpful in writing this tech report, this
paragraph.
And the sentences came from our models I discussed plus a couple baseline models. And
people picked the sentences from our model more often than anything else. And they
were saying if I had to write this tech report, which is exactly what these people are doing
in the school of medicine, among other things, our examples are going to be the most
helpful for writing those reports.
So we're really excited because we really think this means that this is a great tool for
mining large amounts of data and really finding the information that these public health
experts and our friends in the medical school really want to know.
So those are the examples I wanted to give you today from these three different areas.
There's a lot of people who went into making all this happen, and so I wanted to thank
them here.
If you guys are interested in other examples that I've kind of referenced but didn't go into
the details today, we have a Web site, socialmediaandhealthresearch -- socialmediahealthresearch.org, which is the group at Johns Hopkins working on this, as
well as you can look at my own Web site and e-mail me. And specifically if you want
code and data, I'm always happy to share.
Thank you. I'm happy to take questions.
[applause]
>> Eric Horvitz: Any questions or comments?
>> Mark Dredze: I don't know anything about specific drugs, so -- and I'm being
recorded, so don't ask me those sorts of questions. Yes.
>>: What are your thoughts on -- one thing that comes up a lot is -- just thinking about the
world of Twitter data and corresponding CDC data, one thing that comes up a lot is, okay,
so they correlate pretty highly but not perfectly, so which one's right, or more accurate
along some dimension? So that's kind of a perpetual problem.
>> Mark Dredze: Yes.
>>: What -- I don't know, thoughts on that?
>> Mark Dredze: So obviously there's a lot of things we could track, so let's talk about
influenza as an example, because that's what I talked about today. So, first of all, there's
also the question of good enough. Right? So how much information do you need in
order to make decisions. And it really depends on what decision you're trying to make.
So there are a lot of things that people can do in response to influenza. So, for example,
something we're looking at right now is hospitals, when they know there's going to be a
big outbreak, more people come to the ER, so they need more doctors on call, they need
more staff there and they need more beds. So that's something that's not that hard to deal
with if you know it's coming.
So even if you have a fuzzy signal, it might be enough to make those kind of decisions
whereas if you don't and you're caught off guard, it can be a real problem because your
ER gets overwhelmed. You also don't want a lot of sick people sitting in your ER all the
time. ERs are usually not healthy places to be.
Whereas school closings, which happens -- so, for example, during the swine flu, a lot of
schools in New York were closed, other places as well. Because when schools are open,
parents send their kids, right? Like if you have kids, you kind of pray the kid's better and
you send them to school, even though the schools say 24 hours without fever before
you can send the kid back, right? But you're like, oh, we're rounding up to 24 hours
sometimes.
So they will close schools because schools are a major spot of infection. But to do that, you
might want to have much better information. Usually there they actually want to have
diagnosed cases in the school. There's a lot of things in between. There's vaccine
campaigns, there's running advertisements. So I see ads all the time in Baltimore that say
like stay home if you feel sick, which makes a big difference to other people getting sick.
So I think we're certainly at the level of accuracy that we can do many of those things,
but certainly not all of them. So that's number one. So that's in terms of good enough I
would say, like it's doing well.
In terms of the gold standard, the real problem here is that we don't have a gold
standard. You have this problem all over machine learning, right? So for Web search,
right, what is the right page to show as the number one result? And you often just don't
know. Sometimes you do navigational queries, you kind of [inaudible]. But oftentimes
you just don't know what the right answer is.
And so you don't really have a good gold standard to evaluate your answer. All you can
say is are people happier -- you know, however we measure happiness, through
click-through rates, through advertising dollars, through user studies -- are people
happier with what we're doing.
And I think the same thing is true here. We don't really have a great gold standard. I
mean, there's the CDC data, but it's not perfect. So we have to look at a lot of different
things. We have to look for is this information useful, is it being delivered in a more
timely manner, can we do things with it.
So one of the things the CDC is doing this year is they're running a competition to see
who can predict influenza, predict meaning what's happening next week, not what's going
on right now.
So everyone was really excited about this, but we all said, we don't do prediction, we
do surveillance. And the CDC said, well, do prediction.
So our theory is that you can do a better job of prediction with this data, even though it
might not be as accurate, because it's more timely.
So even if there's not a gold standard or -- to evaluate against, there's still many ways that
we can see that the data is helping us to do a variety of things.
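To make that surveillance-versus-prediction distinction concrete, here is a minimal
sketch; the numbers are made up and this is not the actual system, just the shape of the
two checks:

    # Minimal sketch, with made-up weekly numbers, of the two tasks contrasted
    # above: surveillance (does the social media signal track the CDC signal
    # for the same week?) versus prediction (does this week's signal tell you
    # something about next week's CDC numbers?).
    import numpy as np

    # Hypothetical weekly series: CDC ILI percentages and a Twitter-derived flu rate.
    cdc_ili = np.array([1.2, 1.5, 2.1, 3.0, 4.2, 3.8, 2.9, 2.0])
    twitter_rate = np.array([1.1, 1.6, 2.3, 3.2, 4.0, 3.5, 2.7, 1.9])

    # Surveillance-style check: same-week correlation against the CDC data.
    same_week_corr = np.corrcoef(cdc_ili, twitter_rate)[0, 1]
    print(f"same-week correlation: {same_week_corr:.2f}")

    # Prediction-style check: correlate this week's Twitter rate with next
    # week's CDC rate. (A real forecasting system would be far more involved.)
    lead_corr = np.corrcoef(twitter_rate[:-1], cdc_ili[1:])[0, 1]
    print(f"one-week-ahead correlation: {lead_corr:.2f}")

The point is only that a timelier signal can be evaluated both ways: against the CDC data
for the same week, and as a leading indicator for the following week.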
What else? I give very long answers. I can try and cut them down if you want. Okay.
So thank you -- yeah, go for it.
>>: So obviously you're looking at Twitter data --
>> Mark Dredze: Yes.
>>: -- and you can track back to the [inaudible]. Has your IRB said anything about the
use of that data? Are they just considering it secondary data that's just public?
>> Mark Dredze: Yes.
>>: Are there any issues that they've -- that you've thought about or they are thinking
about?
>> Mark Dredze: Okay. So we could easily do an hour talk on this topic. And I'm not
saying that as a joke. I have colleagues in the bioethics department at Hopkins who
specialize in social media and they do give hour talks on the topic. I will try and give you
a short answer. So --
>> Eric Horvitz: We have until 3:00, so just a very good topic.
>> Mark Dredze: Oh, really. Okay. Well, let me give you a short answer, and you can
ask follow-up questions if you want.
>>: This is an important topic for us here.
>> Mark Dredze: Right. So let me just say a couple general comments. So one is some
people say we need to understand the difference between privacy and perceived privacy.
That's a big one.
So even though -- and you guys know all about this, right -- people have signed
agreements saying that the services they're using can give you their data, it doesn't mean
they won't get really furious if you use the data in a certain way. Right? I mean, if you
remember when Gmail came out, people read the user agreement and got really furious
about what the user agreement said Google was doing, even though Google wasn't doing any
of those things. So those differences between perception and reality, between perceived
privacy and real privacy -- that's something to keep in mind.
There is aggregation, right? So are we looking at individuals or are we aggregating over
populations, that makes a big difference to a lot of people.
When it comes to the IRB, there is an exemption for publicly available data, right?
There's many exemptions, but that's the one that really is key for us. And every piece of
data we're using is publicly available, free to download from the Internet. Anyone
has access to it.
And that is enough of an exemption that the IRB has given us a blanket exemption for the
work.
If we were to do things like look at an individual user or publish things that we've
inferred about the user, right, I would definitely want to go back and talk to the IRB
about that, right? And that's much closer to the things that the IRB is very concerned about.
>>: But what about even general characteristics, like people who smoke Salvia, or
however it's pronounced, are likely to be between the ages of 18 and 24 and live in cities?
>> Mark Dredze: Right. So my understanding right now is that --
>>: [inaudible] you're not looking at --
>> Mark Dredze: Exactly. So this is -- you know, it's the same question of what is
public health. So you guys all talk about what is public health. So one person is not
public health. A million people is public health. And the dividing line is somewhere in
between. Right? So it's the same thing with aggregation, right?
So looking at the whole U.S. population, everyone says -- everyone I think can agree is
aggregate. But as you overlay these different demographics on top of each other, which
is something we're definitely thinking about doing, you start getting down to the level of
one person, right? And as you get closer and closer to one person in a small group of
people, then you have to ask these questions of how meaningful is the aggregation. And
that's actually something that when we start to get to that point, that's another
conversation that we need to have with the IRB about those social issues.
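One common way to operationalize "how meaningful is the aggregation" is to require a
minimum group size before anything is reported. The sketch below is only an illustration
of that idea; the threshold, field names, and records are assumptions, not any IRB's
actual policy or our actual pipeline:

    # Minimal sketch of a minimum-cell-size check before reporting demographic
    # aggregates. MIN_CELL_SIZE and the grouping fields are assumed values.
    from collections import Counter

    MIN_CELL_SIZE = 50  # assumed threshold; real policies vary by IRB and data set

    # Hypothetical per-user records with inferred demographics.
    records = [
        {"age_group": "18-24", "location": "urban"},
        {"age_group": "18-24", "location": "urban"},
        {"age_group": "45-54", "location": "rural"},
    ]

    cells = Counter((r["age_group"], r["location"]) for r in records)
    for cell, count in cells.items():
        if count >= MIN_CELL_SIZE:
            print(cell, count)
        else:
            print(cell, "suppressed (cell below minimum size)")

The more demographic layers you stack, the smaller each cell gets, which is exactly the
point where a check like this, and a conversation with the IRB, becomes necessary.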
>>: Right. And I think doing that segmentation is really important, because, say, there's a
really big health problem with people doing whatever with bath salts, and we know that it's
mostly like a senior-in-high-school thing, then you can tailor your message and tailor how
to target people for interventions.
>> Mark Dredze: Yeah.
>>: Right?
>> Mark Dredze: When we get to interventions, those for sure need IRB approval.
Right. I should say that's very important. So just looking at public health data and
looking at public data is fine, but the moment you actually try and contact the user, then
you absolutely need to go through the IRB. Yeah. Sorry.
>>: I'm just saying I guess even just high-level developing interventions and not actually
even contacting.
>> Mark Dredze: Yeah. So I will say that -- I will say -- nothing you are saying is
wrong. There are different levels of comfort that different IRBs will have. So I can tell
you about the experience I have with the -- not just Johns Hopkins IRB, the Johns
Hopkins Homewood IRB which is what I -- which is basically the specific IRB that
covers me in the School of Engineering. There are three other -- well, there are many
other IRBs at Hopkins, and I know some of them that have some different perspectives
on this.
I think it happens to be that all of them agree on the points I've said. But as we get to
those different levels, different IRBs will interpret them differently.
So Mechanical Turk, for example, which you guys know about, I'm sure -- that was
something that was very challenging for IRBs, and not just from the IRB standpoint but
from a hiring perspective. Right? Because when we said we want to go -- you know, we
submit a receipt and they say what's this for, well, we're paying people for work. Right?
And it's like, you're paying people to work, are these people on payroll? Who are they? And
like, oh, no, no. But we're paying them a dollar an hour. Right? And we're not paying
taxes on it. Right?
So different universities might know about some of these challenges. Different
universities had very different reactions to this. And it took actually a couple years for
universities to actually fall in line in terms of what was reasonable here. But some
universities basically said you can't use Mechanical Turk because you can't pay people
that little to do this work. Hopkins is fine with the payment; it's the IRB side they're a
little concerned about.
But it depends what you're asking them to do. So if you're studying -- it's very nuanced.
If you're studying what they do or how they do it, that is studying a population and
requires IRB approval. If you're having them do things where you don't care about how
they do it or what they're doing, but you just need something labeled, maybe, then that's
not a human subjects experiment. You don't need IRB approval.
And that's a very, I mean, difficult line to walk. And I've had students who have done
things that the IRB has said no, that's definitely a human study, and the student's like
what are you talking about.
So I think the same thing is happening with Twitter data. IRBs are trying to wrap their
heads around this and figure it out.
The same thing is also happening with clinical data, by the way, because -- I work in
clinical data as well. The sorts of things I want to do with clinical data, when I talk to
IRB people, it blows their mind that I'm even asking to do this, not because the things
I'm doing are -- you know, I'm not doing unethical surgeries or anything like that -- but
just because the amount of data I want on populations is way more than they're normally
comfortable with.
And so I think IRBs are trying to figure out what to do, because there is so much value in
that. And different IRBs have come up with different policies and different security
protocols and all this.
I think right now it's a big mess and people are trying to figure it out, but I think there is a
tendency towards figuring out ways to allow this sort of research. So this is a much more
expansive answer, but that's my not-one-hour talk on the topic.
Anything else? All right. So thank you, guys, very much. If you are watching online,
e-mail me questions. And if you were too embarrassed to get recorded and ask me a
question, you can come over and ask me. I'll take it off the mic. Thank you very much.
[applause]