Document 14025272

advertisement
Tracking and Predicting
the Spread of Disease
using Social Media
Henry Kautz Adam Sadilek
Department of Computer Science
Vincent Silenzio
Department of Psychiatry
Organic Sensor
Networks
• Detailed measurements at a population scale
•
•
•
No active user participation
•
•
What will happen in the future?
Fine granularity
Timely
• Inference & Prediction
What factors (places/events/actions) influence health?
Organic Sensor
Networks
• 52% of adults use online social networks ➚
• Smartphone access ➚
•
•
•
Real time
“In the moment”
Location aware
The Data
• New York City
•
16M tweets / month, 0.6M users
•
•
•
2.5M tweets by geo-active users
• 6K geo-active users
103K “follows” relationships
32K “friends” relationships
20:00
19:00
18:00
17:00
16:00
15:00
14:00
13:00
12:00
11:00
10:00
09:00
08:00
07:00
06:00
05:00
04:00
03:00
Language Model
• How to automatically classify a tweet as
“sick”?
• Previous approaches: keywords, e.g. “sick”
• Problems
• “Sick of homework”
• “Under the weather”
• Our approach: statistical word trigram model
• Each trigram is a feature (dimension)
• Support vector machine: find a hyperplane that
separates positive from negative examples
sick
+0.8
+0.8
sick and tired
+0.8
+0.6
-0.7
+0.7
sick and tired of
+0.8
+0.6
-0.7
-0.8
-0.1
sick and tired of flu
+0.8
+0.6
-0.7
+0.6
+0.7
-0.8
How do we get these numbers???
Cascade SVM
Random sample of
200 milion tweets
Labeling
Corpus of
5,128 tweets
labeled by
human
workers
Training
Training
Corpus of
"other"
tweets
Cf
Training
Co
Cs
Corpus of 1.6
million machinelabeled tweets
Labeling
+
Corpus
of "sick"
tweets
Final
corpus
of the author. The
and
symbols denote thresholding of the
classification score, where we select the bottom 10% of the scores
predicted by Co (i.e., tweets that are normal with high probability), and the top 10% of scores predicted by Cs (i.e., likely “sick”
tweets).
SVM Features
Positive Features
Feature
Weight
sick
0.9579
headache
0.5249
flu
0.5051
fever
0.3879
feel
0.3451
coughing
0.2917
being sick 0.1919
better
0.1988
being
0.1943
stomach
0.1703
and my
0.1687
infection
0.1686
morning
0.1647
Negative Features
Feature
Weight
sick of
´0.4005
you
´0.3662
lol
´0.3017
love
´0.1753
i feel your
´0.1416
so sick of
´0.0887
bieber fever ´0.1026
smoking
´0.0980
i’m sick of
´0.0894
pressure
´0.0837
massage
´0.0726
i love
´0.0719
pregnant
´0.0639
Be
pun
of
em
“#s
sta
or
fea
E
sio
the
the
Ne
Co
suf
tha
cia
200
“in
Twitter Health
• Aggregate accuracy comparable with:
• Google Flu Trends (R = 0.73)
• CDC statistics
+ we can model fine-grained interactions
specific individuals
between
Conditional probability of getting sick at t +1
Impact of Co-Location
1
0.8
1 hour time window (T=1h)
4 hour time window (T=4h)
12 hour time window (T=12h)
Prior probability of being sick
f (x) = 0.002 e ( 0 . 0 5 4 x )
0.6
0.4
f (x) = 0.013 e ( 0 . 0 5 5 x )
0.2
f (x) = 0.001 e ( 0 . 0 5 5 x )
0
0
20
40
60
80
100
120
Number of estimated encounters with sick individuals at time t
Impact of Friendships
Conditional probability of getting sick
0.2
0.15
Prob. of getting sick at t +1 given n friends are sick at t
Prob. of getting sick given having n friends (any)
Prob. of getting sick at t +1 given n unencountred friends are sick at t
Prior probability of being sick
0.1
f (x) = 0.003 e ( 0 . 4 1 3 x )
0.05
0
0
2
4
6
Number of friends (n)
8
10
(Sadilek et al AAAI 2012)
Health Prediction
...
ht −1
ht
ht +1
Ot −1
Ot
Ot +1
...
Health Insights
[Snow,1855]
Social Network Centrality Correlates
with Health
Factors Influencing Health
(Sadilek & Kautz WSDM 2013)
Disease Hubs & Vectors
(Brenan et al IJCAI 2013)
Graphical Summary
?
The Data
target users: tweeted from more than one airport
Validating the Twitter
Health Signal Feature Tf
• Computed weekly:
•
•
Center for Disease Control flu measure Cf
Google Flu measure Gf
• Computed daily: Twitter Health measure T
f
Validating Tf
• NYC, Boston, Los Angeles, Seattle, San Francisco
• T correlated with C (R=0.80, p=0.002)
• T correlated with G (R=0.87, p=0.0002)
f
f
f
f
Volume and Sick
Traveller Features
• f(t, x→y) = # Twitter users who flew from airport x to
airport y
•
•
User tweeted from x on day t
User tweeted from y earlier on day t or on day t-1
• V(t,x) = # Twitters users who flew into x on day t
• f (t, x→y) = # sick Twitter users who flew from from
s
airport x to airport y
•
User made “sick” tweet on day t or t-1
• S(t,x) = # sick Twitters users who flew into x on day t
Meeting Feature
• Two users assumed to meet if they appear
within 100 meters of each other within one
hour
• M(t,x) = # meetings that users traveling to
airport x on day t had with sick users on
days t or t-1
• Captures number of exposed individuals
traveling to x
Measuring Explanatory
Power of Features
• Goal: explain weekly change in Google Flu
measure, ΔGf, in each city x
• Linear regression over features from prior 7
days
features
explains % of ΔGf
V(t, x)
V(t, x), S(t,x)
V(t, x), S(t,x), M(t,x)
56%
73%
78%
Prediction
• Goal: predict T for city x on a given day
f
using V(x,t), S(x,t), M(x,t) for 3 previous
days
• Single linear regression model for all cities
• Our prediction of a city's flu index next
week is within 7% of the true value 95% of
the time
nEmesis:
Foodborne Illness
Surveillance
Learning a Food Poisoning
Language Model
How to learn accurate model for very rare messages?
Leverage Human-Machine Computation
Iterative Improvement of
Language Model
Summary
media can be mined to estimate the
• Social
impact of physical encounters, social ties, and
behavioral and environmental factors on public
health
• Cheap and fast (“data exhaust”)
new tool for identifying movement of
• Unique
disease and individual disease carrying
individuals on a global scale
• Could support novel interventions
Download