Tracking and Predicting the Spread of Disease using Social Media

Henry Kautz, Adam Sadilek (Department of Computer Science)
Vincent Silenzio (Department of Psychiatry)

Organic Sensor Networks
• Detailed measurements at a population scale
  • Fine granularity
  • Timely
• No active user participation
• Inference & Prediction
  • What will happen in the future?
  • What factors (places/events/actions) influence health?

Organic Sensor Networks
• 52% of adults use online social networks ➚
• Smartphone access ➚
  • Real time
  • “In the moment”
  • Location aware

The Data
• New York City
• 16M tweets / month, 0.6M users
• 6K geo-active users
• 2.5M tweets by geo-active users
• 103K “follows” relationships
• 32K “friends” relationships

[Figure: hourly time axis from 03:00 to 20:00]

Language Model
• How to automatically classify a tweet as “sick”?
• Previous approaches: keywords, e.g. “sick”
• Problems
  • “Sick of homework”
  • “Under the weather”
• Our approach: statistical word trigram model
  • Each trigram is a feature (dimension)
  • Support vector machine: find a hyperplane that separates positive from negative examples

Phrase                   Feature weights                Total
sick                     +0.8                           +0.8
sick and tired           +0.8 +0.6 -0.7                 +0.7
sick and tired of        +0.8 +0.6 -0.7 -0.8            -0.1
sick and tired of flu    +0.8 +0.6 -0.7 -0.8 +0.7       +0.6

How do we get these numbers?

Cascade SVM
[Figure: cascade SVM training diagram. Boxes and labels: random sample of 200 million tweets; labeling; corpus of 5,128 tweets labeled by human workers; training; classifiers C_o and C_s; corpus of 1.6 million machine-labeled tweets; corpus of "other" tweets; corpus of "sick" tweets; final corpus; classifier C_f]
The thresholding symbols in the diagram denote cutoffs on the classification score: we select the bottom 10% of the scores predicted by C_o (i.e., tweets that are normal with high probability) and the top 10% of the scores predicted by C_s (i.e., likely “sick” tweets).
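
The Language Model slide describes each word n-gram as one feature and the SVM as a separating hyperplane. A minimal sketch of how such a learned hyperplane would score a tweet is given here. It assumes binary features over word n-grams of length 1 to 3, whitespace tokenization, and a zero bias; the weights in `DEMO_WEIGHTS` are hypothetical illustration values, not the trained weights from the talk, and the function names are made up for this example.

```python
from typing import Dict, List, Tuple

Ngram = Tuple[str, ...]

# Hypothetical weights for illustration only (not the talk's trained model).
DEMO_WEIGHTS: Dict[Ngram, float] = {
    ("sick",): 0.8,
    ("flu",): 0.7,
    ("sick", "of"): -0.9,
    ("sick", "and", "tired"): -0.7,
}


def ngrams(tokens: List[str], max_n: int = 3) -> List[Ngram]:
    """Return all word n-grams of length 1..max_n, shortest first."""
    out: List[Ngram] = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            out.append(tuple(tokens[i:i + n]))
    return out


def score(text: str, weights: Dict[Ngram, float], bias: float = 0.0) -> float:
    """Linear decision value: bias plus the weights of the n-grams present.

    Features are binary (present or absent), so each distinct n-gram
    contributes its weight once, however often it occurs.
    """
    tokens = text.lower().split()  # assumption: whitespace tokenization
    present = set(ngrams(tokens))
    return bias + sum(weights.get(g, 0.0) for g in present)


def is_sick(text: str, weights: Dict[Ngram, float]) -> bool:
    """Classify as "sick" when the tweet falls on the positive side."""
    return score(text, weights) > 0.0


if __name__ == "__main__":
    for tweet in ["sick", "sick of homework", "sick with flu"]:
        print(tweet, "->", is_sick(tweet, DEMO_WEIGHTS))
```

With these illustrative weights, the negative bigram "sick of" outweighs the positive unigram "sick", which is the effect that keyword matching alone cannot capture.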
SVM Features

Positive Features
Feature         Weight
sick            0.9579
headache        0.5249
flu             0.5051
fever           0.3879
feel            0.3451
coughing        0.2917
being sick      0.1919
better          0.1988
being           0.1943
stomach         0.1703
and my          0.1687
infection       0.1686
morning         0.1647

Negative Features
Feature         Weight
sick of         -0.4005
you             -0.3662
lol             -0.3017
love            -0.1753
i feel your     -0.1416
so sick of      -0.0887
bieber fever    -0.1026
smoking         -0.0980
i’m sick of     -0.0894
pressure        -0.0837
massage         -0.0726
i love          -0.0719
pregnant        -0.0639

Twitter Health
• Aggregate accuracy comparable with:
  • Google Flu Trends (R = 0.73)
  • CDC statistics
• + we can model fine-grained interactions between specific individuals

Impact of Co-Location
[Figure: conditional probability of getting sick at t+1 (y-axis, 0 to 1) versus number of estimated encounters with sick individuals at time t (x-axis, 0 to 120). Curves for 1 hour (T=1h), 4 hour (T=4h), and 12 hour (T=12h) time windows, plus the prior probability of being sick. Fitted curves shown: f(x) = 0.002 e^{0.054x}, f(x) = 0.013 e^{0.055x}, f(x) = 0.001 e^{0.055x}]

Impact of Friendships
[Figure: conditional probability of getting sick (y-axis, 0 to 0.2) versus number of friends n (x-axis, 0 to 10). Curves: probability of getting sick at t+1 given n friends are sick at t; probability of getting sick given having n friends (any); probability of getting sick at t+1 given n unencountered friends are sick at t; prior probability of being sick. Fitted curve shown: f(x) = 0.003 e^{0.413x}]
(Sadilek et al AAAI 2012)

Health Prediction
[Figure: temporal chain of nodes ..., h_{t-1}, h_t, h_{t+1}, ... linked to nodes O_{t-1}, O_t, O_{t+1}]

Health Insights [Snow, 1855]

Social Network Centrality Correlates with Health

Factors Influencing Health (Sadilek & Kautz WSDM 2013)

Disease Hubs & Vectors (Brennan et al IJCAI 2013)

Graphical Summary ?
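
The co-location and friendship plots above each report an exponential curve of the form f(x) = a e^{bx}. The slides do not say how those curves were fitted, so the following is only a sketch under one common assumption: ordinary least squares on log f(x) = log a + b x. The function names are hypothetical, and the demo points are generated from the friendship curve reported on the slide (a = 0.003, b = 0.413) rather than taken from the underlying data.

```python
import math
from typing import Sequence, Tuple


def exp_model(x: float, a: float, b: float) -> float:
    """Evaluate f(x) = a * exp(b * x)."""
    return a * math.exp(b * x)


def fit_exponential(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Fit y = a * exp(b * x) by least squares on log(y).

    Assumes every y is strictly positive. Returns (a, b).
    """
    n = len(xs)
    logs = [math.log(y) for y in ys]
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (l - mean_log) for x, l in zip(xs, logs))
    b = sxy / sxx                        # slope of the log-linear fit
    a = math.exp(mean_log - b * mean_x)  # intercept, mapped back from log space
    return a, b


if __name__ == "__main__":
    # Synthetic points on the friendship curve from the slide.
    friends = list(range(11))
    probs = [exp_model(n, 0.003, 0.413) for n in friends]
    a, b = fit_exponential(friends, probs)
    print(f"a = {a:.4f}, b = {b:.4f}")
```

Fitting in log space weights relative rather than absolute errors, which matters here because the probabilities span roughly two orders of magnitude over the plotted range.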
The Data
• Target users: tweeted from more than one airport

Validating the Twitter Health Signal Feature T_f
• Computed weekly:
  • Centers for Disease Control flu measure C_f
  • Google Flu measure G_f
• Computed daily:
  • Twitter Health measure T_f

Validating T_f
• NYC, Boston, Los Angeles, Seattle, San Francisco
• T_f correlated with C_f (R=0.80, p=0.002)
• T_f correlated with G_f (R=0.87, p=0.0002)

Volume and Sick Traveller Features
• f(t, x→y) = # Twitter users who flew from airport x to airport y
  • User tweeted from x on day t
  • User tweeted from y earlier on day t or on day t-1
• V(t,x) = # Twitter users who flew into x on day t
• f_s(t, x→y) = # sick Twitter users who flew from airport x to airport y
  • User made “sick” tweet on day t or t-1
• S(t,x) = # sick Twitter users who flew into x on day t

Meeting Feature
• Two users are assumed to meet if they appear within 100 meters of each other within one hour
• M(t,x) = # meetings that users traveling to airport x on day t had with sick users on days t or t-1
• Captures number of exposed individuals traveling to x

Measuring Explanatory Power of Features
• Goal: explain weekly change in Google Flu measure, ΔG_f, in each city x
• Linear regression over features from prior 7 days

Features                    Explains % of ΔG_f
V(t,x)                      56%
V(t,x), S(t,x)              73%
V(t,x), S(t,x), M(t,x)      78%

Prediction
• Goal: predict T_f for city x on a given day using V(t,x), S(t,x), M(t,x) for 3 previous days
• Single linear regression model for all cities
• Our prediction of a city's flu index next week is within 7% of the true value 95% of the time

nEmesis: Foodborne Illness Surveillance

Learning a Food Poisoning Language Model

How to learn an accurate model for very rare messages?
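
The meeting rule from the Meeting Feature slide (two users within 100 meters of each other within one hour) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: distance is computed with the haversine formula on a spherical Earth, a "meeting" is counted once per distinct (traveller, sick user) pair, and the `Tweet` record and function names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Set, Tuple

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; spherical approximation


@dataclass(frozen=True)
class Tweet:
    user: str
    lat: float     # degrees
    lon: float     # degrees
    time_s: float  # seconds since an arbitrary epoch


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two (lat, lon) points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def met(a: Tweet, b: Tweet, max_dist_m: float = 100.0, max_gap_s: float = 3600.0) -> bool:
    """True if two different users tweeted within 100 m and one hour of each other."""
    return (
        a.user != b.user
        and abs(a.time_s - b.time_s) <= max_gap_s
        and haversine_m(a.lat, a.lon, b.lat, b.lon) <= max_dist_m
    )


def meetings_with_sick(
    traveller_tweets: Iterable[Tweet], sick_tweets: Iterable[Tweet]
) -> Set[Tuple[str, str]]:
    """Distinct (traveller, sick user) pairs that met at least once.

    Assumption: the caller has already restricted traveller_tweets to users
    flying into airport x on day t, and sick_tweets to days t and t-1, so
    the size of the returned set plays the role of M(t, x).
    """
    sick = list(sick_tweets)
    return {(t.user, s.user) for t in traveller_tweets for s in sick if met(t, s)}
```

The pairwise loop is quadratic in the number of tweets; at the scale of millions of tweets a spatial index or time-bucketed grid would be needed, which is omitted here for brevity.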
Leverage Human-Machine Computation

Iterative Improvement of Language Model

Summary
• Social media can be mined to estimate the impact of physical encounters, social ties, and behavioral and environmental factors on public health
• Cheap and fast (“data exhaust”)
• Unique new tool for identifying the movement of disease and of individual disease carriers on a global scale
• Could support novel interventions