Understanding and Predicting Human Behavior using Propagation: From Flu-trends to Cyber-Security
B. Aditya Prakash
Computer Science, Virginia Tech.
Keynote Talk, BEAMS Workshop, ICDM, Nov 14, 2015

Thanks!
• Reza Zafarani
• Huan Liu

Prakash 2015

Networks are everywhere!
• Facebook Network [2010]
• Gene Regulatory Network [Decourty 2008]
• Human Disease Network [Barabasi 2007]
• The Internet [2005]

Dynamical Processes over networks are also everywhere!

Why do we care?
• Social collaboration
• Information Diffusion
• Viral Marketing
• Epidemiology and Public Health
• Cyber Security
• Human mobility
• Games and Virtual Worlds
• Ecology
• ........

Why do we care? (1: Epidemiology)
• Dynamical Processes over networks [AJPH 2007]
• SI Model: Diseases over contact networks
• CDC data: Visualization of the first 35 tuberculosis (TB) patients and their 1039 contacts

Why do we care? (1: Epidemiology)
• Dynamical Processes over networks [US-MEDICARE NETWORK 2005]
  – Each circle is a hospital
  – ~3000 hospitals
  – More than 30,000 patients transferred
• Problem: Given k units of disinfectant, whom to immunize?

Why do we care? (1: Epidemiology)
• CURRENT PRACTICE vs. OUR METHOD [US-MEDICARE NETWORK 2005]: ~6x fewer!
• Hospital-acquired inf. took 99K+ lives, cost $5B+ (all per year)

Why do we care? (2: Online Diffusion)
• > 800m users, ~$1B revenue [WSJ 2010]
• ~100m active users
• > 50m users

Why do we care? (2: Online Diffusion)
• Dynamical Processes over networks
• Social Media Marketing
• Figure labels: Celebrity, Followers, "Buy Versace™!"

Why do we care? (3: To change the world?)
• Dynamical Processes over networks
• Social networks and Collaborative Action

High Impact – Multiple Settings
• epidemic out-breaks
• Q. How to squash rumors faster?
• products/viruses
• Q. How do opinions spread?
• transmit s/w patches
• Q. How to market better?
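The SI model named on the epidemiology slide above can be sketched in a few lines. This is only an illustrative, discrete-time version under stated assumptions: the function name `simulate_si`, the per-step infection probability `beta`, and the toy path graph are all hypothetical and do not come from the talk.

```python
import random


def simulate_si(adjacency, seeds, beta, steps, rng=None):
    """Discrete-time SI spread on a contact network (illustrative sketch).

    adjacency: dict mapping node -> list of neighbours (assumed undirected).
    seeds:     initially infected nodes.
    beta:      assumed probability that an infected node infects a given
               susceptible neighbour in one step.
    Returns the number of infected nodes after each step, starting at step 0.
    """
    rng = rng or random.Random(0)
    infected = set(seeds)
    history = [len(infected)]
    for _ in range(steps):
        newly_infected = set()
        for node in infected:
            for neighbour in adjacency.get(node, ()):
                if neighbour not in infected and rng.random() < beta:
                    newly_infected.add(neighbour)
        # In SI there is no recovery: infected nodes stay infected.
        infected |= newly_infected
        history.append(len(infected))
    return history


# Toy contact network: a path 0 - 1 - 2 - 3 (hypothetical data).
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
history = simulate_si(path, seeds=[0], beta=1.0, steps=4)
print(history)
```

With `beta=1.0` the infection advances exactly one hop per step, which makes the sketch easy to check by hand; smaller values of `beta` give a stochastic spread that depends on the random seed.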
Research Theme
• DATA: Large real-world networks & processes
• ANALYSIS: Understanding
• POLICY/ACTION: Managing/Utilizing

Research Theme – Public Health
• DATA: Modeling # patient transfers
• ANALYSIS: Will an epidemic happen?
• POLICY/ACTION: How to control out-breaks?

Research Theme – Social Media
• DATA: Modeling Tweets spreading
• ANALYSIS: # cascades in future?
• POLICY/ACTION: How to market better?

In this talk
• DATA: Large real-world networks & processes
• Q1: How to predict Flu-trends better?
• Q2: How does information evolve over time?
• Q3: How do malware attacks evolve over time?

Outline
• Motivation
• Part 1: Learning Models (Empirical Studies)
  – Q1: How to predict Flu-trends better?
  – Q2: How does information evolve over time?
  – Q3: How do malware attacks evolve over time?
• Conclusion

Surveillance [Chen et al. ICDM 2014]
• How to estimate and predict flu trends?
• Figure labels: Hospital record, Lab survey, Population survey, Surveillance Report

GFT & Twitter
• Estimate flu trends using online electronic sources
• Example tweets:
  – So cold today, I’m catching cold.
  – I have headache, sore throat, I can’t go to school today.
  – My nose is totally congested, I have a hard time understanding what I’m saying.

Observation 1: States
• There are different states in an infection cycle.
• SEIR model:
  1. Susceptible
  2. Exposed
  3. Infected
  4. Recovered

Observation 2: Ep. & So. Gap
• Infection cases drop exponentially in epidemiology (Hethcote 2000)
• Keyword mentions drop in a power-law pattern in social media (Matsubara 2012)

HFSTM Model
• Hidden Flu-State from Tweet Model (HFSTM)
  – Each word (w) in a tweet (Oi) can be generated by:
    • A background topic
    • Non-flu related topics
    • State related topics
Diagram labels: Initial prob., Transit. switch, Binary non-flu related switch, Latent state, Transit. prob.,
Binary background switch, Word distribution

HFSTM Model
• Generating tweets
  – Generate the state for a tweet. State: [S, E, I]
  – Generate the topic for a word. Topic: [Background, Non-flu, State]
• Examples:
  – S: This restaurant is really good
  – E: The movie was good but it was freezing
  – I: I think I have flu

Inference
• EM-based algorithm: HFSTM-FIT
  – E-step:
    • $A_t(i) = P(O_1, O_2, \ldots, O_t, S_t = i)$
    • $B_t(i) = P(O_{t+1}, \ldots, O_{T_u} \mid S_t = i)$
    • $\gamma_t(i) = P(S_t = i \mid O_u)$
  – M-step:
    • Other parameters such as state transition probabilities, topic distributions, etc.
  – Parameters learned:

A possible issue with HFSTM
• Suffers from large, noisy vocabulary.
• Semi-supervision for improvement
  – Introduce weak supervision into HFSTM.

HFSTM-A
• HFSTM-A(spect)
  – Introduce an aspect variable y, expressing our belief on whether a word is flu-related or not.
  – The value of y biases the switch variables s.t. flu-related words are more likely to be explained by state topics.
• When the aspect value (y) is introduced, the switching probabilities are updated accordingly.

Vocabulary & Dataset
• Vocabulary (230 words):
  – Flu-related keyword list by Chakraborty SDM 2014
  – Extra state-related keyword list
• Dataset (34,000 tweets):
  – Identify infected users and collect their tweets
  – Train on data from Jun 20, 2013 to Aug 06, 2013
  – Test on two time periods:
    • Dec 01, 2012 to July 08, 2013
    • Nov 10, 2013 to Jan 26, 2014

Learned word distributions
• The most probable words learned in each state
  – S: Probably healthy
  – E: Having symptoms
  – I: Definitely sick

Learned state transition
• Transition probabilities
• Transition in real tweets
• Learned by HFSTM: Not directly flu-related, yet correctly identified

Flu trend fitting
• Ground-truth:
  – The Pan American Health Organization (PAHO)
• Algorithms:
  – Baseline:
    • Count the number of keywords weekly as features, and regress to the ground-truth curve.
  – Google flu trend:
    • Take the Google flu trend data as input, regress to the PAHO curve.
  – HFSTM:
    • Distinguish the different states of keywords, and only use the number of keywords in the I state. Again regress to PAHO.

Flu trend fitting
• Linear regression to the case count reported by PAHO (the ground-truth)

HFSTM-A
• Results are qualitatively similar to HFSTM when the vocabulary is 10 times larger.

Outline
• Motivation
• Part 1: Learning Models (Empirical Studies)
  – Q1: How to predict Flu-trends better?
  – Q2: How does information evolve over time?
  – Q3: How do malware attacks evolve over time?
• Conclusion

Google Search Volume
• Figure labels: (1) First spike, (2) Release date, (3) Two weeks before release, ? ?
• e.g., given (1) first spike, (2) release date of two sequel movies, (3) access volume before the release date

Patterns
• Figure: X-Y plot built up over several slides, with labels: More Data, Anomaly?, Extrapolation, Imputation, Compression

Rise and fall patterns in social media
• Meme (# of mentions in blogs)
  – short phrases sourced from U.S. politics in 2008
  – “you can put lipstick on a pig”
  – “yes we can”

Rise and fall patterns in social media
• Can we find a unifying model, which includes these patterns?
  – four classes on YouTube [Crane et al. ’08]
  – six classes on Meme [Yang et al. ’11]
• Figure: small plots of the pattern classes (axis ticks 0, 50, 100)

Rise and fall patterns in social media
• Answer: YES!
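As a rough illustration of how one model can produce such rise-and-fall shapes, the sketch below simulates the SpikeM recurrence that the following slides present, ΔB(n+1) = p(n+1) · [U(n) · Σ (ΔB(t) + S(t)) · f(n+1−t) + ε] with f(n) = β · n^(−1.5). Several pieces are assumptions, not taken from the slides: the function name `simulate_spikem`, the bookkeeping U(n+1) = U(n) − ΔB(n+1), a shock S(t) that equals `s_b` only at t = n_b, the clamping of each update to the remaining uninformed population, and the optional `period` callable standing in for p(n). It is a forward simulation only and does not fit parameters to data.

```python
def simulate_spikem(n_total, n_ticks, beta, n_b, s_b, eps=0.0, period=None):
    """Forward-simulate a SpikeM-style recurrence (illustrative sketch).

    n_total: assumed total number of bloggers, all un-informed at time 0.
    beta:    strength of infection in the decay f(age) = beta * age**-1.5.
    n_b:     time tick of the external shock.
    s_b:     assumed size of the external shock at time n_b.
    eps:     background noise term.
    period:  optional callable p(n); None means no periodicity (p = 1).
    Returns the list of newly informed bloggers per time tick.
    """

    def decay(age):
        # Power-law infectiveness of a post that is `age` ticks old.
        return beta * age ** -1.5

    def p(n):
        return 1.0 if period is None else period(n)

    d_b = [0.0] * n_ticks          # delta B(n): newly informed at tick n
    uninformed = float(n_total)    # U(n)
    for n in range(n_ticks - 1):
        total = 0.0
        for t in range(n_b, n + 1):
            shock = s_b if t == n_b else 0.0
            total += (d_b[t] + shock) * decay(n + 1 - t)
        new = p(n + 1) * (uninformed * total + eps)
        # Assumption: cannot inform more bloggers than remain un-informed.
        new = min(max(new, 0.0), uninformed)
        d_b[n + 1] = new
        uninformed -= new
    return d_b


# Hypothetical parameter values chosen only to produce a visible spike.
curve = simulate_spikem(n_total=1000, n_ticks=60, beta=0.001, n_b=5, s_b=10)
peak_tick = curve.index(max(curve))
print(peak_tick, round(max(curve), 1))
```

Nothing happens before the shock at tick `n_b`; afterwards word-of-mouth drives a rise until the un-informed population is depleted, and the power-law decay shapes the tail. Passing a `period` callable would modulate the activity over time, in the spirit of the periodicity slide.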
Figure: six panels of Value vs. Time (ticks 0–120), each comparing the Original curve with the SpikeM fit.
• We can represent all patterns by a single model
In Matsubara, Sakurai, Prakash+ SIGKDD 2012

Main idea - SpikeM
1. Un-informed bloggers (uninformed about rumor)
2. External shock at time n_b (e.g., breaking news)
3. Infection (word-of-mouth)
• Timeline: n = 0, n = n_b, n = n_b + 1
• Infectiveness of a blog-post at age n: $f(n) = \beta \cdot n^{-1.5}$
  – β: Strength of infection (quality of news)
  – f(n): Decay function (how infective a blog posting is)

Power Law
• -1.5 slope
• J. G. Oliveira et al. Human Dynamics: The Correspondence Patterns of Darwin and Einstein. Nature 437, 1251 (2005). (also in Leskovec, McGlohon+, SDM 2007)

SpikeM - with periodicity
• Full equation of SpikeM:
  $\Delta B(n+1) = p(n+1) \cdot \left[ U(n) \cdot \sum_{t=n_b}^{n} \big( \Delta B(t) + S(t) \big) \cdot f(n+1-t) + \epsilon \right]$
• Periodicity p(n): Bloggers change their activity over time (e.g., daily, weekly, yearly)
• Figure: activity vs. Time n, with peak activity at 12pm and low activity at 3am

Tail-part forecasts
• SpikeM can capture tail part

“What-if” forecasting
• Figure labels: (1) First spike, (2) Release date, (3) Two weeks before release, ? ?
• e.g., given (1) first spike, (2) release date of two sequel movies, (3) access volume before the release date

“What-if” forecasting
• SpikeM can forecast not only tail-part, but also rise-part!
• Figure labels: (1) First spike, (2) Release date, (3) Two weeks before release
• SpikeM can forecast upcoming spikes

Bonus: Protest Predictions [Sundereisan et al. ASONAM 2014] [Jin et al. SIGKDD 2014]
• Figure label: Violent Protest (VP)
• Can Twitter provide a lead time?
• South American twitter dataset
  – Language: Spanish/Portuguese
  – Idea 1.
Look for trending keywords. 2. Predict event type for protest using SpikeM parameters!
• Figure labels: VP, A political tweet, P, Non Violent Protest (P)

Outline
• Motivation
• Part 1: Learning Models (Empirical Studies)
  – Q1: How to predict Flu-trends better?
  – Q2: How does information evolve over time?
  – Q3: How do malware attacks evolve over time?
• Conclusion

Modeling Malware Penetration
• Worldwide Intelligence Network Environment (WINE)
  – Which machine got which malware (or legitimate files)
  – 1 Billion nodes
  – 37 Billion edges
• Q: Temporal patterns?

Q: Temporal Patterns
• Looks familiar?

SpikeM again (or SharkFin) [Papalexakis et al. ASONAM 2013]
• 7 parameters only!
• Figure: two plots, ~400 points each

Latent Propagation Patterns

BUT
• Does not take into account differences between detections and actual infections.

Domain-based approach: Data [Chan et al. WSDM 2016]
• Looked at the entire 2 years of WINE data.
• Augmented with vulnerability and patch data from NIST’s National Vulnerability Database (NVD)
• Considered all machines from 40 countries (study still ongoing).
• Considered the 50 most commonly occurring malware.

Study Approach: Main Steps

Study Approach: Patch & Detection Incompetence
• Incompetence: 4 base variables to measure hosts' incompetence in detecting malware and incompetence in patching (absolute and relative) w.r.t. various time periods, i.e., how much time each host took to detect or patch for each malware
• For each time tick, we built a directed bipartite graph capturing normalized detection/patching incompetence between malware and hosts

FBP Model
• Dependent variable: For each (c,m) pair, the % of hosts in the country c attacked by malware m.
• Independent variables for each (c,m) pair:
  – ADI, API, RDI, RPI, AADI, ARDI, AAPI, ARPI, ADA, RDA, APA and RPA of hosts in country c, APH and RPH of malware m
  – Six similarity measures for hosts in two different countries
  – Per Capita GDP and HDI of countries
  – Found k-nearest neighbors of each (c,m) pair according to different similarity measures and used features of those countries as well.

DIPS and DIPS-EXP Model
• Infection rate: β(t)
• Patching rates:
  – Susceptible hosts: θ(t)
  – Detected hosts: δ(t)
• Developed an algorithm to learn the best parameters for the DIPS and DIPS-Exp models by minimizing error terms.

Learning DIPS, DIPS-Exp Parameters

Ensemble Models

Experiments: Overall
• We predict infection ratios of hosts in each country for each malware
• Test all country-malware pairs for top 50 malware and top 40 GDP countries w.r.t. # of infections
• NRMSE is important because infection ratios over countries are very different
• FBP shows better performance than FUNNEL w.r.t. all performance measures
• DIPS shows better performance than FBP w.r.t. all performance measures
• ESM0 is the best w.r.t. NRMSE
• FUNNEL*: disease infection prediction model
• FBP + FUNNEL does not work
• The MAE* values were computed with |# of ground-truth infected hosts – the expected # of infected hosts|

Experiments

Summary of Forecasting Experiments
• FBP, DIPS and ESM showed better performance when there were lots of infection attempts.
• FBP showed reliable performance across the board
• DIPS was very accurate when the infectiousness level is high
• ESM combines the advantages of both FBP and DIPS and shows very accurate and reliable performance

Outline
• Motivation
• Part 1: Learning Models (Empirical Studies)
  – Q1: How to predict Flu-trends better?
  – Q2: How does information evolve over time?
  – Q3: How do malware attacks evolve over time?
• Conclusion

Future Plans
• DATA: Large real-world networks & processes
• ANALYSIS: Understanding
• POLICY/ACTION: Managing

Scalability – Big Data
• Datasets of unprecedented scale
  – High dimensionality and sample size!
• Need scalable algorithms for
  – Learning Models
  – Developing Policy
• Leverage parallel systems
  – Map-Reduce clusters (like Hadoop) for data-intensive jobs (more than 6000 machines)
  – Parallelized compute-intensive simulations (like Condor)

Uncertain Data in Cascade analysis (more implementable policies)
• Correcting for missing data [Sundereisan, Vreeken, Prakash. 2014]
  – Figure labels: Original; Nodes sampled off; Culprits, and missing nodes filled in
• Designing More Robust Immunization Policies [Zhang and Prakash. CIKM 2014]

Summarization
• Automatic segmentation?
• Segment cascades?
• …….
• Figure: excerpt from a paper showing a segmentation result for Peru (word clouds of the three segments)

References
1. Scalable Vaccine Distribution in Large Graphs given Uncertain Data (Yao Zhang and B. Aditya Prakash). In CIKM 2014.
2. Fast Influence-based Coarsening for Large Networks (Manish Purohit, B. Aditya Prakash, Chanhyun Kang, Yao Zhang and V. S. Subrahmanian). In SIGKDD 2014.
3. DAVA: Distributing Vaccines over Large Networks under Prior Information (Yao Zhang and B. Aditya Prakash). In SDM 2014.
4. Fractional Immunization on Networks (B. Aditya Prakash, Lada Adamic, Jack Iwashyna, Hanghang Tong, Christos Faloutsos). In SDM 2013.
5. Spotting Culprits in Epidemics: Who and How many? (B. Aditya Prakash, Jilles Vreeken, Christos Faloutsos). In ICDM 2012, Brussels. (Invited to KAIS Journal Best Papers of ICDM.)
6. Gelling, and Melting, Large Graphs through Edge Manipulation (Hanghang Tong, B. Aditya Prakash, Tina Eliassi-Rad, Michalis Faloutsos, Christos Faloutsos). In ACM CIKM 2012, Hawaii. (Best Paper Award)
7. Rise and Fall Patterns of Information Diffusion: Model and Implications (Yasuko Matsubara, Yasushi Sakurai, B. Aditya Prakash, Lei Li, Christos Faloutsos). In SIGKDD 2012, Beijing.
8. Interacting Viruses on a Network: Can both survive? (Alex Beutel, B. Aditya Prakash, Roni Rosenfeld, Christos Faloutsos). In SIGKDD 2012, Beijing.
9. Winner-takes-all: Competing Viruses or Ideas on fair-play networks (B. Aditya Prakash, Alex Beutel, Roni Rosenfeld, Christos Faloutsos). In WWW 2012, Lyon.
10. Threshold Conditions for Arbitrary Cascade Models on Arbitrary Networks (B. Aditya Prakash, Deepayan Chakrabarti, Michalis Faloutsos, Nicholas Valler, Christos Faloutsos). In IEEE ICDM 2011, Vancouver. (Invited to KAIS Journal Best Papers of ICDM.)
11. Time Series Clustering: Complex is Simpler! (Lei Li, B. Aditya Prakash). In ICML 2011, Bellevue.
12. Epidemic Spreading on Mobile Ad Hoc Networks: Determining the Tipping Point (Nicholas Valler, B. Aditya Prakash, Hanghang Tong, Michalis Faloutsos and Christos Faloutsos). In IEEE NETWORKING 2011, Valencia, Spain.
13. Formalizing the BGP stability problem: patterns and a chaotic model (B. Aditya Prakash, Michalis Faloutsos and Christos Faloutsos). In IEEE INFOCOM NetSciCom Workshop, 2011.
14. On the Vulnerability of Large Graphs (Hanghang Tong, B. Aditya Prakash, Tina Eliassi-Rad and Christos Faloutsos). In IEEE ICDM 2010, Sydney, Australia.
15. Virus Propagation on Time-Varying Networks: Theory and Immunization Algorithms (B. Aditya Prakash, Hanghang Tong, Nicholas Valler, Michalis Faloutsos and Christos Faloutsos). In ECML-PKDD 2010, Barcelona, Spain.
16. MetricForensics: A Multi-Level Approach for Mining Volatile Graphs (Keith Henderson, Tina Eliassi-Rad, Christos Faloutsos, Leman Akoglu, Lei Li, Koji Maruhashi, B. Aditya Prakash and Hanghang Tong). In SIGKDD 2010, Washington D.C.

Acknowledgements
• Collaborators
  Deepayan Chakrabarti, Hanghang Tong, Kunal Punera, Ashwin Sridharan, Sridhar Machiraju, Mukund Seshadri, Alice Zheng, Lei Li, Polo Chau, Nicholas Valler, Alex Beutel, Xuetao Wei, Christos Faloutsos, Roni Rosenfeld, Michalis Faloutsos, Lada Adamic, Theodore Iwashyna (M.D.), Dave Andersen, Tina Eliassi-Rad, Iulian Neamtiu, Varun Gupta, Jilles Vreeken, V. S. Subrahmanian, John Brownstein (M.D.)

Acknowledgements
• Students
  Liangzhe Chen, Shashidhar Sundereisan, Benjamin Wang, Yao Zhang, Sorour Amiri

Acknowledgements
• Funding

Making Diffusion Work for You
B. Aditya Prakash
http://www.cs.vt.edu/~badityap
• Data
• Analysis
• Policy/Action