15-11-14-aditya-beams-icdm-talk - People

Understanding and Predicting Human
Behavior using Propagation:
From Flu-trends to Cyber-Security
B. Aditya Prakash
Computer Science
Virginia Tech.
Keynote Talk, BEAMS Workshop, ICDM, Nov 14, 2015
Thanks!
• Reza Zafarani
• Huan Liu
Prakash 2015
Networks are everywhere!
Facebook Network [2010]
Gene Regulatory Network
[Decourty 2008]
Human Disease Network
[Barabasi 2007]
The Internet [2005]
Dynamical Processes over networks
are also everywhere!
Why do we care?
• Social collaboration
• Information Diffusion
• Viral Marketing
• Epidemiology and Public Health
• Cyber Security
• Human mobility
• Games and Virtual Worlds
• Ecology
........
Why do we care? (1: Epidemiology)
• Dynamical Processes over networks
[AJPH 2007]
SI Model
Diseases over contact networks
CDC data: Visualization of
the first 35 tuberculosis
(TB) patients and their
1039 contacts
Why do we care? (1: Epidemiology)
• Dynamical Processes over networks
• Each circle is a hospital
• ~3000 hospitals
• More than 30,000 patients
transferred
[US-MEDICARE
NETWORK 2005]
Problem: Given k units of
disinfectant, whom to immunize?
Why do we care? (1: Epidemiology)
~6x
fewer!
CURRENT PRACTICE
[US-MEDICARE
NETWORK 2005]
OUR METHOD
Hospital-acquired infections took 99K+
lives, cost $5B+ (all per year)
Why do we care? (2: Online
Diffusion)
> 800m users, ~$1B
revenue [WSJ 2010]
~100m active users
> 50m users
Why do we care? (2: Online
Diffusion)
• Dynamical Processes over networks
Buy Versace™!
Followers
Celebrity
Social Media Marketing
Why do we care?
(3: To change the world?)
• Dynamical Processes over networks
Social networks and Collaborative Action
High Impact – Multiple Settings
epidemic out-breaks
Q. How to squash rumors faster?
products/viruses
Q. How do opinions spread?
transmit s/w patches
Q. How to market better?
Research Theme
ANALYSIS
Understanding
POLICY/
ACTION
DATA
Large real-world
networks & processes
Managing/Utilizing
Research Theme – Public Health
ANALYSIS
Will an epidemic
happen?
POLICY/
ACTION
DATA
Modeling # patient
transfers
How to control
out-breaks?
Research Theme – Social Media
ANALYSIS
# cascades in
future?
POLICY/
ACTION
DATA
Modeling Tweets
spreading
How to market
better?
In this talk
Q1: How to predict Flutrends better?
DATA
Large real-world
networks & processes
Q2: How does information
evolve over time?
In this talk
Q3: How do malware
attacks evolve over time?
DATA
Large real-world
networks & processes
Outline
• Motivation
• Part 1: Learning Models (Empirical Studies)
– Q1: How to predict Flu-trends better?
– Q2: How does information evolve over time?
– Q3: How do malware attacks evolve over time?
• Conclusion
[Chen et al. ICDM 2014]
Surveillance
• How to estimate and predict flu trends?
Hospital record
Lab survey
Population survey
Surveillance
Report
GFT & Twitter
• Estimate flu trends using online electronic
sources
So cold today, I’m catching cold.
I have headache, sore throat, I can’t
go to school today.
My nose is totally congested, I have
a hard time understanding what I’m
saying.
Observation 1: States
• There are different states in an infection cycle.
• SEIR model:
1. Susceptible
2. Exposed
3. Infected
4. Recovered
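The four states can be illustrated with a minimal discrete-time compartment simulation. This is only a sketch of the textbook SEIR cycle: the rates `beta`, `sigma`, and `gamma`, the population size, and the function name `simulate_seir` are illustrative assumptions, not values or code from the talk.

```python
def simulate_seir(n, e0, i0, beta, sigma, gamma, steps):
    """Discrete-time SEIR: S -> E -> I -> R. All rates are hypothetical."""
    s, e, i, r = float(n - e0 - i0), float(e0), float(i0), 0.0
    history = [(s, e, i, r)]
    for _ in range(steps):
        new_exposed = beta * s * i / n   # S -> E: contact with infected hosts
        new_infected = sigma * e         # E -> I: incubation ends
        new_recovered = gamma * i        # I -> R: recovery
        s -= new_exposed
        e += new_exposed - new_infected
        i += new_infected - new_recovered
        r += new_recovered
        history.append((s, e, i, r))
    return history


# Assumed example values: 1000 people, one initial infection.
history = simulate_seir(n=1000, e0=0, i0=1, beta=0.5, sigma=0.2, gamma=0.1, steps=100)
```

Each tuple in `history` holds the (S, E, I, R) counts at one time step; the total population stays constant because every flow leaves one compartment and enters the next.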
Observation 2:
Epidemiology & Social Media Gap
• Infection cases drop exponentially in epidemiology (Hethcote 2000)
• Keyword mentions drop in a power-law pattern in social media (Matsubara 2012)
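To make the gap concrete, the sketch below compares an exponential decay with a power-law decay at a few time points. The rate `0.5` and the exponent `1.5` are illustrative assumptions chosen only to show the difference in tail behavior.

```python
import math


def exponential_decay(t, rate=0.5):
    """Exponential drop, as assumed for infection case counts."""
    return math.exp(-rate * t)


def power_law_decay(t, exponent=1.5):
    """Power-law drop, as assumed for keyword mentions (t >= 1)."""
    return t ** -exponent


for t in (1, 5, 10, 50):
    print(t, exponential_decay(t), power_law_decay(t))
```

For large `t` the power law stays far above the exponential, i.e., mentions in social media linger long after the case counts have died out.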
HFSTM Model
• Hidden Flu-State from Tweet Model (HFSTM)
– Each word (w) in a tweet (Oi) can be generated by:
• A background topic
• Non-flu related topics
• State related topics
[Figure: HFSTM graphical model. Labels: initial prob., transit. prob., transit. switch, latent state, binary non-flu related switch, binary background switch, word distribution]
HFSTM Model
• Generating tweets
Generate the state for a tweet
Generate the topic for a word
State: [S, E, I]
Topic: [Background, Non-flu, State]
S: This restaurant is really good
E: The movie was good but it was freezing
I: I think I have flu
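The two generation steps can be sketched as a toy sampler. This is a simplified illustration, not the HFSTM implementation: the probabilities, the tiny vocabularies, and the single three-way topic draw (instead of the model's binary switches) are all assumptions made for the example.

```python
import random

STATES = ["S", "E", "I"]
# Hypothetical parameters; in the model they are learned from tweets.
INITIAL = {"S": 0.8, "E": 0.15, "I": 0.05}
TRANSITION = {
    "S": {"S": 0.8, "E": 0.15, "I": 0.05},
    "E": {"S": 0.2, "E": 0.5, "I": 0.3},
    "I": {"S": 0.3, "E": 0.1, "I": 0.6},
}
TOPIC_PROBS = {"background": 0.4, "non_flu": 0.3, "state": 0.3}
WORDS = {
    "background": ["the", "is", "i", "a"],
    "non_flu": ["movie", "restaurant", "school"],
    "S": ["good", "great"],
    "E": ["cold", "freezing", "tired"],
    "I": ["flu", "fever", "sick"],
}


def draw(dist, rng):
    """Sample one key from a {key: probability} dictionary."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]


def generate_tweets(num_tweets, words_per_tweet, rng):
    tweets, state = [], None
    for _ in range(num_tweets):
        # Step 1: generate the state for the tweet (Markov chain over S, E, I).
        state = draw(INITIAL, rng) if state is None else draw(TRANSITION[state], rng)
        words = []
        for _ in range(words_per_tweet):
            # Step 2: generate the topic for each word, then the word itself.
            topic = draw(TOPIC_PROBS, rng)
            vocab = WORDS[state] if topic == "state" else WORDS[topic]
            words.append(rng.choice(vocab))
        tweets.append((state, words))
    return tweets


tweets = generate_tweets(5, 6, random.Random(0))
```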
Inference
• EM-based algorithm: HFSTM-FIT
– E-step:
• $A_t(i) = P(O_1, O_2, \ldots, O_t, S_t = i)$
• $B_t(i) = P(O_{t+1}, \ldots, O_{T_u} \mid S_t = i)$
• $\gamma_t(i) = P(S_t = i \mid O_u)$
– M-step:
• Other parameters such as state transition probabilities,
topic distributions, etc.
– Parameters learned:
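The E-step quantities have the shape of the standard forward-backward recursions for a hidden Markov model. The sketch below is a generic version under that assumption: it treats the per-tweet likelihoods `emission_probs[t][i] = P(O_t | S_t = i)` as given, whereas HFSTM-FIT would compute them from the topic and switch variables. The function name and example numbers are hypothetical.

```python
def forward_backward(initial, transition, emission_probs):
    """Return forward A, backward B, and posterior gamma for one user.

    emission_probs[t][i] is P(O_t | S_t = i), assumed to be precomputed.
    """
    T, K = len(emission_probs), len(initial)
    A = [[0.0] * K for _ in range(T)]
    B = [[1.0] * K for _ in range(T)]
    for i in range(K):
        A[0][i] = initial[i] * emission_probs[0][i]
    for t in range(1, T):
        for j in range(K):
            A[t][j] = emission_probs[t][j] * sum(
                A[t - 1][i] * transition[i][j] for i in range(K))
    for t in range(T - 2, -1, -1):
        for i in range(K):
            B[t][i] = sum(
                transition[i][j] * emission_probs[t + 1][j] * B[t + 1][j]
                for j in range(K))
    likelihood = sum(A[T - 1])  # P(O_1, ..., O_T)
    gamma = [[A[t][i] * B[t][i] / likelihood for i in range(K)] for t in range(T)]
    return A, B, gamma


# Made-up example with states ordered S, E, I and three tweets.
initial = [0.8, 0.15, 0.05]
transition = [[0.8, 0.15, 0.05], [0.2, 0.5, 0.3], [0.3, 0.1, 0.6]]
emission_probs = [[0.5, 0.3, 0.1], [0.2, 0.5, 0.3], [0.05, 0.3, 0.6]]
A, B, gamma = forward_backward(initial, transition, emission_probs)
```

This unscaled form underflows on long tweet sequences; a practical version would rescale each step or work in log space.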
A possible issue with HFSTM
• Suffers from large, noisy vocabulary.
• Semi-supervision for improvement
– Introduce weak supervision into HFSTM.
HFSTM-A
• HFSTM-A(spect)
– Introduce an aspect variable y, expressing our belief on
whether a word is flu-related or not.
– The value of y biases the switch variables s.t. flu-related
words are more likely to be explained by state topics.
When the aspect value (y) is
introduced, the switching probabilities
are updated accordingly.
Vocabulary & Dataset
• Vocabulary (230 words):
– Flu-related keyword list by Chakraborty SDM
2014
– Extra state-related keyword list
• Dataset (34,000 tweets):
– Identify infected users and collect their tweets
– Train on data from Jun 20, 2013-Aug 06, 2013
– Test on two time periods:
• Dec 01, 2012-Jul 08, 2013
• Nov 10, 2013-Jan 26, 2014
Learned word distributions
• The most probable words learned in each state
Probably healthy: S
Having symptoms: E
Definitely sick: I
Learned state transition
Transition probabilities
Transition in real tweets
Learned by HFSTM:
Not directly flu-related,
yet correctly identified
Flu trend fitting
• Ground-truth:
– The Pan American Health Organization (PAHO)
• Algorithms:
– Baseline:
• Count the number of keywords weekly as features, and
regress to the ground-truth curve.
– Google Flu Trends:
• Take the Google Flu Trends data as input, regress to the PAHO
curve.
– HFSTM:
• Distinguish the different states of keywords, and only use the
number of keywords in the I state. Again regress to PAHO.
Flu trend fitting
• Linear regression to the case count
reported by PAHO (the ground-truth)
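A minimal sketch of the regression step: fit a line from weekly keyword counts to reported case counts by ordinary least squares. The numbers below are made up for illustration, and `fit_line` is a hypothetical helper, not code from the study.

```python
def fit_line(x, y):
    """Ordinary least squares for y ~ slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up weekly counts of I-state keywords and made-up case counts.
weekly_keyword_counts = [12, 30, 55, 80, 61, 33]
reported_cases = [100, 240, 430, 640, 480, 270]
slope, intercept = fit_line(weekly_keyword_counts, reported_cases)
fitted = [slope * c + intercept for c in weekly_keyword_counts]
```

The same helper applies to any of the three feature sources (raw keyword counts, Google Flu Trends values, or I-state keyword counts); only the input series changes.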
HFSTM-A
• Results are qualitatively similar to HFSTM
when the vocabulary is 10 times larger.
Outline
• Motivation
• Part 1: Learning Models (Empirical Studies)
– Q1: How to predict Flu-trends better?
– Q2: How does information evolve over time?
– Q3: How do malware attacks evolve over time?
• Conclusion
Google Search Volume
(1) First spike
(2) Release date
(3) Two weeks before release
e.g., given (1) first spike,
(2) release date of two sequel movies
(3) access volume before the release date
Patterns
[Figure sequence: plot of Y against X; successive builds add the labels "More Data", "Anomaly", "Extrapolation", "Imputation", and "Compression"]
Rise and fall patterns in social media
• Meme (# of mentions in blogs)
– short phrases sourced from U.S. politics in 2008
“you can put lipstick on a pig”
“yes we can”
Rise and fall patterns in social media
• Can we find a unifying model, which
includes these patterns?
• four classes on YouTube [Crane et al. ’08]
• six classes on Meme [Yang et al. ’11]
[Figure: panels of example rise-and-fall curves]
Rise and fall patterns in social media
• Answer: YES!
[Figure: six panels of Value vs. Time, each comparing the Original data with the SpikeM fit]
• We can represent all patterns by a single model
In Matsubara, Sakurai, Prakash+ SIGKDD 2012
Main idea - SpikeM
- 1. Un-informed bloggers (uninformed about rumor)
- 2. External shock at time $n_b$ (e.g., breaking news)
- 3. Infection (word-of-mouth)
[Figure: network snapshots at time $n = 0$, $n = n_b$, and $n = n_b + 1$]
Infectiveness of a blog-post at age n:
$f(n) = \beta \cdot n^{-1.5}$
- $\beta$: Strength of infection (quality of news)
- $f(n)$: Decay function (how infective a blog posting is); a power law
-1.5 slope
J. G. Oliveira et al. Human Dynamics: The
Correspondence Patterns of Darwin and Einstein.
Nature 437, 1251 (2005).
(also in Leskovec, McGlohon+, SDM 2007)
SpikeM - with periodicity
• Full equation of SpikeM
$$\Delta B(n+1) = p(n+1) \cdot \left[ U(n) \cdot \sum_{t=n_b}^{n} \big(\Delta B(t) + S(t)\big) \cdot f(n+1-t) + \varepsilon \right]$$
Periodicity p(n)
Bloggers change their activity over time
(e.g., daily, weekly, yearly)
[Figure: activity p(n) vs. time n, with peak activity at 12pm and low activity at 3am]
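The SpikeM update can be simulated directly from the equation $\Delta B(n+1) = p(n+1) \cdot [U(n) \cdot \sum_{t=n_b}^{n} (\Delta B(t) + S(t)) \cdot f(n+1-t) + \varepsilon]$ with $f(n) = \beta \cdot n^{-1.5}$. The sketch below rests on several assumptions that the slides do not state: the shock `S(t)` equals `S_b` at `t = n_b` and zero otherwise, the uninformed count is updated as `U(n+1) = U(n) - ΔB(n+1)`, the periodicity is a sinusoid with amplitude `P_a`, period `P_p`, and shift `P_s`, and new infections are clamped to the remaining uninformed population. All parameter values are made up.

```python
import math


def spikem(N, beta, n_b, S_b, eps, P_a, P_p, P_s, T):
    """Simulate newly informed bloggers dB[0..T] under the assumptions above."""

    def f(age):
        # Power-law decay of infectiveness for a post of the given age (>= 1).
        return beta * age ** -1.5

    def p(n):
        # Assumed sinusoidal periodicity; p(n) == 1 when P_a == 0.
        return 1.0 - 0.5 * P_a * (math.sin(2 * math.pi / P_p * (n + P_s)) + 1.0)

    U = float(N)          # uninformed bloggers
    dB = [0.0] * (T + 1)  # newly informed bloggers per time tick
    for n in range(T):
        influence = sum(
            (dB[t] + (S_b if t == n_b else 0.0)) * f(n + 1 - t)
            for t in range(n_b, n + 1))
        new = p(n + 1) * (U * influence + eps)
        new = min(max(new, 0.0), U)  # cannot inform more than remain uninformed
        dB[n + 1] = new
        U -= new
    return dB


# Made-up parameters: shock of size 10 at tick 5, no periodicity, no noise.
dB = spikem(N=1000, beta=0.001, n_b=5, S_b=10, eps=0.0,
            P_a=0.0, P_p=24, P_s=0, T=60)
```

With these values nothing happens before the shock, the first informed bloggers appear at tick `n_b + 1`, and activity then rises and falls as the uninformed population is used up.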
Tail-part forecasts
• SpikeM can capture the tail part
“What-if” forecasting
(1) First spike
(2) Release date
(3) Two weeks before release
e.g., given (1) first spike,
(2) release date of two sequel movies
(3) access volume before the release date
“What-if” forecasting
–SpikeM can forecast not only tail-part, but also rise-part!
(1) First spike
(2) Release date
(3) Two weeks before release
• SpikeM can forecast upcoming spikes
Bonus: Protest Predictions
[Sundereisan et al. ASONAM 2014]
[Jin et al. SIGKDD 2014]
• Can Twitter provide a lead time?
• South American Twitter dataset
– Language: Spanish/Portuguese
– Idea
1. Look for trending keywords.
2. Predict event type for protest using SpikeM parameters!
[Figure: a political tweet, classified as Violent Protest (VP) or Non-Violent Protest (P)]
Outline
• Motivation
• Part 1: Learning Models (Empirical Studies)
– Q1: How to predict Flu-trends better?
– Q2: How does information evolve over time?
– Q3: How do malware attacks evolve over time?
• Conclusion
Modeling Malware Penetration
• Worldwide Intelligence Network Environment (WINE)
– Which machine got which malware
(or legitimate files)
– 1 Billion nodes
– 37 Billion edges
• Q: Temporal patterns?
Q: Temporal Patterns
Looks familiar?
[Papalexakis et al. ASONAM 2013]
SpikeM again (or SharkFin)
7 parameters only!
~400 points
Latent Propagation Patterns
BUT
• Does not take into account differences
between detections and actual infections.
[Chan et al. WSDM 2016]
Domain-based approach: Data
• Looked at the entire 2 years of WINE data.
• Augmented with vulnerability and patch data
from NIST’s National Vulnerability Database
(NVD)
• Considered all machines from 40 countries –
study still ongoing. Considered the 50 most
commonly occurring malware.
Study Approach: Main Steps
Study Approach: Patch & Detection
Incompetence
• Incompetence: 4 base variables measure hosts' incompetence in
detecting malware and in patching (absolute and relative)
w.r.t. various time periods, i.e., how much time each host took to detect or patch
each malware
• For each time tick, we built a directed bipartite graph capturing normalized
detection/patching incompetence between malware and hosts
FBP Model
• Dependent variable: For each (c,m) pair, the % of
hosts in the country c attacked by malware m.
• Independent variables for each (c,m) pair:
– ADI, API, RDI, RPI, AADI, ARDI,AAPI, ARPI, ADA, RDA,
APA and RPA of hosts in country c, APH and RPH of
malware m
– Six similarity measures for hosts in two different
countries
– Per Capita GDP and HDI of countries
– Found k-nearest neighbors of each (c,m) pair
according to different similarity measures and used
features of those countries as well.
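The neighbor-based features can be sketched as follows: for each (c,m) pair, find the k most similar other pairs and append the mean of their feature vectors. This is an illustration only. Cosine similarity, the averaging step, and the helper names are assumptions; the slides mention six similarity measures without defining them.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def augment_with_neighbors(features, k):
    """features maps a (country, malware) pair to its feature list.

    Requires at least one other pair and k >= 1.
    """
    augmented = {}
    for key, vec in features.items():
        others = [(cosine_similarity(vec, other), name)
                  for name, other in features.items() if name != key]
        nearest = [name for _, name in sorted(others, reverse=True)[:k]]
        neighbor_mean = [sum(features[n][d] for n in nearest) / len(nearest)
                         for d in range(len(vec))]
        augmented[key] = vec + neighbor_mean  # own features + neighbors' mean
    return augmented


# Made-up two-dimensional features for three hypothetical pairs.
features = {
    ("A", "m1"): [1.0, 0.0],
    ("B", "m1"): [0.9, 0.1],
    ("C", "m1"): [0.0, 1.0],
}
augmented = augment_with_neighbors(features, k=1)
```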
DIPS and DIPS-EXP
Model
• Infection rate: β(t)
• Patching rates:
– Susceptible hosts: θ(t)
– Detected hosts: δ(t)
Developed an algorithm to
learn the best parameters for the
DIPS and DIPS-Exp models by
minimizing error terms.
Learning DIPS, DIPS-Exp
Parameters
Ensemble Models
Experiments : Overall
• We predict infection ratios of hosts in each country for each malware
• Test all country-malware pairs for top 50 malware and top 40 GDP countries w.r.t.
# of infections
• NRMSE is important because infection ratios over countries are very different
FBP shows better performance than FUNNEL
w.r.t. all performance measures
DIPS shows better performance than FBP
w.r.t. all performance measures
ESM0 is the best w.r.t. NRMSE
FUNNEL*: disease infection prediction model
FBP + FUNNEL does not work
The MAE* values were computed with
|# of ground-truth infected hosts – the expected # of
infected hosts|
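The two error measures can be sketched as below. The MAE follows the absolute-difference definition on the slide. For NRMSE the slides do not say how the RMSE is normalized, so dividing by the mean of the ground-truth values is an assumption; the function names are hypothetical.

```python
import math


def mae(actual, predicted):
    """Mean absolute error between ground-truth and expected values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def nrmse(actual, predicted):
    """RMSE divided by the mean of the actual values (assumed normalization)."""
    rmse = math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
    return rmse / (sum(actual) / len(actual))
```

Normalizing makes errors comparable across countries whose infection ratios differ by orders of magnitude, which is the reason given above for reporting NRMSE.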
Experiments
Summary of Forecasting
Experiments
• FBP, DIPS and ESM showed better performance
when there were lots of infection attempts.
• FBP showed reliable performance across the
board
• DIPS was very accurate when the infectiousness level
is high
• ESM combines the advantages of both FBP and DIPS and
shows very accurate and reliable performance
Outline
• Motivation
• Part 1: Learning Models (Empirical Studies)
– Q1: How to predict Flu-trends better?
– Q2: How does information evolve over time?
– Q3: How do malware attacks evolve over time?
• Conclusion
Future Plans
ANALYSIS
Understanding
POLICY/
ACTION
DATA
Large real-world
networks & processes
Managing
Scalability – Big Data
• Datasets of unprecedented scale
– High dimensionality and sample size!
• Need scalable algorithms for
– Learning Models
– Developing Policy
• Leverage parallel systems
– Map-Reduce clusters (like Hadoop) for data-intensive
jobs (more than 6000 machines)
– Parallelized compute-intensive simulations (like
Condor)
Uncertain Data in Cascade analysis
(more implementable policies)
Correcting for missing data
[Figure: original graph with nodes sampled off; culprits and missing nodes filled in]
Sundereisan, Vreeken, Prakash. 2014
Designing More Robust
Immunization Policies
Zhang and Prakash. CIKM
2014
Summarization
• Automatic segmentation?
[Figure: excerpt from a paper showing an MDSAS segmentation result for Peru as word clouds of the segments]
• Segment cascades?
…….
References
1. Scalable Vaccine Distribution in Large Graphs given Uncertain Data (Yao Zhang and B. Aditya Prakash) – In CIKM 2014
2. Fast Influence-based Coarsening for Large Networks (Manish Purohit, B. Aditya Prakash, Chanhyun Kang, Yao Zhang and V. S. Subrahmanian) – In SIGKDD 2014
3. DAVA: Distributing Vaccines over Large Networks under Prior Information (Yao Zhang and B. Aditya Prakash) – In SDM 2014
4. Fractional Immunization on Networks (B. Aditya Prakash, Lada Adamic, Jack Iwashyna, Hanghang Tong, Christos Faloutsos) – In SDM 2013
5. Spotting Culprits in Epidemics: Who and How many? (B. Aditya Prakash, Jilles Vreeken, Christos Faloutsos) – In ICDM 2012, Brussels (Invited to KAIS Journal Best Papers of ICDM.)
6. Gelling, and Melting, Large Graphs through Edge Manipulation (Hanghang Tong, B. Aditya Prakash, Tina Eliassi-Rad, Michalis Faloutsos, Christos Faloutsos) – In ACM CIKM 2012, Hawaii (Best Paper Award)
7. Rise and Fall Patterns of Information Diffusion: Model and Implications (Yasuko Matsubara, Yasushi Sakurai, B. Aditya Prakash, Lei Li, Christos Faloutsos) – In SIGKDD 2012, Beijing
8. Interacting Viruses on a Network: Can both survive? (Alex Beutel, B. Aditya Prakash, Roni Rosenfeld, Christos Faloutsos) – In SIGKDD 2012, Beijing
9. Winner-takes-all: Competing Viruses or Ideas on fair-play networks (B. Aditya Prakash, Alex Beutel, Roni Rosenfeld, Christos Faloutsos) – In WWW 2012, Lyon
10. Threshold Conditions for Arbitrary Cascade Models on Arbitrary Networks (B. Aditya Prakash, Deepayan Chakrabarti, Michalis Faloutsos, Nicholas Valler, Christos Faloutsos) – In IEEE ICDM 2011, Vancouver (Invited to KAIS Journal Best Papers of ICDM.)
11. Time Series Clustering: Complex is Simpler! (Lei Li, B. Aditya Prakash) – In ICML 2011, Bellevue
12. Epidemic Spreading on Mobile Ad Hoc Networks: Determining the Tipping Point (Nicholas Valler, B. Aditya Prakash, Hanghang Tong, Michalis Faloutsos and Christos Faloutsos) – In IEEE NETWORKING 2011, Valencia, Spain
13. Formalizing the BGP stability problem: patterns and a chaotic model (B. Aditya Prakash, Michalis Faloutsos and Christos Faloutsos) – In IEEE INFOCOM NetSciCom Workshop, 2011
14. On the Vulnerability of Large Graphs (Hanghang Tong, B. Aditya Prakash, Tina Eliassi-Rad and Christos Faloutsos) – In IEEE ICDM 2010, Sydney, Australia
15. Virus Propagation on Time-Varying Networks: Theory and Immunization Algorithms (B. Aditya Prakash, Hanghang Tong, Nicholas Valler, Michalis Faloutsos and Christos Faloutsos) – In ECML-PKDD 2010, Barcelona, Spain
16. MetricForensics: A Multi-Level Approach for Mining Volatile Graphs (Keith Henderson, Tina Eliassi-Rad, Christos Faloutsos, Leman Akoglu, Lei Li, Koji Maruhashi, B. Aditya Prakash and Hanghang Tong) – In SIGKDD 2010, Washington D.C.
Acknowledgements
Collaborators
Deepayan Chakrabarti,
Hanghang Tong,
Kunal Punera,
Ashwin Sridharan,
Sridhar Machiraju,
Mukund Seshadri,
Alice Zheng,
Lei Li,
Polo Chau,
Nicholas Valler,
Alex Beutel,
Xuetao Wei
Christos Faloutsos
Roni Rosenfeld,
Michalis Faloutsos,
Lada Adamic,
Theodore Iwashyna (M.D.),
Dave Andersen,
Tina Eliassi-Rad,
Iulian Neamtiu,
Varun Gupta,
Jilles Vreeken,
V. S. Subrahmanian
John Brownstein (M.D.)
Acknowledgements
• Students
Liangzhe Chen
Shashidhar Sundereisan
Benjamin Wang
Yao Zhang
Sorour Amiri
Acknowledgements
Funding
Making Diffusion Work
for You
B. Aditya Prakash
http://www.cs.vt.edu/~badityap
Analysis
Policy/Action
Data