Maximizing the Spread of Influence through a Social Network
David Kempe, Jon Kleinberg, Eva Tardos
Cornell University
KDD 2003

Social network and spread of influence
• A social network spreads INFLUENCE among its members
  – Opinions, ideas, information …
• “Word-of-mouth” effect in viral marketing

Motivating scenarios
1. Adoption of a new drug by doctors and patients: how to reach many patients?
2. Adoption of a new book by professors and students: how to reach many students?
3. Bloggers blogging and publishing weblogs: which blogger should we follow to get the most information?
4. Battle of Water Sensor Networks: how to find the optimal sensor placement?

Problem setting
• Given
  – A limited budget B for initial advertising
  – Influence estimates between individuals
• Goal
  – Trigger a large cascade of influence
• Question
  – Which set of individuals should we target?

What do we have in this paper?
• Form models of influence in social networks
• Obtain data about a particular network (to estimate inter-personal influence)
• Devise an algorithm to maximize the spread of influence

Models of influence
• First mathematical models
  – [Schelling ’70/’78], [Granovetter ’78]
• Large body of subsequent work
  – [Rogers ’95], [Valente ’95], [Wasserman/Faust ’94]
• Two basic classes of diffusion models: threshold and cascade
• General operational view:
  – A social network is a directed graph; each person (individual) is a node
  – Nodes start either active or inactive
  – An active node may trigger activation of neighboring nodes
  – Monotonicity assumption: active nodes never become inactive

Linear threshold model
• A node $v$ has a random threshold $\theta_v \in [0,1]$
• A node $v$ is influenced by each neighbor $w$ according to a weight $b_{v,w}$ such that $\sum_{w} b_{v,w} \le 1$
• A node $v$ becomes active when at least a (weighted) fraction $\theta_v$ of its neighbors is active: $\sum_{w \text{ active}} b_{v,w} \ge \theta_v$

Example
[Figure: step-by-step linear threshold example on a small graph with weighted edges. Legend: inactive node, active node, threshold, active neighbors. The process stops when no further node can be activated.]
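The linear threshold rule can be made concrete with a short simulation. The following is a minimal Python sketch, not code from the paper: the function name `simulate_linear_threshold`, the dictionary layout `weights[v][w]` for $b_{v,w}$, and the toy graph are all assumptions chosen for illustration. Thresholds are drawn uniformly from [0, 1] unless fixed values are passed in.

```python
import random


def simulate_linear_threshold(weights, seeds, thresholds=None, rng=None):
    """Run one linear threshold cascade and return the final active set.

    weights[v][w] holds b_{v,w}, the influence of neighbor w on node v
    (hypothetical layout); for each v the weights should sum to at most 1.
    """
    rng = rng or random.Random()
    # Collect every node that appears as a target or as a neighbor.
    nodes = set(weights)
    for v in weights:
        nodes.update(weights[v])
    if thresholds is None:
        # Random thresholds theta_v drawn uniformly from [0, 1).
        thresholds = {v: rng.random() for v in nodes}

    active = set(seeds)
    changed = True
    while changed:  # repeat until a full pass activates nobody
        changed = False
        for v in nodes - active:
            # Total weight of the currently active neighbors of v.
            incoming = sum(b for w, b in weights.get(v, {}).items() if w in active)
            if incoming >= thresholds[v]:
                active.add(v)
                changed = True
    return active


# Toy example with fixed thresholds so the outcome is deterministic.
weights = {"v": {"u": 0.5, "w": 0.3}, "w": {"u": 0.4}, "x": {"v": 0.2}}
thresholds = {"u": 0.9, "v": 0.6, "w": 0.3, "x": 0.5}
final = simulate_linear_threshold(weights, {"u"}, thresholds)
print(sorted(final))  # ['u', 'v', 'w']
```

In this toy run, `w` activates first (0.4 ≥ 0.3), which lifts the active weight on `v` to 0.8 ≥ 0.6; `x` stays inactive because 0.2 < 0.5. Since active nodes never deactivate, the order in which nodes are checked does not change the final set.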
Independent cascade model
• When node $v$ becomes active, it has a single chance of activating each currently inactive neighbor $w$
• The activation attempt succeeds with independent probability $p_{vw}$

Example
[Figure: step-by-step independent cascade example on a small graph with edge probabilities. Legend: inactive node, active node, newly active node, successful attempt, unsuccessful attempt. The process stops when no newly active node remains.]

Influence maximization problem
• Influence of a node set $S$: $f(S)$
  – Expected number of active nodes at the end, if $S$ is the initial active set
• Problem:
  – Given a parameter $k$ (budget), find a $k$-node set $S$ that maximizes $f(S)$
  – Constrained optimization problem with $f(S)$ as the objective function

Properties of $f(S)$
• Non-negative (obviously)
• Monotone: $f(S + v) \ge f(S)$
• Submodular:
  – Let $N$ be a finite set
  – A set function $f: 2^N \to \mathbb{R}$ is submodular iff $\forall S \subseteq T \subseteq N,\ \forall v \in N \setminus T$: $f(S + v) - f(S) \ge f(T + v) - f(T)$

Bad news
• For a non-negative, monotone submodular function $f$, finding a $k$-element set $S$ that maximizes $f(S)$ is an NP-hard optimization problem
• It is NP-hard to determine the optimum for influence maximization under both the independent cascade model and the linear threshold model

Good news
• We can use a greedy algorithm
  – Start with an empty set $S$
  – For $k$ iterations: add the node $v$ to $S$ that maximizes $f(S + v) - f(S)$
• How good (bad) is it?
  – Theorem: the greedy algorithm is a $(1 - 1/e)$-approximation
  – The resulting set $S$ activates at least a $(1 - 1/e) > 63\%$ fraction of the number of nodes that any size-$k$ set could activate

Greedy algorithm

Other heuristics to find $S$
• High-degree
  – Picks the $k$ nodes with the highest node degree $d_v$
• Distance centrality
  – Picks the $k$ nodes with the lowest average distance to other nodes in the network
• Random
  – Picks $k$ nodes at random

Experiment setup
• Co-authorship network from the physics section of arXiv.org
• A node is an author
• A link is a co-authored paper ($c_{u,v}$ links between authors $u$ and $v$)
• LT model: the edge $v \to u$ has weight $c_{u,v} / d_u$, where $d_u$ is the degree of $u$
• IC model:
  – $p$ between 1% and 10%
  – The edge $v \to u$ has probability $1 - (1 - p)^{c_{u,v}}$

Experiment result on IC model
• The result on the LT model is similar
• At high $p$, the result is not sensitive to the choice of algorithm

Cost-effective Outbreak Detection in Networks
Jure Leskovec, Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne VanBriesen, Natalie Glance
Carnegie Mellon University
KDD 2007

Original greedy
• Inefficient and redundant
• A network of 15,000 nodes takes a few days to complete
• Complexity: $O(knRm)$ ($k$ seeds, $n$ nodes, $R$ simulations per estimate, $m$ edges)

Submodularity property
• Recall: $\forall S \subseteq T \subseteq N,\ \forall v \in N \setminus T$: $f(S + v) - f(S) \ge f(T + v) - f(T)$
• When adding a vertex $v$ to the seed set $S$, the gain of adding $v$ is larger if $S$ is smaller
• Therefore, a large number of nodes do not need to be re-evaluated

CELF algorithm
• If $s_v^{r-1} < s_{v'}^{r}$ then discard $v$: the gain of $v$ from round $r-1$ is an upper bound on its gain in round $r$, so $v$ cannot beat $v'$ in round $r$
• 700 times faster than the original greedy

Efficient Influence Maximization in Social Networks
Wei Chen, Yajun Wang, Siyu Yang
Microsoft Research, Tsinghua University
KDD 2009

Improved greedy
• Construct a graph $G'$
• Obtain $G'$ from $G$ by removing each edge with probability $(1 - p)$, i.e. the edges that do not take part in propagation
• Use DFS/BFS to find the set of vertices reachable from $S$ in $G'$, denoted $R_{G'}(S)$
• Also obtain $R_{G'}(v)$, $\forall v \in V \setminus S$
• Remove overlapping elements so that $R_{G'}(v) \cap R_{G'}(S) = \emptyset$

Improved greedy
• 15-34% faster than the original greedy

Mix with CELF
• Cons
  – CELF must evaluate all candidate vertices in the first round; the number of evaluations only decreases in later rounds
  – Improved greedy must build $G'$ $R$ times
• Mix
  – First vertex: use improved greedy
  – Other vertices: use CELF
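To tie the pieces together, here is a minimal Python sketch of greedy seed selection with CELF-style lazy re-evaluation under the independent cascade model. It is an illustration under stated assumptions, not the authors' implementation: the names `simulate_ic`, `estimate_spread`, and `celf`, the layout `graph[v][w]` for $p_{vw}$, and the `runs` parameter (playing the role of $R$) are all hypothetical. Because $f(S)$ is estimated by Monte Carlo simulation, the submodularity argument behind the lazy step holds only approximately when `runs` is small.

```python
import heapq
import random


def simulate_ic(graph, seeds, rng):
    """One independent cascade run; returns the number of active nodes."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        v = frontier.pop()
        for w, p in graph.get(v, {}).items():
            # v gets a single chance to activate each inactive neighbor w.
            if w not in active and rng.random() < p:
                active.add(w)
                frontier.append(w)
    return len(active)


def estimate_spread(graph, seeds, runs, rng):
    """Monte Carlo estimate of f(S), averaged over `runs` simulations."""
    return sum(simulate_ic(graph, seeds, rng) for _ in range(runs)) / runs


def celf(graph, nodes, k, runs=200, seed=0):
    """Lazy greedy selection of k seeds; returns (seeds, estimated spread)."""
    rng = random.Random(seed)
    # First round: every node is evaluated once.
    # heapq is a min-heap, so marginal gains are stored negated.
    heap = [(-estimate_spread(graph, [v], runs, rng), v, 0) for v in nodes]
    heapq.heapify(heap)
    chosen, spread = [], 0.0
    while len(chosen) < k and heap:
        neg_gain, v, evaluated_at = heapq.heappop(heap)
        if evaluated_at == len(chosen):
            # The gain is fresh for the current seed set. Stale gains are
            # upper bounds (submodularity), so no other node can beat v.
            chosen.append(v)
            spread += -neg_gain
        else:
            # Stale entry: re-evaluate only this node and push it back.
            gain = estimate_spread(graph, chosen + [v], runs, rng) - spread
            heapq.heappush(heap, (-gain, v, len(chosen)))
    return chosen, spread


# Toy graph with probabilities of 1.0 so the outcome is deterministic.
graph = {"a": {"b": 1.0, "c": 1.0}, "b": {"c": 1.0}, "c": {}, "d": {"e": 1.0}, "e": {}}
seeds, spread = celf(graph, ["a", "b", "c", "d", "e"], k=2)
print(seeds, spread)  # ['a', 'd'] 5.0
```

In the toy run, the second round re-evaluates only `b` and `d`; `c` and `e` are never touched again because their first-round gains already fall below the fresh gain of `d`. This is the saving the CELF slide refers to.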