Wherefore Art Thou R3579X? Anonymized y Social Networks, Hidden Patterns and Structural Steganegraphy L Lars B k t Backstrom, C thi Dwork, Cynthia D k Jon J Kleinberg Kl i b 2010‐2‐9 1 Anonymized Networks 1. For researchers, names are meaningless 2. Users also need privacy 2010‐2‐9 2 Anonymized Network(cont Network(cont’)) Release just a big random labeled graph with millions of edges! No other conceptual data released! 2010‐2‐9 3 Attack Scenario 1. Attacker may add a few new nodes prior to release. - Would like to add as few as possible 2. Attacker may add edges from these few nodes to any nodes in the network - May M link li k tto any person whose h ‘‘name’’ h he knows k 3. Goal of the attack is to discover if two named users are connected, i.e. their connections 2010‐2‐9 4 AttackScenario(cont Structure ) Attack Scenario(cont’) Sam Bob Lily 2010‐2‐9 Before release of the network, the attacker finds some targeted users ( m d users) (named s s) 5 Attack Structure(cont Structure(cont’)) Sam H Bob Lily Attacker 1 creates a few new nodes 1. 2. adds edges g among g them and edges from the new nodes to the named nodes 2010‐2‐9 6 Attack Structure(cont Structure(cont’)) G H Network is released to public! 2010‐2‐9 7 Attack Structure(cont Structure(cont’)) G H Attacker locates the h inserted dH 2010‐2‐9 8 Attack Structure(cont Structure(cont’)) G Sam Follow F ll links li k to t identify named users Bob Lily 2010‐2‐9 9 Attack Structure(cont Structure(cont’)) G Follow F ll links li k to t identify named users Attacker then determines which pairs are connected! Done! 2010‐2‐9 10 How can the attacker locates th inserted the i t d H 2010‐2‐9 11 Walk-Based Attack 1 Attacker chooses k named users 1. W = {w1,w2, . . . ,wk} 2010‐2‐9 12 Walk-Based Attack 1 Attacker chooses k named users 1. W = {w1,w2, . . . ,wk} 2. Creates k new nodes X = {x1 {x1, x2 x2, . . . , xk} H Size k 2010‐2‐9 13 Walk-Based Attack 1 Attacker chooses k named users 1. W = {w1,w2, . . . ,wk} 2. Creates k new nodes X = {x1 {x1, x2 x2, . . . , xk} H 2010‐2‐9 3 Creates edges (xi,wi) 3. (xi wi) 14 Walk-Based Attack 1 Attacker chooses k named users 1. W = {w1,w2, . . . ,wk} 2. Creates k new nodes X = {x1 {x1, x2 x2, . . . , xk} H 3 Creates edges (xi,wi) 3. (xi wi) 4. Creates each edge (xi, xi+1) – Hamiltonian Path g with p probability y 0.5 and each other edge 2010‐2‐9 15 G Is the attacker able to locate the inserted H now 2010‐2‐9 16 G This might not work! 2010‐2‐9 17 G Why this might not work? 2010‐2‐9 18 G Why this might not work? More than one matches! H? 2010‐2‐9 No way to differentiate! 19 Success Conditions U i Uniqueness G H S - No S != X such that G[S] and G[X] = H are isomorphic - H has no non-trivial non trivial automorphisms Intuitively, Isomorphism: without labeling, two graphs are actually the same Automorphism: a graph has internal symmetry (relabeling its nodes preserves graph’s structure) 2010‐2‐9 20 Success Conditions(cont Conditions(cont’)) U i Uniqueness G H S - No S != X such that G[S] and G[X] = H are isomorphic - H has no non-trivial non trivial automorphisms Recoverability - Find H efficiently, given G 2010‐2‐9 21 Uniqueness Proof Intuition: Intuition: - Erdös - Erdös E d showed h d that h a random d graph h will ll not contain certain non-random subgraphs g p - We need to show that a non-random graph will not have a certain random subgraph 2010‐2‐9 22 Uniqueness Proof P Proof f technique: t h i 1 Count the number of 1. size k subgraphs in G’ = G − H 2010‐2‐9 23 Uniqueness Proof P Proof f technique: t h i 1 Count the number of 1. size k subgraphs in G’ = G − H 2. Find probability that each one is a match 2010‐2‐9 24 Uniqueness Proof P Proof f technique: t h i 1 Count the number of 1. size k subgraphs in G’ = G − H 2. Find probability that each one is a match 3. Take union-bound to show that there is no match with high probability 2010‐2‐9 25 Uniqueness Proof(cont Proof(cont’)) 1 In G’ 1. G , number of labeled subgraph of size k, k 2. Probability that a particular labeled subgraph matches H is no more m m than 3. Probability that there is at least a match in these 3 subgraphs is (use union bound) It goes to 0 exponentially quickly in k ! 2010‐2‐9 26 Uniqueness Proof(cont Proof(cont’) ) v Intuition: 1. k is small, high g probability p y that there’s match in G’; 2 k iis llarge, low 2. l probability. b bilit But we want k to be relatively small, so we have to find a Balance here! We choose 2010‐2‐9 for a small constant 27 Uniqueness Proof(cont Proof(cont’)) Showed that G’ has no copies p of H (with high probability). However, attaching H to G’ may make a copy in G. 2010‐2‐9 28 Uniqueness Proof(cont Proof(cont’)) Replace a few nodes from H -> nodes from G’ must have correct connections to H -> Unlikely 2010‐2‐9 29 Uniqueness Proof(cont Proof(cont’)) Replace many nodes from H ->> internal connections of nodes must be correct as well -> More Unlikely Analysis uses similar techniques, but is more complex! 2010‐2‐9 30 How to find the unique H in G? 2010‐2‐9 31 How to find the unique H in G? R Recovery Algorithm l h Brute force with pruning 2010‐2‐9 32 Recovery Algorithm 1. Start from root, pick all the nodes that have the degree g of x1, to be root’s children 2010‐2‐9 33 Recovery Algorithm 1. Start from root, pick all the nodes that have the degree g of x1, to be root’s children 2. Find neighbors of nodes in level 1, which have degree g of x2, to be their corresponding children 2010‐2‐9 34 Recovery Algorithm 1. Start from root, pick all the nodes that have the degree g of x1, to be root’s children 2. Find neighbors of nodes in level 1, which have degree g of x2, to be their corresponding children 3 Continue 3. C ti until til llevell of fk k. If there’s th ’ only l one path th from root, success; Otherwise, H is not unique. 2010‐2‐9 Try once more! 35 Recovery Runtime Maximum degree of H - O(log n) keeps number of feasible paths relatively y small since H is small Analysis more complex, but in expectation, t ti th the number b of f feasible paths: Th Thus, total t t l number b of f paths th is i only l slightly li htl superlinear superlinear, li and so we can find H efficiently ! 2010‐2‐9 36 Experimental Results - Simulated attack on LiveJournal – 4.4 million nodes,, 77 million m edges g - Constructed H with degrees randomly selected in [10, 20] or [20, [20 60], 60] while varying k 2010‐2‐9 37 Experimental Results -With 7 nodes, nodes d in [20, 60], successful over 90% 90%; compromises an average of 70 users -Recovery typically takes less than a second ; size of search tree about 90,000 2010‐2‐9 38 Comparison of Attack Walk-Based Attack Walk- Adds new nodes - Can compromise nodes at most - Simple to execute, difficult to detect Cut-Based Attack (Using Gomory CutGomory--Hu tree) - Adds new nodes – theoretical minimum for any attack tt k of f thi this style t l - Attacks nodes - Easy for curator to detect on real data - Both attacks work no matter how many y people p p execute them 2010‐2‐9 39 Conclusion - Doesn’t apply pp y to networks that are already y public p - Cannot rely on anonymization to ensure privacy in N t Networks k - Further directions 1) Design a general interactive method whereby researchers may y make queries q about the network 2) Non-interactive methods where the released data is perturbed in such a way that it is still useful to researchers, researchers but provably anonymized 2010‐2‐9 40 Thank you! Thank you! 2010‐2‐9 41