Resisting Structural Re-identification in Anonymized Social Networks
Michael Hay, Gerome Miklau, David Jensen, Don Towsley, Philipp Weis
University of Massachusetts Amherst
Session: Privacy & Authentication, VLDB 2008

Presented by Yongjin Kwon, 2011-01-21

Outline

Introduction
Adversary Knowledge Models
– Vertex Refinement Queries
– Subgraph Queries
– Hub Fingerprint Queries
Disclosure in Real Networks
Anonymity in Random Graphs
Graph Generalization for Anonymization
Conclusion

Introduction

Large amounts of data accumulate in many kinds of storage:
– supermarket transactions
– web server logs
– sensor data
– interactions in social networks (email, Twitter, ...)
Data owners publish such sensitive data to facilitate research: the goal is to reveal as much useful information as possible while preserving the privacy of the individuals in the data, since analysts may find valuable information in personal data.

Introduction (Cont'd)

"A Face Is Exposed for AOL Searcher No. 4417749" [New York Times, August 9, 2006]
– AOL collected 20 million web search queries and published them.
– Although the company naïvely anonymized the data, the identity of AOL user No. 4417749 was revealed: "Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends' medical ailments and loves her three dogs."
– Naïve anonymization carries serious privacy risks!

Introduction (Cont'd)

Potential privacy risks in network data:
– "Risk network structure in the early epidemic phase of HIV transmission in Colorado Springs" [Sexually Transmitted Infections, 2002]: a social network relating individuals by sexual contacts and shared drug injections, published in order to analyze how HIV spreads.
– Enron email dataset (http://www.cs.cmu.edu/~enron/): the email collection was released during an investigation; it remains virtually the only "real" public email collection, because privacy issues prevent others from being released.

Introduction (Cont'd)

Attacks on naïvely anonymized network data ["Wherefore Art Thou R3579X? Anonymized Social Networks, Hidden Patterns, and Structural Steganography", WWW 2007]:
– Active attack: the adversary chooses a set of targets, creates a small number of fake nodes with edges to these targets, and constructs a highly identifiable pattern of links among the new nodes. After the network is released, the adversary recognizes the pattern, recovers the fake nodes, and reveals the sensitive information of the targets.
– Passive attack: most vertices in real network data belong to a small, uniquely identifiable subgraph. An adversary may therefore collude with friends to identify additional nodes connected to the coalition's distinctive subgraph.

Introduction (Cont'd)

An adversary with some (structural) background knowledge may compromise the privacy of victims, so naïve anonymization is NOT sufficient. A new way of resisting attempts to re-identify individuals in published network data must be proposed. This requires thinking about:
– the types of adversary knowledge,
– a theoretical treatment of privacy risks, and
– a way of preserving privacy while maintaining high utility of the data.

Adversary Knowledge Models

The adversary's background knowledge is modeled as "correct" answers to a restricted knowledge query. The adversary uses the answer to refine the feasible candidate set for the target. Three knowledge models are considered:
– Vertex Refinement Queries
– Subgraph Queries
– Hub Fingerprint Queries

Vertex Refinement Queries

These queries report on the local structure of the graph around the target node:
– H1(x) returns the degree of x (e.g., the degree of B).
– H2(x) returns the multiset of the degrees of x's neighbors (e.g., the degrees of B's neighbors).
– In general, Hi(x) returns the multiset of Hi-1 values of x's neighbors.

Vertex Refinement Queries (Cont'd)

Relative equivalence: two nodes are equivalent relative to Hi if their Hi values are identical, and the equivalence class of the target is exactly its candidate set. In the example graph on nodes A through H, if the adversary knows the answer of H2, then node G can be quickly re-identified in the anonymized graph!

Subgraph Queries

Vertex refinement queries have two drawbacks:
– they always return complete, "correct" information, and
– the amount of information they reveal depends on the degree of the target node.
Subgraph queries instead assert the existence of a subgraph around the target node, and the amount of knowledge is measured by the number of edge facts in that subgraph (e.g., subgraphs around B with 3, 4, and 5 edge facts). The adversary is assumed to know some subgraph with a given number of edge facts around the target node.

Hub Fingerprint Queries

A hub is a node with high degree and high betweenness centrality; hubs are easily re-identified by an adversary. A hub fingerprint for a node is its vector of distances to the observable hubs (in the example graph on nodes A through H, one node acts as a hub).
– Closed world: the adversary's information is complete; "not reachable within distance 1" is itself knowledge.
– Open world: the adversary's information is incomplete; an unobserved connection may still exist.
A short sketch of all three query types follows.
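To make the three knowledge models concrete, here is a minimal sketch in Python (using networkx). It is not the authors' code: the toy graph, the helper names, and the use of 0 for "unobserved" entries in fingerprints are assumptions made for illustration.

```python
# Sketch of the three knowledge queries: vertex refinement (H_i),
# subgraph edge facts, and hub fingerprints.
from collections import Counter
import networkx as nx

def H(G, x, i):
    """Vertex refinement query H_i(x): H_0 is empty, and H_i is the
    multiset of H_{i-1} values over x's neighbors (so H_1 encodes the
    degree and H_2 the neighbors' degrees)."""
    if i == 0:
        return ()
    # A sorted tuple acts as a canonical encoding of the multiset.
    return tuple(sorted(H(G, z, i - 1) for z in G.neighbors(x)))

def candidate_set(G_anon, answer, i):
    """Nodes of the anonymized graph consistent with the adversary's
    (correct) answer; a singleton means the target is re-identified."""
    return {v for v in G_anon if H(G_anon, v, i) == answer}

def edge_facts(subgraph):
    """Subgraph knowledge is measured in edge facts."""
    return subgraph.number_of_edges()

def hub_fingerprint(G, x, hubs, radius):
    """Vector of distances from x to each observable hub, with 0 for
    'not within radius' (closed world) or 'unobserved' (open world)."""
    dist = nx.single_source_shortest_path_length(G, x, cutoff=radius)
    return tuple(dist.get(h, 0) for h in hubs)

def open_world_match(known, actual):
    """Open world: a 0 in the adversary's fingerprint is consistent with
    any actual distance; the closed world requires exact equality."""
    return all(k == a or k == 0 for k, a in zip(known, actual))

# Toy graph in the spirit of the slides' A..H example (edges made up).
G = nx.Graph([("A", "B"), ("B", "C"), ("B", "D"), ("D", "E"),
              ("E", "F"), ("F", "G"), ("G", "H"), ("D", "G")])
for i in (1, 2):
    sizes = Counter(H(G, v, i) for v in G)
    unique = sorted(v for v in G if sizes[H(G, v, i)] == 1)
    print(f"nodes uniquely re-identified by H{i}: {unique}")
```

On this toy graph no node is unique under H1 (every degree is shared), while H2 already singles out several nodes, mirroring the slide's point that G falls to an H2 adversary.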
Disclosure in Real Networks

Experiments measure the impact of external information on three network datasets:
– Hep-Th: a co-authorship graph, taken from the arXiv archive
– Enron: the "real" email dataset, collected by the CALO project
– Net-trace: an IP-level network trace collected at a major university
Each node is considered in turn as a target, and the candidate set for that target is computed; a smaller candidate set means a more vulnerable node. This characterizes how many nodes are protected and how many are re-identifiable.

Disclosure in Real Networks (Cont'd)

Vertex refinement queries. [Results figure: candidate-set sizes under Hi for the three datasets.]

Disclosure in Real Networks (Cont'd)

Subgraph queries, with two strategies for building the subgraphs an adversary might learn:
– sampled subgraphs
– degree subgraphs
[Results figure: candidate-set sizes as a function of the number of edge facts.]

Disclosure in Real Networks (Cont'd)

Hub fingerprint queries, with hubs chosen as the five highest-degree nodes (Enron) or the ten highest-degree nodes (Hep-Th, Net-trace). [Results figure: candidate-set sizes for hub fingerprints.]

Anonymity in Random Graphs

A theoretical treatment of privacy risk in random graphs: the Erdős-Rényi model (ER model) with n nodes and edge connection probability p, analyzed asymptotically for robustness against Hi knowledge attacks:
– Sparse ER graphs (p = c/n): robust against Hi knowledge attacks for any fixed i.
– Dense ER graphs (p = c log n / n): robust against H1 knowledge attacks, but vulnerable against H2.
– Super-dense ER graphs (p = 1/2): vulnerable; with high probability, nodes are uniquely identified by H2.

Anonymity in Random Graphs (Cont'd)

Anonymity against subgraph queries depends on the number of nodes in the largest clique (the clique number). If a subgraph query is small enough to fit inside the largest clique, it matches many nodes, so the target is not uniquely identified; the clique number is therefore a useful lower bound on the disclosure. The analysis also extends to random graphs with attributes attached to the nodes.

Graph Generalization for Anonymization

Generalize the naïvely anonymized graph: partition its nodes into supernodes and publish the supernodes together with the edge counts between and within them.
– The generalized graph leaves much uncertainty about the original, measured by the number of possible worlds consistent with it.
– Find the partitioning that maximizes the likelihood of the original graph while ensuring that every supernode contains at least k nodes.
– Simulated annealing is applied to search for this partitioning.

Graph Generalization for Anonymization (Cont'd)

How can the generalized graph be analyzed?
– Construct a synthetic graph by sampling a possible world consistent with the published supernode and edge-count information.
– Perform standard graph analysis on this synthetic graph.
A sketch of the generalization-and-sampling procedure follows.
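The pipeline can be sketched in a few lines of Python with networkx. This is not the authors' implementation: the greedy chunking below stands in for their simulated-annealing likelihood search, and the helper names (generalize, sample_world) are made up for illustration.

```python
# Sketch of graph generalization (supernodes of size >= k plus edge
# counts) and of sampling one consistent possible world.
import random
import networkx as nx

def generalize(G, k):
    """Partition nodes into supernodes of size >= k and record, for each
    (unordered) pair of supernodes, the number of edges between them.
    A real implementation would search partitions with simulated
    annealing to maximize likelihood; here we just chunk a random order."""
    nodes = list(G.nodes())
    random.shuffle(nodes)
    supernodes = [nodes[i:i + k] for i in range(0, len(nodes), k)]
    if len(supernodes) > 1 and len(supernodes[-1]) < k:
        supernodes[-2].extend(supernodes.pop())   # keep every block >= k
    block_of = {v: b for b, block in enumerate(supernodes) for v in block}
    edge_counts = {}
    for u, v in G.edges():
        key = tuple(sorted((block_of[u], block_of[v])))
        edge_counts[key] = edge_counts.get(key, 0) + 1
    return supernodes, edge_counts

def sample_world(supernodes, edge_counts):
    """Sample one synthetic graph (a 'possible world'): for each pair of
    supernodes, place the recorded number of edges uniformly at random
    among the allowed node pairs."""
    W = nx.Graph()
    for block in supernodes:
        W.add_nodes_from(block)
    for (a, b), m in edge_counts.items():
        if a == b:
            pairs = [(u, v) for i, u in enumerate(supernodes[a])
                            for v in supernodes[a][i + 1:]]
        else:
            pairs = [(u, v) for u in supernodes[a] for v in supernodes[b]]
        W.add_edges_from(random.sample(pairs, m))
    return W

# Usage: generalize a small random graph with k = 3, then sample a world.
G = nx.gnp_random_graph(30, 0.2, seed=1)
supernodes, counts = generalize(G, k=3)
W = sample_world(supernodes, counts)
print(W.number_of_nodes(), W.number_of_edges(), G.number_of_edges())
```

Every sampled world has the same supernode-level edge counts as the original, which is exactly the information the generalized graph publishes; the original graph itself is just one of the many possible worlds.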
Graph Generalization for Anonymization (Cont'd)

How does graph generalization affect network properties? Five properties are examined on the three real-world networks:
– degree distribution
– path length
– transitivity (clustering coefficient)
– network resilience
– infectiousness
The experiments are performed on 200 sampled synthetic graphs and repeated for each value of k. [Results figure: the five properties on the three networks as k varies.]

Conclusion

Three contributions:
– formalizing models of adversary knowledge,
– providing a starting point for the theoretical study of privacy risks in network data, and
– introducing a new anonymization technique based on generalizing the original graph.
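To close, a minimal sketch of the utility-evaluation loop described above, again in Python with networkx. It assumes the generalize and sample_world helpers from the previous sketch; only three of the five properties are computed (resilience and infectiousness require further modeling choices), and the 200-sample count mirrors the slides.

```python
# Sketch of the utility experiment: measure graph properties over many
# sampled possible worlds and compare them to the original graph.
import statistics
import networkx as nx

def avg_path_length(G):
    """Mean shortest-path length over the largest connected component
    (real traces are rarely fully connected)."""
    C = G.subgraph(max(nx.connected_components(G), key=len))
    return nx.average_shortest_path_length(C)

PROPERTIES = {
    "mean degree":  lambda G: 2 * G.number_of_edges() / G.number_of_nodes(),
    "path length":  avg_path_length,
    "transitivity": nx.transitivity,
}

def evaluate(G, k, samples=200):
    """Compare each property on the original graph with its distribution
    over `samples` synthetic possible worlds, for one value of k."""
    supernodes, counts = generalize(G, k)
    for name, f in PROPERTIES.items():
        values = [f(sample_world(supernodes, counts)) for _ in range(samples)]
        print(f"k={k} {name}: original={f(G):.3f} "
              f"synthetic={statistics.mean(values):.3f}"
              f"+/-{statistics.stdev(values):.3f}")

# Repeat for each value of k, as in the slides' experimental setup.
for k in (2, 5, 10):
    evaluate(nx.gnp_random_graph(100, 0.05, seed=7), k)
```

As k grows, the supernodes get coarser, the set of possible worlds gets larger, and the synthetic property estimates drift further from the original: the privacy/utility trade-off the experiments quantify.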