Resisting Structural Re-identification in Anonymized Social Networks
Michael Hay, Gerome Miklau, David Jensen, Don Towsley, Philipp Weis
University of Massachusetts Amherst
Session: Privacy & Authentication, VLDB 2008

Presented by Yongjin Kwon, 2011-01-21

Outline

Introduction
Adversary Knowledge Models
– Vertex Refinement Queries
– Subgraph Queries
– Hub Fingerprint Queries
Disclosure in Real Networks
Anonymity in Random Graphs
Graph Generalization for Anonymization
Conclusion

Introduction

Large amounts of data accumulate in many kinds of storage:
– supermarket transactions
– web server logs
– sensor data
– interactions in social networks (email, Twitter, ...)
Data owners publish such sensitive data to facilitate research: the goal is to reveal as much useful information as possible while preserving the privacy of the individuals in the data, since analysts may find valuable information in personal data.

Introduction (Cont'd)

"A Face Is Exposed for AOL Searcher No. 4417749" [New York Times, August 9, 2006]
– AOL collected 20 million web search queries and published them.
– Although the company naïvely anonymized the data, the identity of AOL user No. 4417749 was revealed: "Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends' medical ailments and loves her three dogs."
– Naïve anonymization carries serious privacy risks!

Introduction (Cont'd)

Potential privacy risks in network data:
– "Risk network structure in the early epidemic phase of HIV transmission in Colorado Springs" [Sexually Transmitted Infections, 2002]: a social network relating individuals by sexual contacts and shared drug injections, published in order to analyze how HIV spreads.
– Enron email dataset (http://www.cs.cmu.edu/~enron/): the email collection was released during an investigation; it remains virtually the only "real" public email collection, because privacy issues prevent others from being released.

Introduction (Cont'd)

Attacks on naïvely anonymized network data ["Wherefore Art Thou R3579X? Anonymized Social Networks, Hidden Patterns, and Structural Steganography", WWW 2007]:
– Active attack: the adversary chooses a set of targets, creates a small number of fake nodes with edges to these targets, and constructs a highly identifiable pattern of links among the new nodes. After the network is released, the adversary recognizes the pattern, recovers the fake nodes, and reveals the sensitive information of the targets.
– Passive attack: most vertices in real network data belong to a small, uniquely identifiable subgraph. An adversary may therefore collude with friends to identify additional nodes connected to the coalition's distinctive subgraph.

Introduction (Cont'd)

An adversary with some (structural) background knowledge may compromise the privacy of victims, so naïve anonymization is NOT sufficient. A new way of resisting attempts to re-identify individuals in published network data must be proposed. This requires thinking about:
– the types of adversary knowledge,
– a theoretical treatment of privacy risks, and
– a way of preserving privacy while maintaining high utility of the data.

Adversary Knowledge Models

The adversary's background knowledge is modeled as "correct" answers to a restricted knowledge query. The adversary uses the answer to refine the feasible candidate set for the target. Three knowledge models are considered:
– Vertex Refinement Queries
– Subgraph Queries
– Hub Fingerprint Queries

Vertex Refinement Queries

These queries report on the local structure of the graph around the target node:
– H1(x) returns the degree of x (e.g., the degree of B).
– H2(x) returns the multiset of the degrees of x's neighbors (e.g., the degrees of B's neighbors).
– In general, Hi(x) returns the multiset of Hi-1 values of x's neighbors.

Vertex Refinement Queries (Cont'd)

Relative equivalence: two nodes are equivalent relative to Hi if their Hi values are identical, and the equivalence class of the target is exactly its candidate set. In the example graph on nodes A through H, if the adversary knows the answer of H2, then node G can be quickly re-identified in the anonymized graph!

Subgraph Queries

Vertex refinement queries have two drawbacks:
– they always return complete, "correct" information, and
– the amount of information they reveal depends on the degree of the target node.
Subgraph queries instead assert the existence of a subgraph around the target node, and the amount of knowledge is measured by the number of edge facts in that subgraph (e.g., subgraphs around B with 3, 4, and 5 edge facts). The adversary is assumed to know some subgraph with a given number of edge facts around the target node.

Hub Fingerprint Queries

A hub is a node with high degree and high betweenness centrality; hubs are easily re-identified by an adversary. A hub fingerprint for a node is its vector of distances to the observable hubs (in the example graph on nodes A through H, one node acts as a hub).
– Closed world: the adversary's information is complete; "not reachable within distance 1" is itself knowledge.
– Open world: the adversary's information is incomplete; an unobserved connection may still exist.
A short sketch of all three query types follows.
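To make the three knowledge models concrete, here is a minimal sketch in Python (using networkx). It is not the authors' code: the toy graph, the helper names, and the use of 0 for "unobserved" entries in fingerprints are assumptions made for illustration.

```python
# Sketch of the three knowledge queries: vertex refinement (H_i),
# subgraph edge facts, and hub fingerprints.
from collections import Counter
import networkx as nx

def H(G, x, i):
    """Vertex refinement query H_i(x): H_0 is empty, and H_i is the
    multiset of H_{i-1} values over x's neighbors (so H_1 encodes the
    degree and H_2 the neighbors' degrees)."""
    if i == 0:
        return ()
    # A sorted tuple acts as a canonical encoding of the multiset.
    return tuple(sorted(H(G, z, i - 1) for z in G.neighbors(x)))

def candidate_set(G_anon, answer, i):
    """Nodes of the anonymized graph consistent with the adversary's
    (correct) answer; a singleton means the target is re-identified."""
    return {v for v in G_anon if H(G_anon, v, i) == answer}

def edge_facts(subgraph):
    """Subgraph knowledge is measured in edge facts."""
    return subgraph.number_of_edges()

def hub_fingerprint(G, x, hubs, radius):
    """Vector of distances from x to each observable hub, with 0 for
    'not within radius' (closed world) or 'unobserved' (open world)."""
    dist = nx.single_source_shortest_path_length(G, x, cutoff=radius)
    return tuple(dist.get(h, 0) for h in hubs)

def open_world_match(known, actual):
    """Open world: a 0 in the adversary's fingerprint is consistent with
    any actual distance; the closed world requires exact equality."""
    return all(k == a or k == 0 for k, a in zip(known, actual))

# Toy graph in the spirit of the slides' A..H example (edges made up).
G = nx.Graph([("A", "B"), ("B", "C"), ("B", "D"), ("D", "E"),
              ("E", "F"), ("F", "G"), ("G", "H"), ("D", "G")])
for i in (1, 2):
    sizes = Counter(H(G, v, i) for v in G)
    unique = sorted(v for v in G if sizes[H(G, v, i)] == 1)
    print(f"nodes uniquely re-identified by H{i}: {unique}")
```

On this toy graph no node is unique under H1 (every degree is shared), while H2 already singles out several nodes, mirroring the slide's point that G falls to an H2 adversary.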
Disclosure in Real Networks

Experiments measure the impact of external information on three network datasets:
– Hep-Th: a co-authorship graph, taken from the arXiv archive
– Enron: the "real" email dataset, collected by the CALO project
– Net-trace: an IP-level network trace collected at a major university
Each node is considered in turn as a target, and the candidate set for that target is computed; a smaller candidate set means a more vulnerable node. This characterizes how many nodes are protected and how many are re-identifiable.

Disclosure in Real Networks (Cont'd)

Vertex refinement queries. [Results figure: candidate-set sizes under Hi for the three datasets.]

Disclosure in Real Networks (Cont'd)

Subgraph queries, with two strategies for building the subgraphs an adversary might learn:
– sampled subgraphs
– degree subgraphs
[Results figure: candidate-set sizes as a function of the number of edge facts.]

Disclosure in Real Networks (Cont'd)

Hub fingerprint queries, with hubs chosen as the five highest-degree nodes (Enron) or the ten highest-degree nodes (Hep-Th, Net-trace). [Results figure: candidate-set sizes for hub fingerprints.]

Anonymity in Random Graphs

A theoretical treatment of privacy risk in random graphs: the Erdős-Rényi model (ER model) with n nodes and edge connection probability p, analyzed asymptotically for robustness against Hi knowledge attacks:
– Sparse ER graphs (p = c/n): robust against Hi knowledge attacks for any fixed i.
– Dense ER graphs (p = c log n / n): robust against H1 knowledge attacks, but vulnerable against H2.
– Super-dense ER graphs (p = 1/2): vulnerable; with high probability, nodes are uniquely identified by H2.

Anonymity in Random Graphs (Cont'd)

Anonymity against subgraph queries depends on the number of nodes in the largest clique (the clique number). If a subgraph query is small enough to fit inside the largest clique, it matches many nodes, so the target is not uniquely identified; the clique number is therefore a useful lower bound on the disclosure. The analysis also extends to random graphs with attributes attached to the nodes.

Graph Generalization for Anonymization

Generalize the naïvely anonymized graph: partition its nodes into supernodes and publish the supernodes together with the edge counts between and within them.
– The generalized graph leaves much uncertainty about the original, measured by the number of possible worlds consistent with it.
– Find the partitioning that maximizes the likelihood of the original graph while ensuring that every supernode contains at least k nodes.
– Simulated annealing is applied to search for this partitioning.

Graph Generalization for Anonymization (Cont'd)

How can the generalized graph be analyzed?
– Construct a synthetic graph by sampling a possible world consistent with the published supernode and edge-count information.
– Perform standard graph analysis on this synthetic graph.
A sketch of the generalization-and-sampling procedure follows.
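The pipeline can be sketched in a few lines of Python with networkx. This is not the authors' implementation: the greedy chunking below stands in for their simulated-annealing likelihood search, and the helper names (generalize, sample_world) are made up for illustration.

```python
# Sketch of graph generalization (supernodes of size >= k plus edge
# counts) and of sampling one consistent possible world.
import random
import networkx as nx

def generalize(G, k):
    """Partition nodes into supernodes of size >= k and record, for each
    (unordered) pair of supernodes, the number of edges between them.
    A real implementation would search partitions with simulated
    annealing to maximize likelihood; here we just chunk a random order."""
    nodes = list(G.nodes())
    random.shuffle(nodes)
    supernodes = [nodes[i:i + k] for i in range(0, len(nodes), k)]
    if len(supernodes) > 1 and len(supernodes[-1]) < k:
        supernodes[-2].extend(supernodes.pop())   # keep every block >= k
    block_of = {v: b for b, block in enumerate(supernodes) for v in block}
    edge_counts = {}
    for u, v in G.edges():
        key = tuple(sorted((block_of[u], block_of[v])))
        edge_counts[key] = edge_counts.get(key, 0) + 1
    return supernodes, edge_counts

def sample_world(supernodes, edge_counts):
    """Sample one synthetic graph (a 'possible world'): for each pair of
    supernodes, place the recorded number of edges uniformly at random
    among the allowed node pairs."""
    W = nx.Graph()
    for block in supernodes:
        W.add_nodes_from(block)
    for (a, b), m in edge_counts.items():
        if a == b:
            pairs = [(u, v) for i, u in enumerate(supernodes[a])
                            for v in supernodes[a][i + 1:]]
        else:
            pairs = [(u, v) for u in supernodes[a] for v in supernodes[b]]
        W.add_edges_from(random.sample(pairs, m))
    return W

# Usage: generalize a small random graph with k = 3, then sample a world.
G = nx.gnp_random_graph(30, 0.2, seed=1)
supernodes, counts = generalize(G, k=3)
W = sample_world(supernodes, counts)
print(W.number_of_nodes(), W.number_of_edges(), G.number_of_edges())
```

Every sampled world has the same supernode-level edge counts as the original, which is exactly the information the generalized graph publishes; the original graph itself is just one of the many possible worlds.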
Graph Generalization for Anonymization (Cont'd)

How does graph generalization affect network properties? Five properties are examined on the three real-world networks:
– degree distribution
– path length
– transitivity (clustering coefficient)
– network resilience
– infectiousness
The experiments are performed on 200 sampled synthetic graphs and repeated for each value of k. [Results figure: the five properties on the three networks as k varies.]

Conclusion

Three contributions:
– formalizing models of adversary knowledge,
– providing a starting point for the theoretical study of privacy risks in network data, and
– introducing a new anonymization technique based on generalizing the original graph.
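To close, a minimal sketch of the utility-evaluation loop described above, again in Python with networkx. It assumes the generalize and sample_world helpers from the previous sketch; only three of the five properties are computed (resilience and infectiousness require further modeling choices), and the 200-sample count mirrors the slides.

```python
# Sketch of the utility experiment: measure graph properties over many
# sampled possible worlds and compare them to the original graph.
import statistics
import networkx as nx

def avg_path_length(G):
    """Mean shortest-path length over the largest connected component
    (real traces are rarely fully connected)."""
    C = G.subgraph(max(nx.connected_components(G), key=len))
    return nx.average_shortest_path_length(C)

PROPERTIES = {
    "mean degree":  lambda G: 2 * G.number_of_edges() / G.number_of_nodes(),
    "path length":  avg_path_length,
    "transitivity": nx.transitivity,
}

def evaluate(G, k, samples=200):
    """Compare each property on the original graph with its distribution
    over `samples` synthetic possible worlds, for one value of k."""
    supernodes, counts = generalize(G, k)
    for name, f in PROPERTIES.items():
        values = [f(sample_world(supernodes, counts)) for _ in range(samples)]
        print(f"k={k} {name}: original={f(G):.3f} "
              f"synthetic={statistics.mean(values):.3f}"
              f"+/-{statistics.stdev(values):.3f}")

# Repeat for each value of k, as in the slides' experimental setup.
for k in (2, 5, 10):
    evaluate(nx.gnp_random_graph(100, 0.05, seed=7), k)
```

As k grows, the supernodes get coarser, the set of possible worlds gets larger, and the synthetic property estimates drift further from the original: the privacy/utility trade-off the experiments quantify.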