DE-ANONYMIZING SOCIAL NETWORKS
8 April 2015
Ilan Komargodski
Foundations of Privacy

Overview
- Introduction
- Related Work
- Model and Definitions
- The Attack
- Experiments
- Conclusion

Introduction

Why Publish Social Network Data? (Why even bother?)
- Research: academic and government data mining
  - Sociology
  - Economics
  - Epidemiology
  - Security
- Advertising partners
- Third-party applications
- Aggregation projects

What is a Social Network Graph?
- A social network graph consists of nodes (individuals), edges (relations), and information associated with each node and edge.
- Attributes attached to nodes can be very sensitive. Examples: name, credit card number.
- Note: the existence of an edge between two nodes can also be very sensitive. Example: had a romantic relationship.

Social Network Operator
- Needs to publish network information that doesn't reveal any "personal data".
- Anonymization; anonymity; privacy.
[Figure: nodes labeled with random identifiers such as W25LO, YCX38, 3TCB2]

Related Work
- Backstrom et al. [2] present two active attacks on edge privacy, assuming that the adversary is able to change the network before its release.
- The adversary creates O(log N) nodes that re-identify O(log^2 N) users.
- Disadvantages:
  - Limited to online social networks (and difficult even there)
  - No control over incoming edges
  - Easy to detect (the pattern stands out)

Models and Definitions

Data Release Model
- Select a subset of nodes Vsan ⊆ V and subsets Xsan ⊆ X, Ysan ⊆ Y of node and edge attributes to be released.
- Compute the induced subgraph on Vsan.
- Remove some edges and add fake edges (perturbation).
- Summary: a sanitized subset of nodes and edges with the corresponding attributes.

The Attacker Model
- Many types of attackers (government, marketing, spammers, etc.).
- Usually have access to a different network Saux whose membership partially overlaps with S:
  - Might be extracted from S (automatic crawling, malicious third-party application)
  - Aggregation projects
  - Collusion with an operator of a different network
[Figure: overlap between S and Saux]

Aux. Information
Def.
Saux: a graph Gaux = (Vaux, Eaux) and a set of probability distributions AuxX and AuxY, one for each attribute of every node in Vaux and each attribute of every edge in Eaux.
- For example: P[X = "friendship"] = 0.8, P[X = "contact"] = 0.2
- Seeds: the attacker possesses detailed information about a negligible number of members of S. How?

Re-identification Algorithm
- Def. Re-identification algorithm: a probabilistic mapping µ: Vsan × Vaux → [0, 1], where µ(vaux, vsan) is the probability that vaux is mapped to vsan.
- Privacy breach: for a node vaux in Vaux, let µREAL(vaux) = vsan. We say that the privacy of vsan is breached w.r.t. adversary Adv and privacy parameter δ if, for some private attribute X:
    Adv[X, vaux, X[vaux]] - Aux[X, vaux, X[vaux]] > δ

Success of an Attack
- Fraction of nodes re-identified: a fairly meaningless metric on its own.
- Better: weight each node in proportion to its importance in the network, e.g. degree centrality, where each node is weighted in proportion to its degree.

De-Anonymization Algorithm

Seed Identification
[Figure: TARGET and AUXILIARY graphs]
- Input:
  1. A clique of k nodes known to be common to both graphs.
  2. The degree of each of these nodes (in AUXILIARY).
  3. The number of common neighbors for each pair of these nodes (in AUXILIARY).
- Algorithm: search TARGET for a unique k-clique that has:
  1. matching degrees (within some error factor)
  2. matching common-neighbor counts
- Output: µs, a partial mapping.
- Complexity: exponential in k. BUT:
  1. If the degree is bounded by d, then the complexity is O(n d^(k-1)).
  2. Heavily input dependent (high running time => large number of matches).

Propagation Algorithm
- Input: two graphs and a partial seed mapping.
- Output: a deterministic 1-1 mapping.
- Each iteration starts with the accumulated mapping.
- It picks an unmapped node u in V1 and computes a score for each unmapped node v in V2; the score counts the neighbors of u that are already mapped to neighbors of v.
- Important: the algorithm finds new mappings using the topological structure of the network and the feedback from previous mappings.
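
The basic scoring step of the propagation algorithm can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes undirected graphs stored as dicts from node to neighbor set, and the function name neighbor_match_scores and the toy graphs are hypothetical. It omits the refinements (edge direction, degree normalization, eccentricity, reverse matching).

```python
from collections import defaultdict


def neighbor_match_scores(g1, g2, mapping, u):
    """Score unmapped nodes of g2 as candidate matches for node u of g1.

    g1, g2  : dict mapping each node to the set of its neighbors (assumed undirected)
    mapping : dict from already-mapped g1 nodes to g2 nodes
    Returns a dict {v: number of u's mapped neighbors whose image is adjacent to v}.
    """
    already_used = set(mapping.values())
    scores = defaultdict(int)
    for nbr in g1[u]:
        if nbr not in mapping:
            continue  # this neighbor gives no evidence yet
        image = mapping[nbr]
        for v in g2[image]:
            if v in already_used:
                continue  # v is already the image of some other node
            scores[v] += 1
    return dict(scores)


# Toy example (hypothetical data): "u" is adjacent to the mapped nodes "b" and "c".
g1 = {"a": {"b", "c"}, "b": {"a", "u"}, "c": {"a", "u"}, "u": {"b", "c"}}
g2 = {1: {2, 3}, 2: {1, 4, 5}, 3: {1, 4}, 4: {2, 3}, 5: {2}}
seed = {"a": 1, "b": 2, "c": 3}
scores = neighbor_match_scores(g1, g2, seed, "u")
```

In this toy example node 4 collects evidence from both mapped neighbors of "u", while node 5 collects it from only one, so 4 is the stronger candidate.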
Heuristics used by the propagation algorithm:
- Eccentricity: measures how much an item in a set X "stands out" from the rest. Defined by:
    (max(X) - max2(X)) / σ(X)
  where max2(X) is the second-largest value in X and σ(X) is the standard deviation.
- Edge Directionality: mapping scores for nodes u and v are computed separately for incoming and outgoing edges (and then summed).
- Node Degrees: the raw score favors high-degree nodes => divide each contribution by the square root of the node's degree (cf. cosine similarity).
- Revisiting Nodes: as the algorithm progresses, the number of mapped nodes increases and errors decrease.
- Reverse Match: every match is checked in both directions.

Pseudocode:

    func eccentricity(items):
        return (max(items) - max2(items)) / std_dev(items)

    # direction is '→' or '←'
    func update_scores_by_direction(lgraph, rgraph, mapping, lnode, scores, direction):
        for edge (lnbr direction lnode) in lgraph.edges:
            if lnbr not in mapping: continue
            rnbr = mapping[lnbr]
            for edge (rnbr direction rnode) in rgraph.edges:
                if rnode in mapping.image: continue
                scores[rnode] += (1 / rnode.direction_degree) ^ 0.5    # Node Degrees

    # Edge Directionality
    func update_scores(lgraph, rgraph, mapping, lnode):
        scores = [0 for rnode in rgraph.nodes]    # initialize scores
        update_scores_by_direction(lgraph, rgraph, mapping, lnode, scores, '→')    # in
        update_scores_by_direction(lgraph, rgraph, mapping, lnode, scores, '←')    # out
        return scores

    func find_match(lgraph, rgraph, mapping, node, theta):
        scores = update_scores(lgraph, rgraph, mapping, node)
        if eccentricity(scores) < theta: return NULL    # Eccentricity
        return the node from rgraph.nodes that maximizes scores

    func propagation_step(lgraph, rgraph, mapping):
        for lnode in lgraph.nodes:    # Revisiting Nodes
            rnode = find_match(lgraph, rgraph, mapping, lnode, theta)
            if rnode == NULL: continue
            # Reverse Match
            reverse_match = find_match(rgraph, lgraph, invert(mapping), rnode, theta)
            if reverse_match == lnode:
                mapping[lnode] = rnode

    mapping = seed_mapping
    while not convergence do:
        propagation_step(lgraph, rgraph, mapping)

Complexity
- Without revisiting nodes and reverse matches: O(|E1| d2)
- Revisiting: assuming that a node v is revisited
  only if the number of already-mapped neighbors of v has increased: O(|E1| d1 d2)
- Reverse mapping: O((|E1| + |E2|) d1 d2)
  (d1 and d2 bound the node degrees in the two graphs.)

Experiments

Type       Network  Relation  Nodes  Edges  Av. Deg  Crawled
Target     Twitter  Follow    224K   8.5M   27.7     2007
Auxiliary  Flickr   Contact   3.3M   53M    32.2     2007/8

In both networks the API exposes:
- Mandatory username
- Optional name
- Optional location

Ground Truth
- Based on an exact match in username/name, or a sufficiently high score δ(name, username, location).
- RESULT: 27,000 mappings, called µ(G).
- The seed is 150 pairs of randomly selected mappings from µ(G), each of degree > 80 in Gaux.

[Figure: Seed Identification. Node overlap: 25%; edge overlap: 50%]
[Figure: Node Re-Identification, Effect of Noise. Node overlap: 25%; #seeds: 50]

Experiment Results
- 30.8% of the mappings were re-identified correctly.
- 57% were not identified.
- 12.1% were identified incorrectly. Of these:
  - 41% were mapped to nodes at distance 1 from the true mapping
  - 55% were mapped to nodes with the same geographic location
  - 27% are completely erroneous

Conclusion
- In reality, anonymized graphs are released with some attributes => de-anonymization is even easier!
- Social networks grow => overlap between social networks grows.
- Auxiliary info is much richer.
- Anonymity is not sufficient for privacy when dealing with social networks!

Questions and Discussion
Thank You!

Bibliography
1. Arvind Narayanan and Vitaly Shmatikov. De-anonymizing Social Networks.
2. Lars Backstrom, Cynthia Dwork and Jon M. Kleinberg. Wherefore art thou r3579x?: Anonymized social networks, hidden patterns, and structural steganography.
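
Backup sketch: the eccentricity heuristic from the Propagation Algorithm slides, (max - second max) / standard deviation, might be written in Python as below. This is an illustrative sketch under stated assumptions, not the authors' code: the slides do not say whether the standard deviation is the population or sample one (population is assumed here), the handling of degenerate cases (fewer than two candidates, zero spread) is a choice made for this example, and the name accept_best is hypothetical.

```python
import statistics


def eccentricity(scores):
    """How much the top score stands out: (max - second max) / std dev.

    Returns 0.0 when the measure is undefined (fewer than two scores or
    zero spread); that convention is an assumption of this sketch.
    """
    values = sorted(scores, reverse=True)
    if len(values) < 2:
        return 0.0
    spread = statistics.pstdev(values)  # assumption: population standard deviation
    if spread == 0:
        return 0.0
    return (values[0] - values[1]) / spread


def accept_best(candidate_scores, theta):
    """Return the top-scoring candidate, or None if it does not stand out.

    candidate_scores : dict {candidate node: score}
    theta            : eccentricity threshold below which the match is rejected
    """
    if eccentricity(candidate_scores.values()) < theta:
        return None
    return max(candidate_scores, key=candidate_scores.get)
```

For instance, with the hypothetical scores {"a": 10, "b": 2, "c": 1, "d": 1} the top candidate clearly stands out and is accepted for a moderate threshold, while three candidates with identical scores are rejected for any positive threshold.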