DE-ANONYMIZING SOCIAL NETWORKS

8 April 2015
Ilan Komargodski
Foundations of Privacy
Overview






Introduction
Related Work
Model and Definitions
The Attack
Experiments
Conclusion
Introduction
Why Publish Social Networks Data?
(Why even bother?)

Research - academic and government data mining:
• Sociology
• Economics
• Epidemiology
• Security



Advertising partners
Third-party applications
Aggregation projects
What is a Social Network Graph?
A social network graph consists of nodes (individuals),
edges (relations) and information associated with
each node and edge.
Attributes attached to nodes can be very sensitive (e.g. name, credit card number).
Note: the existence of an edge between two nodes
can also be very sensitive (e.g. “had a romantic relationship”).
Social Network Operator
Needs to publish network information that doesn’t
reveal any “personal data”.
[Figure: anonymization, with node names replaced by random identifiers such as W25LO, YCX38 and 3TCB2. Figure labels: Anonymization, Anonymity, Privacy]
Related Work
Backstrom et al. [2] present two active attacks on edge
privacy, assuming that the adversary is able to
modify the network before its release.
The attacker creates O(log N) new nodes, which suffice to re-identify O(log^2 N)
users.
Disadvantages:
• Limited to online social networks (and difficult even there)
• No control over incoming edges
• Easy to detect (the pattern stands out)
Model and Definitions
Data Release Model
• Select a subset of nodes Vsan ⊆ V and subsets Xsan ⊆ X, Ysan ⊆ Y
  of node and edge attributes to be released
• Compute the induced subgraph on Vsan
• Remove some edges and add fake edges (perturbation)

Summary: a sanitized subset of nodes and edges
with the corresponding attributes.
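The release steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph is undirected, attributes are omitted, and the function name `sanitize` and its parameters `keep_fraction`, `remove_prob` and `n_fake` are hypothetical, not taken from the slides.

```python
import random


def sanitize(nodes, edges, keep_fraction=0.5, remove_prob=0.1, n_fake=2, seed=0):
    """Return (V_san, E_san) for an undirected graph (hypothetical helper)."""
    rng = random.Random(seed)

    # Step 1: select a subset of nodes V_san of V.
    k = max(1, int(len(nodes) * keep_fraction))
    v_san = set(rng.sample(sorted(nodes), k))

    # Step 2: induced subgraph, i.e. edges with both endpoints in V_san.
    induced = {frozenset(e) for e in edges if set(e) <= v_san}

    # Step 3a: perturbation, drop each edge with probability remove_prob.
    kept = {e for e in sorted(induced, key=sorted) if rng.random() >= remove_prob}

    # Step 3b: perturbation, add up to n_fake edges absent from the induced subgraph.
    ordered = sorted(v_san)
    candidates = [
        frozenset((a, b))
        for i, a in enumerate(ordered)
        for b in ordered[i + 1:]
        if frozenset((a, b)) not in induced
    ]
    fake = set(rng.sample(candidates, min(n_fake, len(candidates))))

    return v_san, kept | fake


if __name__ == "__main__":
    path_edges = {(i, i + 1) for i in range(9)}
    v_san, e_san = sanitize(range(10), path_edges)
    print(sorted(v_san), [sorted(e) for e in e_san])
```

A real release would also carry the selected attribute subsets Xsan and Ysan along with the surviving nodes and edges.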
The Attacker Model


Many types of attackers (government, marketing,
spammers, etc.)
Usually have access to a different network Saux whose
membership partially overlaps with S. Saux:
• Might be extracted from S (automatic crawling, malicious third-party application)
• Aggregation projects
• Collusion with an operator of a different network
[Figure: possible overlaps between S and Saux]
Aux. Information Def.


Saux: a graph Gaux = (Vaux, Eaux) and a set of
probability distributions AuxX and AuxY, one for
each attribute of every node in Vaux and each
attribute of every edge in Eaux.
For example: P[X = “friendship”] = 0.8, P[X = “contact”] = 0.2
Seeds: the attacker possesses detailed information
about a very small number of members of S.

How?
Re-identification Algorithm Def.


Re-identification algorithm: a probabilistic
mapping µ: Vaux × Vsan → [0, 1], where µ(vaux, vsan) is
the probability that vaux is mapped to vsan.
Privacy breach: for a node vaux in Vaux, let
µREAL(vaux) = vsan be its true mapping. We say that the privacy of vsan is
breached w.r.t. adversary Adv and privacy
parameter δ if, for some private attribute X:
Adv[X, vaux, X[vaux]] - Aux[X, vaux, X[vaux]] > δ
Success of an Attack
Naive metric: the fraction of nodes re-identified.
This is a fairly meaningless metric, because it treats every node as equally important.
Better: weight each node in proportion to its importance in the
network.
E.g. degree centrality, where each node is
weighted in proportion to its degree.
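As a small illustration of the degree-weighted metric, the sketch below compares the correct mappings against the ground truth and weights each node by its degree. The function name `weighted_success` and the dictionary-based inputs are assumptions made for this example.

```python
def weighted_success(true_map, found_map, degree):
    """Degree-weighted fraction of correctly re-identified nodes.

    true_map:  {aux_node: true target node}
    found_map: {aux_node: target node chosen by the attack}
    degree:    {aux_node: degree used as the importance weight}
    """
    total = sum(degree[v] for v in true_map)
    if total == 0:
        return 0.0
    correct = sum(
        degree[v] for v, target in true_map.items() if found_map.get(v) == target
    )
    return correct / total


if __name__ == "__main__":
    true_map = {"a": 1, "b": 2, "c": 3}
    found_map = {"a": 1, "b": 9}          # "b" is wrong, "c" is missing
    degree = {"a": 6, "b": 3, "c": 1}
    print(weighted_success(true_map, found_map, degree))
```

In this toy input only one of three nodes is correct, yet that node carries 6 of the 10 units of degree, so the weighted score is higher than the plain fraction.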

De-Anonymization Algorithm
Seed Identification
Input:
1. A clique of k nodes known to be common to both graphs (TARGET and AUXILIARY).
2. The degree of each of these nodes (in AUXILIARY).
3. The number of common neighbors for each pair of these nodes (in AUXILIARY).
Algorithm:
Search TARGET for a unique k-clique that has:
1. matching degrees (up to some error factor)
2. matching common-neighbor counts
Outputs µs, a partial mapping.
Complexity:
Exponential in k. BUT:
1. If the degree is bounded by d, then the complexity is O(n d^(k-1)).
2. Heavily input dependent (a high running time => a large number of matches).
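The clique search can be illustrated with a brute-force backtracking sketch. It assumes an undirected target graph stored as `{node: set(neighbours)}` and an error factor `eps` applied as (1 ± eps); the names `within` and `find_seed_clique` are hypothetical, and this is not presented as the authors' implementation.

```python
def within(actual, expected, eps):
    """True if actual lies within a factor (1 ± eps) of expected."""
    return (1 - eps) * expected <= actual <= (1 + eps) * expected


def find_seed_clique(target, seed_degrees, seed_common, eps=0.1):
    """Search the target graph for a unique k-clique matching the seed data.

    target:        {node: set(neighbours)}, undirected (assumption)
    seed_degrees:  seed_degrees[i] is the auxiliary degree of seed i
    seed_common:   seed_common[(i, j)], i < j, is the auxiliary
                   common-neighbour count of seeds i and j
    Returns a tuple of target nodes (seed i -> result[i]) or None.
    """
    k = len(seed_degrees)
    matches = []

    def extend(chosen):
        i = len(chosen)
        if i == k:
            matches.append(tuple(chosen))
            return
        # After the first node, only its neighbours can complete a clique.
        pool = target if i == 0 else target[chosen[0]]
        for v in pool:
            if v in chosen or not within(len(target[v]), seed_degrees[i], eps):
                continue
            ok = all(
                v in target[u]
                and within(len(target[u] & target[v]), seed_common[(j, i)], eps)
                for j, u in enumerate(chosen)
            )
            if ok:
                chosen.append(v)
                extend(chosen)
                chosen.pop()

    extend([])
    # Accept only an unambiguous match.
    return matches[0] if len(matches) == 1 else None


if __name__ == "__main__":
    target = {
        "a": {"b", "c", "x1", "x2", "x3"},
        "b": {"a", "c", "y1"},
        "c": {"a", "b"},
        "x1": {"a"}, "x2": {"a"}, "x3": {"a"}, "y1": {"b"},
    }
    print(find_seed_clique(target, [5, 3, 2], {(0, 1): 1, (0, 2): 1, (1, 2): 1}))
```

Restricting candidates to neighbours of the first chosen node is what keeps the search near n·d^(k-1) when degrees are bounded by d, rather than n^k.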
Propagation Algorithm
Input: two graphs G1 = (V1, E1), G2 = (V2, E2) and a partial seed mapping
Output: a deterministic 1-1 mapping
Each iteration starts with the accumulated mapping,
picks an unmapped node u in V1 and computes a score
for each unmapped node v in V2.

Important: the algorithm finds new mappings using the
topological structure of the network and the
feedback from previous mappings.
Propagation Algorithm Cont.

Eccentricity: measures how much an item in a set X “stands
out” from the rest. Defined by:
    (max(X) - max2(X)) / σ(X)
where max and max2 are the highest and second-highest values in X and σ is the standard deviation.
Edge Directionality: mapping scores for nodes u and v are
computed separately for incoming and outgoing edges (and
then summed).
Node Degrees: the scoring above works in favor of high-degree
nodes => divide by the square root of the degree (as in cosine similarity).
Revisiting Nodes: as the algorithm progresses, the number of mapped
nodes increases and errors decrease, so already-mapped nodes are revisited.
Reverse Match: a match is kept only if it is also found in the reverse direction (with the two graphs swapped).
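A short worked example may help with the eccentricity definition. The sketch below assumes the population standard deviation (`statistics.pstdev`) and returns 0.0 when the measure is undefined (fewer than two scores, or zero spread); the slides do not specify either choice.

```python
import statistics


def eccentricity(scores):
    """(max - second max) / standard deviation, or 0.0 if undefined."""
    if len(scores) < 2:
        return 0.0
    top, second = sorted(scores, reverse=True)[:2]
    sigma = statistics.pstdev(scores)  # population std dev (assumption)
    return (top - second) / sigma if sigma > 0 else 0.0


if __name__ == "__main__":
    # One candidate clearly stands out: gap 4, std dev sqrt(3).
    print(eccentricity([5, 1, 1, 1]))
    # Two candidates tie for the best score: nothing stands out.
    print(eccentricity([3, 3, 1]))
```

A tie for first place always gives eccentricity 0, so such a candidate is rejected for any positive threshold.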




func eccentricity(items):
    return (max(items) - max2(items)) / std_dev(items)

func update_scores_by_direction(lgraph, rgraph, mapping, lnode, scores, direction):
    # direction is ' → ' (incoming edges of lnode) or ' ← ' (outgoing edges of lnode)
    for edge (lnbr direction lnode) in lgraph.edges:
        if lnbr not in mapping:
            continue
        rnbr = mapping[lnbr]
        for edge (rnbr direction rnode) in rgraph.edges:
            if rnode in mapping.image:
                continue
            # Node Degrees: normalize by the square root of the degree
            scores[rnode] += (1 / rnode.direction_degree) ^ 0.5

func update_scores(lgraph, rgraph, mapping, lnode):
    scores = [0 for rnode in rgraph.nodes]    # initialize scores
    # Edge Directionality: score incoming and outgoing edges separately, then sum
    update_scores_by_direction(lgraph, rgraph, mapping, lnode, scores, ' → ')    # in
    update_scores_by_direction(lgraph, rgraph, mapping, lnode, scores, ' ← ')    # out
    return scores

func find_match(lgraph, rgraph, mapping, node, theta):
    scores[node] = update_scores(lgraph, rgraph, mapping, node)
    # Eccentricity: reject the match if the best score does not stand out
    if eccentricity(scores[node]) < theta:
        return NULL
    return the node from rgraph.nodes that maximizes scores[node]

func propagation_step(lgraph, rgraph, mapping):
    # Revisiting Nodes: every node of lgraph is considered in every step
    for lnode in lgraph.nodes:
        rnode = find_match(lgraph, rgraph, mapping, lnode, theta)
        if rnode == NULL:
            continue
        # Reverse Match: accept only if rnode maps back to lnode
        reverse_match = find_match(rgraph, lgraph, invert(mapping), rnode, theta)
        if reverse_match == lnode:
            mapping[lnode] = rnode

while not convergence do:
    propagation_step(lgraph, rgraph, seed_mapping)
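For readers who want to experiment, here is a minimal runnable Python sketch of the propagation loop. It makes several simplifying assumptions that are not in the slides: graphs are given as sets of directed edges, already-mapped nodes are skipped rather than revisited, convergence means "no new pair was added in a full pass", and the class `Graph` and function `propagate` are hypothetical names.

```python
import statistics
from collections import defaultdict


class Graph:
    """Directed graph with successor and predecessor lookup."""

    def __init__(self, edges):
        self.succ, self.pred = defaultdict(set), defaultdict(set)
        for u, v in edges:
            self.succ[u].add(v)
            self.pred[v].add(u)
        self.nodes = sorted(set(self.succ) | set(self.pred))


def eccentricity(values):
    if len(values) < 2:
        return 0.0
    top, second = sorted(values, reverse=True)[:2]
    sigma = statistics.pstdev(values)  # population std dev (assumption)
    return (top - second) / sigma if sigma > 0 else 0.0


def match_scores(lg, rg, mapping, lnode):
    """Score each unmapped right node via lnode's already-mapped neighbours."""
    mapped_right = set(mapping.values())
    scores = {r: 0.0 for r in rg.nodes if r not in mapped_right}
    # First triple: incoming edges of lnode; second: outgoing edges.
    for l_nbrs, r_reach, r_deg in ((lg.pred, rg.succ, rg.pred),
                                   (lg.succ, rg.pred, rg.succ)):
        for lnbr in l_nbrs.get(lnode, ()):
            if lnbr not in mapping:
                continue
            for rnode in r_reach.get(mapping[lnbr], ()):
                if rnode in scores:
                    # Divide by the square root of the relevant degree.
                    scores[rnode] += len(r_deg[rnode]) ** -0.5
    return scores


def find_match(lg, rg, mapping, lnode, theta):
    scores = match_scores(lg, rg, mapping, lnode)
    if not scores or eccentricity(list(scores.values())) < theta:
        return None
    return max(scores, key=scores.get)


def propagate(lg, rg, seed_mapping, theta=0.5):
    mapping = dict(seed_mapping)
    changed = True
    while changed:  # stop when a full pass adds nothing
        changed = False
        for lnode in lg.nodes:
            if lnode in mapping:
                continue  # simplification: no revisiting
            rnode = find_match(lg, rg, mapping, lnode, theta)
            if rnode is None:
                continue
            inverse = {r: l for l, r in mapping.items()}
            # Reverse match: rnode must pick lnode when the graphs are swapped.
            if find_match(rg, lg, inverse, rnode, theta) == lnode:
                mapping[lnode] = rnode
                changed = True
    return mapping


if __name__ == "__main__":
    left = Graph({(1, 3), (2, 3), (1, 4), (3, 5)})
    right = Graph({("a", "c"), ("b", "c"), ("a", "d"), ("c", "e")})
    print(propagate(left, right, {1: "a", 2: "b"}))
```

In this sketch a lone remaining candidate is never accepted, because eccentricity is treated as 0 for a single score; a practical implementation would need to decide how to handle that case.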
Complexity



Without revisiting nodes and reverse matches:
O(|E1| d2), where d2 bounds the node degree in V2.
Revisiting, assuming that a node v is revisited only if the
number of already-mapped neighbors of v has increased:
O(|E1| d1 d2), where d1 bounds the node degree in V1.
Reverse mapping:
O((|E1| + |E2|) d1 d2)
Experiments
Type       Network  Relation  Nodes  Edges  Av. Deg  Crawled
Target     Twitter  Follow    224K   8.5M   27.7     2007
Auxiliary  Flickr   Contact   3.3M   53M    32.2     2007/8
In both networks the API exposes:
• A mandatory username
• An optional name
• An optional location
Ground Truth



Ground-truth mappings are based on an exact match in username or name, or on a
sufficiently high score δ(name, username, location).
RESULT: 27,000 mappings, called µ(G).
Seeds: 150 pairs selected at random from µ(G),
each of degree > 80 in Gaux.
[Figure: Seed Identification. Node overlap: 25%, edge overlap: 50%]
[Figure: Node Re-Identification, Effect of Noise. Node overlap: 25%, number of seeds: 50]
Experiment Results



• 30.8% of the mappings were re-identified correctly.
• 57% were not identified.
• 12.1% were identified incorrectly. Of these:
  • 41% were mapped to nodes at distance 1 from the true mapping
  • 55% were mapped to nodes with the same geographic location
  • 27% were completely erroneous
Conclusion


In reality, anonymized graphs are released with
some attributes => de-anonymization is even
easier!
Social networks grow:
• The overlap between social networks grows
• Auxiliary information becomes much richer
Anonymity is not sufficient for privacy
when dealing with social networks!
Questions and Discussion
Thank You!
Bibliography
1. Arvind Narayanan and Vitaly Shmatikov. De-anonymizing Social Networks.
2. Lars Backstrom, Cynthia Dwork and Jon M. Kleinberg. Wherefore art thou r3579x?: Anonymized social networks, hidden patterns, and structural steganography.