Wherefore Art Thou R3579X? Anonymized Social Networks, Hidden Patterns, and Structural Steganography
Lars Backstrom, Cynthia Dwork, Jon Kleinberg
2010‐2‐9
Anonymized Networks
1. For researchers, names are meaningless
2. Users also need privacy

Anonymized Networks (cont'd)
Release just a big, randomly labeled graph with millions of edges!
No other contextual data is released!
Attack Scenario
1. Attacker may add a few new nodes prior to release.
- Would like to add as few as possible
2. Attacker may add edges from these few nodes to any nodes in the network
- May link to any person whose "name" he knows
3. Goal of the attack is to discover whether two named users are connected, i.e. to learn their connections

Attack Structure
Before release of the network, the attacker finds some targeted users (named users)
[Figure: network with targeted users Sam, Bob, and Lily]

Attack Structure (cont'd)
Attacker
1. creates a few new nodes
2. adds edges among them and edges from the new nodes to the named nodes
[Figure: new subgraph H attached to Sam, Bob, and Lily]

Attack Structure (cont'd)
Network is released to public!
[Figure: released graph G containing H]

Attack Structure (cont'd)
Attacker locates the inserted H
[Figure: H located inside G]

Attack Structure (cont'd)
Follow links to identify named users
[Figure: Sam, Bob, and Lily identified in G]

Attack Structure (cont'd)
Follow links to identify named users
Attacker then determines which pairs are connected!
Done!
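
The last two steps of the attack can be sketched in a few lines of Python. This is only an illustration, assuming the attacker has already located the inserted nodes and that each inserted node was linked to exactly one named user. The function names `identify_and_probe` and `build_adjacency`, the adjacency-dict representation, and the toy graph are all assumptions made for the example.

```python
from itertools import combinations


def identify_and_probe(adj, inserted, names):
    """Map each name to its anonymized node, then list the connected pairs.

    adj      -- dict: node -> set of neighbours in the released graph G
    inserted -- located attacker nodes, in creation order (assumed known)
    names    -- names[i] is the user that inserted[i] was linked to
    """
    inserted_set = set(inserted)
    located = {}
    for x, name in zip(inserted, names):
        outside = adj[x] - inserted_set  # neighbours that are not attacker nodes
        if len(outside) != 1:
            raise ValueError(f"{x!r} does not point at exactly one user")
        located[name] = next(iter(outside))
    # A pair of named users is connected if their anonymized nodes are adjacent.
    connected = {
        (a, b)
        for a, b in combinations(names, 2)
        if located[b] in adj[located[a]]
    }
    return located, connected


def build_adjacency(edges):
    """Hypothetical helper: undirected edge list -> adjacency dict."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


# Toy released graph: users carry meaningless labels, h1..h3 are attacker nodes.
edges = [("h1", "h2"), ("h2", "h3"),
         ("h1", "u17"), ("h2", "u42"), ("h3", "u08"),
         ("u17", "u42"), ("u08", "u99")]
adj = build_adjacency(edges)
located, connected = identify_and_probe(
    adj, ["h1", "h2", "h3"], ["Sam", "Bob", "Lily"])
```

In this toy graph the sketch maps Sam, Bob, and Lily to `u17`, `u42`, and `u08`, and reports Sam and Bob as the only connected pair.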

How can the attacker locate the inserted H?

Walk-Based Attack
1. Attacker chooses k named users W = {w1, w2, ..., wk}
2. Creates k new nodes X = {x1, x2, ..., xk} (the subgraph H, of size k)
3. Creates edges (xi, wi)
4. Creates each edge (xi, xi+1), forming a Hamiltonian path, and each other edge (xi, xj) with probability 0.5
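
The four construction steps can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `build_attack_edges`, the node labels `x1..xk`, the fourth target name, and the use of a seeded `random.Random` are assumptions made for the example.

```python
import random


def build_attack_edges(targets, rng):
    """Return (new_nodes, edges) for the attacker's subgraph H.

    targets -- the k named users w1..wk (nodes already in the network)
    rng     -- a random.Random instance, so a run can be reproduced
    """
    k = len(targets)
    new_nodes = [f"x{i + 1}" for i in range(k)]  # assumed not to clash with user labels
    edges = set()
    # Step 3: one edge from each new node to its target.
    for x, w in zip(new_nodes, targets):
        edges.add((x, w))
    # Step 4, first half: the path x1 - x2 - ... - xk is always present.
    for i in range(k - 1):
        edges.add((new_nodes[i], new_nodes[i + 1]))
    # Step 4, second half: every other pair of new nodes gets an edge
    # with probability 0.5.
    for i in range(k):
        for j in range(i + 2, k):
            if rng.random() < 0.5:
                edges.add((new_nodes[i], new_nodes[j]))
    return new_nodes, edges


new_nodes, edges = build_attack_edges(
    ["Sam", "Bob", "Lily", "Ann"], random.Random(0))
```

The attacker has to remember both the order of the new nodes and which random edges were drawn, since that record is what is searched for after release.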

Is the attacker able to locate the inserted H now?
This might not work!
Why might this not work?
More than one match! No way to differentiate!
[Figure: G with several candidate copies of H]

Success Conditions
Uniqueness
- No S != X such that G[S] and G[X] = H are isomorphic
- H has no non-trivial automorphisms
Intuitively,
Isomorphism: without labeling, two graphs are actually the same
Automorphism: a graph has internal symmetry (relabeling its nodes preserves the graph's structure)

Success Conditions (cont'd)
Uniqueness
- No S != X such that G[S] and G[X] = H are isomorphic
- H has no non-trivial automorphisms
Recoverability
- Find H efficiently, given G
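
For a very small H, the automorphism condition can be checked by brute force. The sketch below tries every relabeling of the nodes and is only meant to make the definition concrete; the function name `has_nontrivial_automorphism` and the two example graphs are assumptions, and the k! running time makes it unusable beyond a handful of nodes.

```python
from itertools import permutations


def has_nontrivial_automorphism(n, edges):
    """True if some non-identity relabeling of nodes 0..n-1 preserves the edge set."""
    edge_set = {frozenset(e) for e in edges}
    identity = tuple(range(n))
    for perm in permutations(range(n)):
        if perm == identity:
            continue
        # Apply the relabeling to every edge and compare with the original set.
        mapped = {frozenset((perm[u], perm[v])) for u, v in edges}
        if mapped == edge_set:
            return True
    return False


# A 3-node path can be flipped end to end, so it has internal symmetry.
path3 = [(0, 1), (1, 2)]
# A 6-node graph with no symmetry: path 0-1-2-3-4 plus node 5 joined to 2 and 3.
asymmetric6 = [(0, 1), (1, 2), (2, 3), (3, 4), (2, 5), (3, 5)]

path_is_symmetric = has_nontrivial_automorphism(3, path3)
six_is_symmetric = has_nontrivial_automorphism(6, asymmetric6)
```

The path is symmetric and the 6-node graph is not, so only the second would let the attacker tell its nodes apart by structure alone.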

Uniqueness Proof
Intuition:
- Erdős showed that a random graph will not contain certain non-random subgraphs
- We need to show that a non-random graph will not have a certain random subgraph

Uniqueness Proof
Proof technique:
1. Count the number of size-k subgraphs in G' = G − H
2. Find the probability that each one is a match
3. Take a union bound to show that there is no match with high probability

Uniqueness Proof (cont'd)
1. Count the labeled subgraphs of size k in G'
2. Bound from above the probability that a particular labeled subgraph matches H
3. Bound the probability that at least one of these subgraphs is a match (use the union bound)
It goes to 0 exponentially quickly in k!
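
As a sketch of how the three steps might be written out, assume H has k nodes, its k − 1 path edges are fixed, and each remaining pair of its nodes is an edge independently with probability 1/2. The symbol n for the number of nodes of G' is notation introduced here, and the exact constants are an assumption of this sketch.

```latex
% Step 1: labeled subgraphs correspond to ordered k-tuples of distinct nodes of G'
\#\{\text{labeled size-}k\text{ subgraphs of } G'\} \;\le\; n^{k}

% Step 2: the \binom{k}{2} - (k-1) = \binom{k-1}{2} non-path pairs of H
% are independent fair coin flips, and all of them must agree with the candidate
\Pr[\text{a fixed labeled subgraph matches } H] \;\le\; 2^{-\binom{k-1}{2}}

% Step 3: union bound over all labeled subgraphs
\Pr[\text{some labeled subgraph of } G' \text{ matches } H]
  \;\le\; n^{k}\, 2^{-\binom{k-1}{2}}
  \;=\; 2^{\,k \log_2 n \;-\; (k-1)(k-2)/2}
```

Under these assumptions the exponent turns negative once (k − 1)(k − 2)/2 exceeds k log_2 n, which happens when k is a little more than 2 log_2 n, and from there it falls quadratically in k.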

Uniqueness Proof (cont'd)
Intuition:
1. If k is small, there is a high probability of a match in G'
2. If k is large, the probability is low
But we want k to be relatively small, so we have to find a balance here!
We choose k in terms of a small constant.
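
The trade-off can be made concrete with a few lines of arithmetic. The sketch below assumes a union bound of the form n^k · 2^(−C(k−1, 2)) on the probability of a spurious match, which is a reconstruction and not a formula taken from the slide; the function names are likewise hypothetical.

```python
import math


def log2_union_bound(n, k):
    """log2 of n**k * 2**(-C(k-1, 2)), the assumed bound on a spurious match."""
    return k * math.log2(n) - math.comb(k - 1, 2)


def smallest_safe_k(n):
    """Smallest k for which the assumed bound drops below 1."""
    k = 2
    while log2_union_bound(n, k) >= 0:
        k += 1
    return k


n = 4_400_000  # network size quoted in the experiments
safe_k = smallest_safe_k(n)
```

The value returned by `smallest_safe_k` comes from a worst-case bound, so it says nothing about how small k can be on a particular network.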

Uniqueness Proof (cont'd)
Showed that G' has no copies of H (with high probability).
However, attaching H to G' may make a copy in G.

Uniqueness Proof (cont'd)
Replace a few nodes from H
-> nodes from G' must have correct connections to H
-> Unlikely

Uniqueness Proof (cont'd)
Replace many nodes from H
-> internal connections of nodes must be correct as well
-> More unlikely
Analysis uses similar techniques, but is more complex!

How to find the unique H in G?
Recovery Algorithm: brute force with pruning

Recovery Algorithm
1. Start from the root; pick all the nodes that have the degree of x1 to be the root's children
2. Find neighbors of the nodes in level 1 that have the degree of x2, to be their corresponding children
3. Continue until level k. If there's only one path from the root, success; otherwise, H is not unique. Try once more!
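
The search tree described above can be sketched as a level-by-level path extension. This is a minimal illustration under stated assumptions: the function name `recover_h`, the adjacency-dict representation, and the toy graph are made up for the example, and the extra check on edges inside H goes beyond the degree test named on the slide.

```python
def recover_h(adj, degrees, h_edges):
    """Return every node sequence in G that is consistent with the planted H.

    adj     -- dict: node -> set of neighbours in the released graph G
    degrees -- degrees[i] is the degree in G that the attacker gave x_(i+1)
    h_edges -- set of frozenset({i, j}) index pairs that are edges inside H
    """
    k = len(degrees)
    # Level 1: every node with the degree of x1 is a child of the root.
    paths = [[v] for v in adj if len(adj[v]) == degrees[0]]
    for i in range(1, k):
        extended = []
        for path in paths:
            for v in adj[path[-1]]:  # candidates must continue the path
                if v in path or len(adj[v]) != degrees[i]:
                    continue  # prune: already used, or wrong degree
                # Prune (assumed extra test): edges back to earlier path
                # nodes must match the recorded structure of H exactly.
                if all((path[j] in adj[v]) == (frozenset((j, i)) in h_edges)
                       for j in range(i)):
                    extended.append(path + [v])
        paths = extended
    return paths


# Toy released graph: users a..d form a path, h1..h3 are the planted nodes.
edges = [("a", "b"), ("b", "c"), ("c", "d"),
         ("h1", "h2"), ("h2", "h3"),
         ("h1", "a"), ("h1", "b"), ("h1", "c"), ("h1", "d"),
         ("h2", "b"), ("h3", "c")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

matches = recover_h(adj, degrees=[5, 3, 2],
                    h_edges={frozenset((0, 1)), frozenset((1, 2))})
```

Exactly one surviving path means success; zero or several means H cannot be pinned down in this graph.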

Recovery Runtime
Maximum degree of H is O(log n), which keeps the number of feasible paths relatively small since H is small
The analysis is more complex, but in expectation the total number of feasible paths is only slightly superlinear, so we can find H efficiently!

Experimental Results
- Simulated attack on LiveJournal: 4.4 million nodes, 77 million edges
- Constructed H with degrees randomly selected in [10, 20] or [20, 60], while varying k

Experimental Results
- With 7 nodes and degrees d in [20, 60], successful over 90% of the time; compromises an average of 70 users
- Recovery typically takes less than a second; size of search tree about 90,000

Comparison of Attacks
Walk-Based Attack
- Adds new nodes
- Bounded in the number of nodes it can compromise
- Simple to execute, difficult to detect
Cut-Based Attack (using Gomory-Hu tree)
- Adds new nodes: the theoretical minimum for any attack of this style
- Attacks nodes
- Easy for curator to detect on real data
Both attacks work no matter how many people execute them

Conclusion
- Doesn't apply to networks that are already public
- Cannot rely on anonymization to ensure privacy in networks
- Further directions
1) Design a general interactive method whereby researchers may make queries about the network
2) Non-interactive methods where the released data is perturbed in such a way that it is still useful to researchers, but provably anonymized

Thank you!