Resisting Structural Re-identification in
Anonymized Social Networks
Michael Hay, Gerome Miklau, David Jensen, Don Towsley, Philipp Weis
University of Massachusetts Amherst
Session : Privacy & Authentication, VLDB 2008
2011-01-21
Presented by Yongjin Kwon
Outline
 Introduction
 Adversary Knowledge Models

Vertex Refinement Queries

Subgraph Queries

Hub Fingerprint Queries
 Disclosure in Real Networks
 Anonymity in Random Graphs
 Graph Generalization for Anonymization
 Conclusion
Copyright  2011 by CEBT
Introduction
 There is a large amount of data in various repositories.

Supermarket Transactions

Web Server Logs

Sensor Data

Interactions in Social Networks

Email, Twitter

…
 Data owners publish sensitive information to facilitate research.

Reveal as much important information as possible while
preserving the privacy of the individuals in the data.

In personal data, analysts may find valuable information.
Introduction (Cont’d)
 A Face Is Exposed for AOL Searcher No. 4417749 [New York
Times, August 9, 2006]

AOL collected 20 million Web search queries and published them.

Although the company naïvely anonymized the data, the identity
of AOL user “No. 4417749” was revealed: “Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her
friends’ medical ailments and loves her three dogs.”

Serious problem of privacy risks!
Introduction (Cont’d)
 Potential privacy risks in network data

Risk network structure in the early epidemic phase of HIV transmission in Colorado Springs [Sexually Trans. Infections, 2002]
– A social network, which represents a set of individuals related by sexual contacts and shared drug injections, is published in order to analyze how HIV spreads.

Enron Email Dataset (http://www.cs.cmu.edu/~enron/)
– The email collection was released for investigation.
– It is the only publicly available “real” email collection, because privacy issues keep other collections from being released.
Introduction (Cont’d)
 Attacks on naïvely anonymized network data

Wherefore Art Thou R3579X? Anonymized Social Networks,
Hidden Patterns, and Structural Steganography [WWW 2007]

Active Attack
– An adversary chooses a set of targets, creates a small number of fake nodes with edges to these targets, and constructs a highly identifiable pattern of links among the new nodes.
– After the network is released, the adversary can recognize the pattern and the fake nodes, and reveal the sensitive information of the targets.

Passive Attack
– Most vertices in network data belong to a small, uniquely identifiable subgraph.
– An adversary may collude with friends to identify additional nodes connected to a distinct subset of the coalition.
Introduction (Cont’d)
 An adversary may compromise privacy of some victims with
some (structural) background knowledge.

The naïve anonymization is NOT sufficient!

A new way of resisting attempts to re-identify individuals
in published network data must be proposed.
 Need to think of…

Types of adversary knowledge

Theoretical analysis of privacy risks

A way of preserving privacy while maintaining high utility of data
Adversary Knowledge Models
 The adversary’s background knowledge is modeled as “correct”
answers to a restricted knowledge query.

The adversary uses the query to refine the feasible candidate set.
 Three knowledge models

Vertex Refinement Queries

Subgraph Queries

Hub Fingerprint Queries
Vertex Refinement Queries
 These queries report on the local structure of the graph
around the “target” node.
[Figure: target node B, annotated with the degree of B and the degrees of the neighbors of B]
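The two kinds of answers named on this slide (the degree of the target, and the degrees of its neighbors) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy graph and the function names degree_query, neighbor_degrees_query and candidate_set are assumptions made for the example.

```python
# Hypothetical toy graph as an adjacency dict (not taken from the slides).
GRAPH = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}


def degree_query(graph, node):
    """Answer to: what is the degree of the target?"""
    return len(graph[node])


def neighbor_degrees_query(graph, node):
    """Answer to: what are the degrees of the target's neighbors?

    Returned as a sorted tuple so it behaves like a multiset.
    """
    return tuple(sorted(len(graph[nbr]) for nbr in graph[node]))


def candidate_set(graph, query, target):
    """All nodes whose answer to `query` equals the target's answer."""
    answer = query(graph, target)
    return {v for v in graph if query(graph, v) == answer}


if __name__ == "__main__":
    for node in sorted(GRAPH):
        print(node,
              sorted(candidate_set(GRAPH, degree_query, node)),
              sorted(candidate_set(GRAPH, neighbor_degrees_query, node)))
```

In this toy graph, nodes a and c stay indistinguishable under both queries because they play structurally identical roles, while b and d are singled out by their degree alone.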
Vertex Refinement Queries (Cont’d)
 Relative Equivalence

[Figure: example graph with nodes A, B, C, D, E, F, G, H]
If the adversary knows the answer to […], then G can be
quickly re-identified in the anonymized graph!
Subgraph Queries
 Two drawbacks of vertex refinement queries

Always return “correct” information.

Depend on the degree of the target node.
 These queries assert the existence of a subgraph around the
“target” node.
[Figure: three example subgraphs around target node B, with 3, 4, and 5 edge facts]
Assume that the adversary knows the number of edge facts
around the target node.
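A subgraph query can be read as "does this small pattern exist around the target?". The sketch below checks that by brute force, which is only practical for tiny patterns. The graph, the pattern names and the helper functions are hypothetical, and the slides do not say how the authors evaluate such queries.

```python
from itertools import permutations

# Hypothetical toy graph (adjacency dict), not from the slides.
GRAPH = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}

# Patterns are edge lists over placeholder names; "t" stands for the target.
TRIANGLE = [("t", "x"), ("t", "y"), ("x", "y")]   # 3 edge facts
TRIANGLE_PLUS = TRIANGLE + [("t", "z")]           # 4 edge facts


def matches_at(graph, node, pattern_edges, pattern_target="t"):
    """True if the pattern can be embedded with pattern_target mapped to node."""
    names = {u for edge in pattern_edges for u in edge} | {pattern_target}
    others = sorted(names - {pattern_target})
    pool = [v for v in graph if v != node]
    # Try every injective assignment of the remaining pattern nodes.
    for image in permutations(pool, len(others)):
        mapping = dict(zip(others, image))
        mapping[pattern_target] = node
        if all(mapping[b] in graph[mapping[a]] for a, b in pattern_edges):
            return True
    return False


def candidate_set(graph, pattern_edges):
    """Nodes around which the queried subgraph exists."""
    return {v for v in graph if matches_at(graph, v, pattern_edges)}


if __name__ == "__main__":
    print(sorted(candidate_set(GRAPH, TRIANGLE)))
    print(sorted(candidate_set(GRAPH, TRIANGLE_PLUS)))
```

Adding one more edge fact to the pattern shrinks the candidate set here from three nodes to one, which is the effect the edge-fact counts on this slide are meant to convey.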
Hub Fingerprint Queries
 A hub is a node with high degree and high betweenness
centrality.

Hubs are easily re-identified by an adversary.
 A hub fingerprint for a node is a vector of distances from
observable hub connections.
[Figure: example graph with nodes A, B, C, D, E, F, G, H and a marked hub]
Closed World : Not reachable within distance 1
Open World : Incomplete knowledge
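A hub fingerprint can be sketched as a breadth-first search from the target, cut off at a distance limit. The sketch assumes that an entry of 0 marks "not reachable within the limit", that targets are not hubs themselves, and that in the open world a 0 in the adversary's knowledge means "unknown" rather than "no connection". The graph and all names are hypothetical.

```python
from collections import deque

# Hypothetical toy graph: h1 and h2 play the role of hubs.
GRAPH = {
    "h1": {"x", "y"},
    "h2": {"y", "z"},
    "x": {"h1"},
    "y": {"h1", "h2"},
    "z": {"h2"},
}
HUBS = ("h1", "h2")


def distances_from(graph, source, limit):
    """BFS distances from source, exploring at most `limit` hops."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == limit:
            continue  # do not expand beyond the distance limit
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def hub_fingerprint(graph, node, hubs, limit):
    """Vector of distances to each hub; 0 means not reachable within limit."""
    dist = distances_from(graph, node, limit)
    return tuple(dist.get(h, 0) for h in hubs)


def candidates(graph, hubs, limit, known, closed_world):
    """Non-hub nodes consistent with the adversary's known fingerprint."""
    result = set()
    for v in graph:
        if v in hubs:
            continue
        actual = hub_fingerprint(graph, v, hubs, limit)
        if closed_world:
            ok = actual == known
        else:
            # Open world: a 0 in `known` is missing knowledge, so it matches anything.
            ok = all(k == 0 or k == a for k, a in zip(known, actual))
        if ok:
            result.add(v)
    return result
```

With limit 1 and known fingerprint (1, 0), the closed-world reading leaves only x, while the open-world reading also admits y, since the adversary may simply not know about y's link to the second hub.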
Disclosure in Real Networks
 Experiments for the impact of external information

Three network data sets
– Hep-Th : co-author graph, taken from the arXiv archive
– Enron : “real” email dataset, collected by the CALO Project
– Net-trace : IP-level network trace collected at a major university

Consider each node in turn as a target.

Compute the candidate set for the target.
– Smaller candidate set : more vulnerable!

Characterize how many nodes are protected and how many are
re-identifiable.
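The experimental loop described above (treat every node as a target, compute its candidate set, then summarize) can be sketched like this. The signature function stands in for whichever knowledge query is being tested. The toy graph, the function names and the size threshold are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical toy graph: a center c with leaves l1, l2, l3, and a tail t on l3.
GRAPH = {
    "c": {"l1", "l2", "l3"},
    "l1": {"c"},
    "l2": {"c"},
    "l3": {"c", "t"},
    "t": {"l3"},
}


def neighbor_degree_signature(graph, node):
    """Example knowledge query: sorted degrees of the node's neighbors."""
    return tuple(sorted(len(graph[n]) for n in graph[node]))


def candidate_set_sizes(graph, signature):
    """For every node taken as the target, the size of its candidate set."""
    counts = Counter(signature(graph, v) for v in graph)
    return {v: counts[signature(graph, v)] for v in graph}


def fraction_at_risk(graph, signature, max_size=1):
    """Fraction of nodes whose candidate set has at most max_size members."""
    sizes = candidate_set_sizes(graph, signature)
    return sum(1 for s in sizes.values() if s <= max_size) / len(graph)


if __name__ == "__main__":
    print(candidate_set_sizes(GRAPH, neighbor_degree_signature))
    print(fraction_at_risk(GRAPH, neighbor_degree_signature))
```

Nodes with a candidate set of size 1 are re-identified outright, and larger sets mean more protection, which is the split the last bullet asks for.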
Disclosure in Real Networks (Cont’d)
 Vertex Refinement Queries
Disclosure in Real Networks (Cont’d)
 Subgraph Queries

Two strategies to build subgraphs
– Sampled Subgraph
– Degree Subgraph
Disclosure in Real Networks (Cont’d)
 Hub Fingerprint Queries

Hub : five highest degree nodes (Enron), ten highest degree
nodes (Hep-Th, Net-trace)
Anonymity in Random Graphs
 Theoretical analysis of privacy risk using random graphs

Erdős-Rényi Model (ER Model) with n nodes and edge connection
probability p.
–
 Asymptotic analysis of robustness against […] knowledge attack

Sparse ER Graphs : robust against […] for any […]

Dense ER Graphs : robust against […], but vulnerable against […]

Super-dense ER Graphs : vulnerable against […]
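The ER model with n nodes and edge probability p is easy to simulate, which gives an empirical feel for the asymptotic statements. The sketch below generates such a graph and measures how many nodes are uniquely identified by a structural signature. The two signatures (degree, and degrees of neighbors) and every parameter value are assumptions chosen for illustration. They are not the regimes or queries analyzed on this slide.

```python
import random
from collections import Counter


def erdos_renyi(n, p, seed=None):
    """Undirected ER graph: each of the n*(n-1)/2 pairs is an edge with probability p."""
    rng = random.Random(seed)
    graph = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                graph[u].add(v)
                graph[v].add(u)
    return graph


def degree(graph, v):
    return len(graph[v])


def neighbor_degrees(graph, v):
    return tuple(sorted(len(graph[u]) for u in graph[v]))


def unique_fraction(graph, signature):
    """Fraction of nodes whose signature is shared with no other node."""
    counts = Counter(signature(graph, v) for v in graph)
    return sum(1 for v in graph if counts[signature(graph, v)] == 1) / len(graph)


if __name__ == "__main__":
    # Arbitrary illustrative settings, not values from the slides.
    for p in (0.01, 0.05, 0.5):
        g = erdos_renyi(200, p, seed=0)
        print(p, unique_fraction(g, degree), unique_fraction(g, neighbor_degrees))
```

Because the neighbor-degree signature determines the degree, it can never identify fewer nodes than the degree alone. How the gap between the two changes with p is what the simulation lets one observe.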
Anonymity in Random Graphs (Cont’d)
 Anonymity Against Subgraph Queries

Depends on the number of nodes in the largest clique

If […] for a subgraph query […], then […]

The clique number is a useful lower bound on the disclosure.
 Random Graphs with Attributes
Graph Generalization for Anonymization
 Generalize a naïvely-anonymized graph.

Much uncertainty! (measured by the number of possible worlds)

Find the partitioning that maximizes the likelihood
while ensuring that every supernode contains at least k nodes.

Apply the simulated annealing method to find the partitioning.
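The uncertainty of a partitioning can be made concrete by counting the graphs consistent with it. The sketch below assumes the generalized graph publishes only the supernode sizes and the number of edges within and between supernodes, and that the likelihood of a partitioning is the reciprocal of its possible-world count. The function names and the example are hypothetical, and the simulated annealing search itself is not shown.

```python
from itertools import combinations
from math import comb


def generalize(edges, partition):
    """Summarize a graph as supernode sizes plus within/between edge counts."""
    block_of = {v: i for i, block in enumerate(partition) for v in block}
    sizes = [len(block) for block in partition]
    counts = {}
    for u, v in edges:
        key = tuple(sorted((block_of[u], block_of[v])))
        counts[key] = counts.get(key, 0) + 1
    return sizes, counts


def possible_worlds(sizes, counts):
    """Number of graphs consistent with the published sizes and edge counts."""
    total = 1
    for i, size in enumerate(sizes):
        # Ways to place the internal edges among the pairs inside supernode i.
        total *= comb(size * (size - 1) // 2, counts.get((i, i), 0))
    for i, j in combinations(range(len(sizes)), 2):
        # Ways to place the edges between supernodes i and j.
        total *= comb(sizes[i] * sizes[j], counts.get((i, j), 0))
    return total


def satisfies_min_size(partition, min_size):
    """Size constraint on supernodes (min_size plays the role of k)."""
    return all(len(block) >= min_size for block in partition)


if __name__ == "__main__":
    edges = [(1, 2), (2, 3), (3, 4)]  # a path on four nodes
    for partition in ([{1}, {2}, {3}, {4}], [{1, 2}, {3, 4}], [{1, 2, 3, 4}]):
        sizes, counts = generalize(edges, partition)
        print(partition, possible_worlds(sizes, counts))
```

A search procedure such as simulated annealing would move nodes between supernodes, reject partitionings that violate the size constraint, and prefer those with fewer possible worlds.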
Graph Generalization for Anonymization (Cont’d)
 How to analyze the generalized graph?

Construct the synthetic graph using the tagged information.

Perform standard graph analysis on this synthetic graph.
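One way to build such a synthetic graph is to place edges at random so that they respect the published counts. The sketch assumes the "tagged information" is the supernode membership together with the number of edges inside each supernode and between each pair of supernodes. The function name and data layout are hypothetical.

```python
import random
from itertools import combinations, product


def sample_world(partition, counts, seed=None):
    """Draw one graph uniformly among those matching the edge counts.

    partition: list of sets of node ids (the supernodes).
    counts: dict mapping (i, j) with i <= j to the number of edges between
            supernode i and supernode j (i == j means internal edges).
    Raises ValueError if a count exceeds the number of available node pairs.
    """
    rng = random.Random(seed)
    blocks = [sorted(block) for block in partition]
    edges = set()
    for i, block in enumerate(blocks):
        inside = list(combinations(block, 2))
        edges.update(rng.sample(inside, counts.get((i, i), 0)))
    for i, j in combinations(range(len(blocks)), 2):
        across = list(product(blocks[i], blocks[j]))
        edges.update(rng.sample(across, counts.get((i, j), 0)))
    return edges


if __name__ == "__main__":
    partition = [{1, 2}, {3, 4}]
    counts = {(0, 0): 1, (0, 1): 2}
    print(sorted(sample_world(partition, counts, seed=0)))
```

Standard analyses can then be run on many such samples and averaged, so that the result does not depend on a single random draw.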
Graph Generalization for Anonymization (Cont’d)
 How does graph generalization affect network properties?

Examine five properties on the three real-world networks.
– Degree
– Path Length
– Transitivity (Clustering Coefficient)
– Network Resilience
– Infectiousness

Perform the experiments on the 200 synthetic graphs.

Repeat for each […].
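As an illustration of one of the five properties, transitivity can be computed as the fraction of connected triples that close into triangles. The slides do not say which definition the authors use. The sketch below uses the common global clustering coefficient, and the graphs and function name are assumptions.

```python
from itertools import combinations


def transitivity(graph):
    """Global clustering coefficient: closed triples / all connected triples."""
    closed = 0   # each triangle is counted once per corner, i.e. three times
    triples = 0  # paths of length two, counted at their center node
    for v, nbrs in graph.items():
        d = len(nbrs)
        triples += d * (d - 1) // 2
        closed += sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
    return closed / triples if triples else 0.0


if __name__ == "__main__":
    triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
    path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
    print(transitivity(triangle), transitivity(path))
```

Comparing this value on the original graph with its average over the sampled synthetic graphs shows how much of the clustering structure survives generalization.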
Graph Generalization for Anonymization (Cont’d)
Conclusion
 Three contributions

Formalize models of adversary knowledge.

Provide a starting point for the theoretical study of privacy risks in
network data.

Introduce a new anonymization technique by generalizing the
original graph.