Achieving Anonymity via
Clustering
G. Aggarwal, T. Feder, K. Kenthapadi,
S. Khuller, R. Panigrahy, D. Thomas,
A. Zhu
Dilys Thomas PODS 2006
Talk outline
• k-Anonymity model
• Achieving Anonymity via Clustering
• r-Gather clustering
• Cellular clustering
• Future Work
Medical Records
Identifying: SSN, Name
Sensitive: Disease
SSN  Name   DOB       Race   Zip code  Disease
614  Sara   03/04/76  Cauc   94305     Flu
615  Joan   07/11/80  Cauc   94307     Cold
629  Kelly  05/09/55  Cauc   94301     Diabetes
710  Mike   11/23/62  Afr-A  94305     Flu
840  Carl   11/23/62  Afr-A  94059     Arthritis
780  Joe    01/07/50  Hisp   94042     Heart problem
619  Rob    04/08/43  Hisp   94042     Arthritis
De-identified Medical Records
Sensitive: Disease
DOB       Race   Zip code  Disease
03/04/76  Cauc   94305     Flu
07/11/80  Cauc   94307     Cold
05/09/55  Cauc   94301     Diabetes
11/23/62  Afr-A  94305     Flu
11/23/62  Afr-A  94059     Arthritis
01/07/50  Hisp   94042     Heart problem
04/08/43  Hisp   94042     Arthritis
k-Anonymity model
Sensitive: Disease
DOB       Race   Zip code  Disease
03/04/76  Cauc   94305     Flu
07/11/80  Cauc   94307     Cold
05/09/55  Cauc   94301     Diabetes
12/30/72  Afr-A  94305     Flu
11/23/62  Afr-A  94059     Arthritis
01/07/50  Hisp   94042     Heart problem
04/08/43  Hisp   94042     Arthritis
• Quasi-identifiers (DOB, Race, Zip code): approximate foreign keys
• Uniquely identify you!
k-Anonymity Model [Swe00]
• Suppress some entries of quasi-identifiers
– each modified row becomes identical to at least
k-1 other rows with respect to quasi-identifiers
• Individual records hidden in a crowd of size k
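The definition above can be checked mechanically. The following is a minimal Python sketch, not code from the talk: the function name `is_k_anonymous` and the dictionary-based row layout are assumptions made for illustration, and the rows are the quasi-identifier columns of the medical-records example used in this talk, before and after suppression.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every row shares its quasi-identifier values
    with at least k-1 other rows (every group has size >= k)."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(size >= k for size in groups.values())

QI = ("DOB", "Race", "Zip")

# De-identified records (quasi-identifiers only), before suppression.
raw = [
    {"DOB": "03/04/76", "Race": "Cauc", "Zip": "94305"},
    {"DOB": "07/11/80", "Race": "Cauc", "Zip": "94307"},
    {"DOB": "05/09/55", "Race": "Cauc", "Zip": "94301"},
    {"DOB": "11/23/62", "Race": "Afr-A", "Zip": "94305"},
    {"DOB": "11/23/62", "Race": "Afr-A", "Zip": "94059"},
    {"DOB": "01/07/50", "Race": "Hisp", "Zip": "94042"},
    {"DOB": "04/08/43", "Race": "Hisp", "Zip": "94042"},
]

# The same records after suppressing ("*") some entries.
suppressed = [
    {"DOB": "*", "Race": "Cauc", "Zip": "*"},
    {"DOB": "*", "Race": "Cauc", "Zip": "*"},
    {"DOB": "*", "Race": "Cauc", "Zip": "*"},
    {"DOB": "11/23/62", "Race": "Afr-A", "Zip": "*"},
    {"DOB": "11/23/62", "Race": "Afr-A", "Zip": "*"},
    {"DOB": "*", "Race": "Hisp", "Zip": "94042"},
    {"DOB": "*", "Race": "Hisp", "Zip": "94042"},
]

print(is_k_anonymous(raw, QI, 2))         # False: every row is unique
print(is_k_anonymous(suppressed, QI, 2))  # True: groups of size 3, 2, 2
```

Grouping by the tuple of quasi-identifier values keeps the check linear in the number of rows.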
2-Anonymized Table
DOB       Race   Zip code  Disease
*         Cauc   *         Flu
*         Cauc   *         Cold
*         Cauc   *         Diabetes
11/23/62  Afr-A  *         Flu
11/23/62  Afr-A  *         Arthritis
*         Hisp   94042     Heart problem
*         Hisp   94042     Arthritis
k-Anonymity Optimization
• Minimize the number of generalizations/suppressions needed to achieve k-Anonymity
• NP-hard to find the minimum number of suppressions/generalizations [MW04]
• O(k) approximation for k-anonymity [AFK+05]
• Ω(k) lower bound on approximation ratio with graph assumption
Original Table
Name    Age  Salary
Amy     25   50
Brian   27   60
Carol   29   100
David   35   110
Evelyn  39   120
2-Anonymity with Suppression
Name    Age  Salary
Amy     *    *
Brian   *    *
Carol   *    *
David   *    *
Evelyn  *    *
All attributes suppressed
2-Anonymity with Generalization
Name    Age    Salary
Amy     20-30  50-100
Brian   20-30  50-100
Carol   20-30  50-100
David   30-40  100-150
Evelyn  30-40  100-150
Generalization allows pre-specified ranges
2-Anonymity with Clustering
Name    Age      Salary
Amy     [25-29]  [50-100]
Brian   [25-29]  [50-100]
Carol   [25-29]  [50-100]
David   [35-39]  [110-120]
Evelyn  [35-39]  [110-120]
Cluster centers published:
– Cluster 1 (Amy, Brian, Carol): Age 27 = (25+27+29)/3, Salary 70 = (50+60+100)/3
– Cluster 2 (David, Evelyn): Age 37 = (35+39)/2, Salary 115 = (110+120)/2
Advantages of Clustering
• Clustering reduces the amount of distortion
introduced as compared to suppressions /
generalizations
• Clustering allows constant factor
approximation algorithms
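To make the first point concrete, the sketch below compares the published ranges for the Age/Salary example under the pre-specified generalization ranges and under the tighter cluster ranges. Using the total range width as a distortion measure (and adding age and salary widths together) is an assumption made here purely for illustration; the talk does not define a distortion metric on this slide. The names `summarize` and `total_width` are hypothetical.

```python
# Original (age, salary) rows from the Age/Salary example.
people = {"Amy": (25, 50), "Brian": (27, 60), "Carol": (29, 100),
          "David": (35, 110), "Evelyn": (39, 120)}

# Clusters used in the 2-anonymity-with-clustering example.
clusters = [["Amy", "Brian", "Carol"], ["David", "Evelyn"]]

def summarize(members):
    """Return (center, ranges): per-attribute mean and (min, max) of a cluster."""
    columns = list(zip(*(people[name] for name in members)))
    center = tuple(sum(col) / len(col) for col in columns)
    ranges = tuple((min(col), max(col)) for col in columns)
    return center, ranges

def total_width(ranges_per_row):
    """Sum of range widths over all rows and attributes (assumed distortion measure)."""
    return sum(hi - lo for ranges in ranges_per_row for lo, hi in ranges)

# Ranges that each person is published with under clustering.
cluster_ranges = []
for members in clusters:
    center, ranges = summarize(members)
    print(members, "center:", center, "ranges:", ranges)
    cluster_ranges += [ranges] * len(members)

# Pre-specified ranges from the generalization example.
general_ranges = [((20, 30), (50, 100))] * 3 + [((30, 40), (100, 150))] * 2

print(total_width(cluster_ranges))  # 190
print(total_width(general_ranges))  # 300
```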
Quasi-Identifiers form a Metric Space
• Convert quasi-identifiers into points in a metric
space
• Distance function, D, on points
– D(X,X)=0 Reflexive
– D(X,Y)=D(Y,X) Symmetric
– D(X,Z) <= D(X,Y) + D(Y,Z) Triangle Inequality
Metric Space
• Converting (gender, zip code, DOB) into points in a metric space is not easy.
• Define distance function on each attribute.
• E.g. on Zip code:
– D (Zip1,Zip2)= physical distance between
locations Zip1 and Zip2.
• Weight the attributes; the weighted sum of attribute distances gives a metric.
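A weighted sum of per-attribute distances might look like the sketch below. Everything concrete here is an assumption for illustration and not taken from the talk: the attribute distances (a 0/1 mismatch for gender, the absolute difference of numeric zip codes as a crude stand-in for physical distance, and the difference in days between dates of birth) and the weights are hypothetical.

```python
from datetime import date

def d_gender(a, b):
    """Discrete metric: 0 if equal, 1 otherwise."""
    return 0.0 if a == b else 1.0

def d_zip(a, b):
    """Crude stand-in for physical distance between zip codes (assumption)."""
    return float(abs(int(a) - int(b)))

def d_dob(a, b):
    """Absolute difference between two dates, in days."""
    return float(abs((a - b).days))

# Hypothetical weights; with positive weights, a weighted sum of metrics
# is again a metric.
WEIGHTS = {"gender": 10.0, "zip": 0.01, "dob": 0.001}
DISTANCES = {"gender": d_gender, "zip": d_zip, "dob": d_dob}

def distance(x, y):
    """Weighted sum of attribute distances between two records."""
    return sum(WEIGHTS[k] * DISTANCES[k](x[k], y[k]) for k in WEIGHTS)

x = {"gender": "F", "zip": "94305", "dob": date(1976, 3, 4)}
y = {"gender": "F", "zip": "94307", "dob": date(1980, 7, 11)}
z = {"gender": "M", "zip": "94042", "dob": date(1950, 1, 7)}

print(distance(x, y), distance(y, z), distance(x, z))
```

Each attribute distance is itself reflexive, symmetric, and obeys the triangle inequality, which is why the weighted sum inherits those properties.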
Clustering for Anonymity
• Cluster Quasi-identifiers so that each cluster has
at least r members for anonymity.
• Publish cluster centers for anonymity, with the number of points and radius of each cluster
• Tight clusters ⇒ usefulness of data for mining
• Large number of points per cluster ⇒ anonymity
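The published summary of a cluster can be sketched as follows. This is a minimal illustration under stated assumptions: it uses Euclidean distance on numeric points and takes the centroid as the cluster center, whereas the talk leaves the metric abstract. The helper name `publish` and the sample points are hypothetical.

```python
import math

def publish(cluster, r):
    """Return (center, number of points, radius) for one cluster.

    Raises ValueError if the cluster has fewer than r members,
    since such a cluster would not provide the required anonymity.
    """
    if len(cluster) < r:
        raise ValueError(f"cluster has {len(cluster)} points, need at least {r}")
    columns = list(zip(*cluster))
    center = tuple(sum(col) / len(col) for col in columns)
    # Radius: distance from the center to the farthest member.
    radius = max(math.dist(center, point) for point in cluster)
    return center, len(cluster), radius

cluster = [(25, 50), (27, 60), (29, 100)]  # (age, salary) points
center, count, radius = publish(cluster, r=2)
print(center, count, round(radius, 2))
```

Only the triple is released, so a smaller radius means the summary stays closer to the underlying records, while a larger count hides each individual among more points.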
Quasi-identifiers: Metric Space
• From here on, assume that a distance metric has already been defined on the quasi-identifiers
r-Gather Clustering
10 points, radius 5
50 points, radius 20
20 points, radius 10
• Minimize the maximum radius: 20
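The objective on this slide can be evaluated directly from per-cluster summaries. The sketch below assumes each cluster is given as a (number of points, radius) pair, as in the figure labels; the function name `r_gather_cost` is hypothetical.

```python
def r_gather_cost(clusters, r):
    """Maximum radius over all clusters, after checking that
    every cluster contains at least r points."""
    for size, radius in clusters:
        if size < r:
            raise ValueError(f"cluster with {size} points violates r={r}")
    return max(radius for _, radius in clusters)

# (number of points, radius) for the three clusters in the figure.
clusters = [(10, 5), (50, 20), (20, 10)]
print(r_gather_cost(clusters, r=10))  # 20
```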
Results
• 2-approximation for minimizing the maximum radius under the cluster-size constraint
• Matching lower bound of 2 for maximum-radius minimization
r-Gather Clustering
[Figure: three clusters, each labeled 2d]
Lower Bound: Reduction from 3-SAT
[Figure: reduction gadget for a clause C1 over variables X1 and X2, with points X1T, X1F, X2T, X2F, a clause point C1, and two groups of r-2 points]
• r-gather with radius 1 iff the formula is satisfiable; otherwise radius ≥ 2
Cellular Clustering
10 points, radius 5
50 points, radius 20
20 points, radius 10
Cellular Clustering Metric
10 points, radius 5
50 points, radius 20
20 points, radius 10
Cellular Clustering Metric: 10*5 + 20*10 + 50*20
= 50 + 200 + 1000 = 1250
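The arithmetic above is a sum of (number of points × radius) over clusters. A minimal sketch, assuming the same (number of points, radius) representation as the figure labels and a hypothetical function name `cellular_cost`:

```python
def cellular_cost(clusters):
    """Cellular clustering metric: each point pays the radius of its cluster."""
    return sum(size * radius for size, radius in clusters)

# (number of points, radius) for the three clusters in the figure.
clusters = [(10, 5), (20, 10), (50, 20)]
print(cellular_cost(clusters))  # 1250
```

Unlike the maximum-radius objective, this cost changes whenever any cluster tightens, not only the largest one.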
Cellular Clustering
• Primal dual 4-approximation algorithm for
cellular clustering
• Constant factor approximation to minimum
cluster size
– Each cluster has at least r points
Cellular Clustering: Linear Program
Minimize $\sum_c \left( \sum_i x_{ic} d_c + f_c y_c \right)$
Sum of cellular cost and facility cost
Subject to:
$\sum_c x_{ic} \ge 1$ (each point belongs to a cluster)
$x_{ic} \le y_c$ (a cluster must be opened for a point to belong to it)
$0 \le x_{ic} \le 1$ (points belong to clusters positively)
$0 \le y_c \le 1$ (clusters are opened positively)
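To see how the objective and constraints fit together, the sketch below checks a candidate solution against the constraints of this linear program and evaluates its cost. It is an illustration only, not the talk's algorithm: the function name `primal_cost`, the list-based layout, and the sample values of d_c and f_c are all hypothetical.

```python
def primal_cost(x, y, d, f):
    """Check the LP constraints for (x, y) and return the objective value.

    x[i][c]: extent to which point i belongs to cluster c
    y[c]:    extent to which cluster c is opened
    d[c]:    per-point (cellular) cost of cluster c
    f[c]:    facility cost of opening cluster c
    """
    for i, row in enumerate(x):
        if sum(row) < 1:
            raise ValueError(f"point {i} is not fully assigned to clusters")
        for c, x_ic in enumerate(row):
            if not 0 <= x_ic <= y[c] <= 1:
                raise ValueError(f"x[{i}][{c}] violates 0 <= x_ic <= y_c <= 1")
    cellular = sum(x_ic * d[c] for row in x for c, x_ic in enumerate(row))
    facility = sum(f_c * y_c for f_c, y_c in zip(f, y))
    return cellular + facility

# Hypothetical instance: 3 points, 2 candidate clusters.
d = [5, 10]
f = [4, 6]
x = [[1, 0], [1, 0], [0, 1]]  # points 0 and 1 in cluster 0, point 2 in cluster 1
y = [1, 1]                    # both clusters opened
print(primal_cost(x, y, d, f))  # 30
```

The printed cost is 5 + 5 + 10 for the points plus 4 + 6 for opening the two clusters.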
Dual Program
• Maximize $\sum_i \alpha_i$
• Subject to:
$\sum_i \beta_{ic} \le f_c$ (1)
$\alpha_i - \beta_{ic} \le d_c$ (2)
$\alpha_i \ge 0$
$\beta_{ic} \ge 0$
Overview of Algorithm: First grow $\alpha_i$ keeping $\beta_{ic} = 0$ till (2) becomes tight, then grow $\beta_{ic}$ at the same rate till (1) becomes tight.
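The dual constraints can be checked on a toy instance. The sketch below is an illustration under assumptions, not the talk's implementation: the function name `dual_feasible`, the variable names `alpha` and `beta` for the two families of dual variables, and the instance (two points, one candidate cluster with d_c = 3 and f_c = 4) are hypothetical. On this instance, growing the per-point variables from 0 makes (2) tight at value 3; growing both families at the same rate then makes (1) tight when each per-point-per-cluster variable reaches 2, leaving the per-point variables at 5.

```python
def dual_feasible(alpha, beta, d, f, tol=1e-9):
    """Check constraints (1), (2) and non-negativity of the dual program.

    alpha[i]:   dual variable of point i
    beta[i][c]: dual variable of the pair (point i, cluster c)
    """
    n, m = len(alpha), len(d)
    if any(a < -tol for a in alpha):
        return False
    for c in range(m):
        if any(beta[i][c] < -tol for i in range(n)):
            return False
        if sum(beta[i][c] for i in range(n)) > f[c] + tol:            # (1)
            return False
        if any(alpha[i] - beta[i][c] > d[c] + tol for i in range(n)):  # (2)
            return False
    return True

d, f = [3], [4]
alpha = [5, 5]       # values reached when (1) becomes tight
beta = [[2], [2]]

print(dual_feasible(alpha, beta, d, f))  # True
print(sum(alpha))                        # 10, the dual objective

# Primal cost of opening the single cluster and assigning both points to it.
primal = 2 * d[0] + f[0]
print(primal)                            # 10
```

Here the dual objective equals the primal cost, so on this toy instance the grown dual solution certifies that opening the cluster is optimal.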
Future Work
• Improve approximation ratio for Cellular
Clustering
• Improve running time. Presently r-gather is O(n^2), while cellular clustering is a linear program over n^2 variables.
– Linear or even sub-linear time algorithms
• Weaker guarantees on anonymity, e.g. at
least k/2 points per cluster instead of k.
THANK YOU!
QUESTIONS?