Achieving Anonymity via Clustering
G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas, A. Zhu
Dilys Thomas, PODS 2006

Talk outline
• k-Anonymity model
• Achieving Anonymity via Clustering
• r-Gather clustering
• Cellular clustering
• Future Work

Medical Records
(Identifying: SSN, Name. Sensitive: Disease.)

SSN  Name   DOB       Race   Zip code  Disease
614  Sara   03/04/76  Cauc   94305     Flu
615  Joan   07/11/80  Cauc   94307     Cold
629  Kelly  05/09/55  Cauc   94301     Diabetes
710  Mike   11/23/62  Afr-A  94305     Flu
840  Carl   11/23/62  Afr-A  94059     Arthritis
780  Joe    01/07/50  Hisp   94042     Heart problem
619  Rob    04/08/43  Hisp   94042     Arthritis

De-identified Medical Records
(Sensitive: Disease.)

DOB       Race   Zip code  Disease
03/04/76  Cauc   94305     Flu
07/11/80  Cauc   94307     Cold
05/09/55  Cauc   94301     Diabetes
11/23/62  Afr-A  94305     Flu
11/23/62  Afr-A  94059     Arthritis
01/07/50  Hisp   94042     Heart problem
04/08/43  Hisp   94042     Arthritis

k-Anonymity model
(Quasi-identifiers DOB, Race, Zip code: uniquely identify you! Sensitive: Disease.)

DOB       Race   Zip code  Disease
03/04/76  Cauc   94305     Flu
07/11/80  Cauc   94307     Cold
05/09/55  Cauc   94301     Diabetes
12/30/72  Afr-A  94305     Flu
11/23/62  Afr-A  94059     Arthritis
01/07/50  Hisp   94042     Heart problem
04/08/43  Hisp   94042     Arthritis

• Quasi-identifiers: approximate foreign keys

k-Anonymity Model [Swe00]
• Suppress some entries of quasi-identifiers
  – each modified row becomes identical to at least k-1 other rows with respect to quasi-identifiers
• Individual records hidden in a crowd of size k

2-Anonymized Table

DOB       Race   Zip code  Disease
*         Cauc   *         Flu
*         Cauc   *         Cold
*         Cauc   *         Diabetes
11/23/62  Afr-A  *         Flu
11/23/62  Afr-A  *         Arthritis
*         Hisp   94042     Heart problem
*         Hisp   94042     Arthritis

k-Anonymity Optimization
• Minimize the number of generalizations/suppressions needed to achieve k-Anonymity
• NP-hard to find the minimum number of suppressions/generalizations [MW04]
• O(k) approximation for k-anonymity [AFK+05]
• Ω(k) lower bound on approximation
  ratio with graph assumption

Original Table

Name    Age  Salary
Amy     25   50
Brian   27   60
Carol   29   100
David   35   110
Evelyn  39   120

2-Anonymity with Suppression

Name    Age  Salary
Amy     *    *
Brian   *    *
Carol   *    *
David   *    *
Evelyn  *    *

All attributes suppressed

2-Anonymity with Generalization

Name    Age    Salary
Amy     20-30  50-100
Brian   20-30  50-100
Carol   20-30  50-100
David   30-40  100-150
Evelyn  30-40  100-150

Generalization allows pre-specified ranges

2-Anonymity with Clustering

Name    Age      Salary
Amy     [25-29]  [50-100]
Brian   [25-29]  [50-100]
Carol   [25-29]  [50-100]
David   [35-39]  [110-120]
Evelyn  [35-39]  [110-120]

Cluster centers published:
• Amy, Brian, Carol: Age 27 = (25+27+29)/3, Salary 70 = (50+60+100)/3
• David, Evelyn: Age 37 = (35+39)/2, Salary 115 = (110+120)/2

Advantages of Clustering
• Clustering introduces less distortion than suppressions/generalizations
• Clustering allows constant-factor approximation algorithms

Quasi-Identifiers form a Metric Space
• Convert quasi-identifiers into points in a metric space
• Distance function, D, on points
  – D(X,X) = 0 (Reflexive)
  – D(X,Y) = D(Y,X) (Symmetric)
  – D(X,Z) <= D(X,Y) + D(Y,Z) (Triangle Inequality)

Metric Space
• Converting (gender, zip code, DOB) into points in a metric space is not easy.
• Define a distance function on each attribute.
• E.g. on Zip code:
  – D(Zip1, Zip2) = physical distance between locations Zip1 and Zip2.
• Weight the attributes; the weighted sum of attribute distances gives a metric.
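The weighted-sum construction from the Metric Space slide can be sketched on the Age/Salary values of the Original Table. This is only an illustration: the weights (1.0 for Age, 0.5 for Salary), the choice of absolute difference as the per-attribute distance, and the helper names `weighted_distance` and `is_metric` are assumptions, not taken from the talk.

```python
from itertools import product

# (Age, Salary) rows from the Original Table slide.
RECORDS = {
    "Amy": (25, 50),
    "Brian": (27, 60),
    "Carol": (29, 100),
    "David": (35, 110),
    "Evelyn": (39, 120),
}

# Hypothetical attribute weights (Age, Salary); chosen only for illustration.
WEIGHTS = (1.0, 0.5)


def weighted_distance(x, y, weights=WEIGHTS):
    """Weighted sum of per-attribute distances |x_j - y_j|."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, x, y))


def is_metric(points, dist):
    """Brute-force check of the three properties listed on the slide."""
    for x, y, z in product(points, repeat=3):
        if dist(x, x) != 0:                     # D(X,X) = 0
            return False
        if dist(x, y) != dist(y, x):            # symmetry
            return False
        if dist(x, z) > dist(x, y) + dist(y, z):  # triangle inequality
            return False
    return True


print(weighted_distance(RECORDS["Amy"], RECORDS["Brian"]))  # 7.0
print(is_metric(list(RECORDS.values()), weighted_distance))  # True
```

With non-negative weights and per-attribute distances that are themselves metrics, the weighted sum inherits symmetry and the triangle inequality, which is why the brute-force check passes here.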

Clustering for Anonymity
• Cluster quasi-identifiers so that each cluster has at least r members for anonymity.
• Publish cluster centers, together with the number of points and the radius of each cluster
• Tight clusters → usefulness of data for mining
• Large number of points per cluster → anonymity

Quasi-identifiers: Metric Space
• Assume further that the distance metric has already been defined on the quasi-identifiers

r-Gather Clustering
[Figure: three clusters: 10 points, radius 5; 50 points, radius 20; 20 points, radius 10]
• Minimize the maximum radius: 20

Results
• 2-approximation for minimizing the maximum radius under the cluster size constraint
• Matching lower bound of 2 for maximum radius minimization

r-Gather Clustering
[Figure: three elements each labeled 2d]

Lower Bound: Reduction from 3-SAT
[Figure: reduction gadget for a clause C1 over variables X1 and X2, with points X1T, X1F, X2T, X2F, C1, and two groups of r-2 points]
• r-gather with radius 1 iff the formula is satisfiable; else radius ≥ 2

Cellular Clustering Metric
[Figure: three clusters: 10 points, radius 5; 50 points, radius 20; 20 points, radius 10]
Cellular Clustering Metric: 10*5 + 20*10 + 50*20 = 50 + 200 + 1000 = 1250

Cellular Clustering
• Primal-dual 4-approximation algorithm for cellular clustering
• Constant-factor approximation under a minimum cluster size
  – Each cluster has at least r points

Cellular Clustering: Linear Program
Minimize $\sum_c \left( \sum_i x_{ic} d_c + f_c y_c \right)$
    Sum of cellular cost and facility cost
Subject to:
  $\sum_c x_{ic} \ge 1$
    Each point belongs to a cluster
  $x_{ic} \le y_c$
    A cluster must be opened for a point to belong to it
  $0 \le x_{ic} \le 1$
    Points belong to clusters positively
  $0 \le y_c \le 1$
    Clusters are opened positively

Dual Program
• Maximize $\sum_i \alpha_i$
• Subject to:
  $\sum_i \beta_{ic} \le f_c$  (1)
  $\alpha_i - \beta_{ic} \le d_c$  (2)
  $\alpha_i \ge 0$
  $\beta_{ic} \ge 0$

Overview of Algorithm: first grow $\alpha_i$, keeping $\beta_{ic} = 0$, until (2) becomes tight; then grow $\beta_{ic}$ at the same rate until (1) becomes tight.

Future Work
• Improve the approximation ratio for Cellular Clustering
• Improve running time. Presently r-gather is O(n^2), while cellular clustering is a linear program over n^2 variables.
  – Linear or even sub-linear time algorithms
• Weaker guarantees on anonymity, e.g. at least k/2 points per cluster instead of k.

THANK YOU! QUESTIONS?
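
As a recap of the two objectives discussed in the talk, the sketch below evaluates them on the cluster summary used on the r-Gather and Cellular Clustering slides (10 points with radius 5, 50 points with radius 20, 20 points with radius 10). The function names are hypothetical, and the cellular cost here is only the points-times-radius sum from the Cellular Clustering Metric slide; it leaves out the facility cost term that appears in the linear program.

```python
# (number of points, radius) per cluster, as listed on the slides.
CLUSTERS = [(10, 5), (50, 20), (20, 10)]


def max_radius(clusters):
    """r-gather objective: the largest cluster radius."""
    return max(radius for _, radius in clusters)


def cellular_cost(clusters):
    """Cellular clustering metric: sum of (points * radius) over clusters."""
    return sum(count * radius for count, radius in clusters)


def satisfies_min_size(clusters, r):
    """Anonymity constraint: every cluster holds at least r points."""
    return all(count >= r for count, _ in clusters)


print(max_radius(CLUSTERS))              # 20
print(cellular_cost(CLUSTERS))           # 1250
print(satisfies_min_size(CLUSTERS, 10))  # True
```

The two objectives weigh the same clustering differently: r-gather only looks at the worst cluster, while the cellular metric charges every point for the radius of the cluster it falls in.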