Personalized Privacy Preservation
Xiaokui Xiao, Yufei Tao
City University of Hong Kong

Privacy preserving data publishing

Microdata
Name   Age  Sex  Zipcode  Disease
Andy   4    M    12000    gastric ulcer
Bill   5    M    14000    dyspepsia
Ken    6    M    18000    pneumonia
Nash   9    M    19000    bronchitis
Alice  12   F    22000    flu
Betty  19   F    24000    pneumonia
Linda  21   F    33000    gastritis
Jane   25   F    34000    gastritis
Sarah  28   F    37000    flu
Mary   56   F    58000    flu

• Purposes:
  – Allow researchers to effectively study the correlation between various attributes
  – Protect the privacy of every patient

A naïve solution
• Publish the microdata with the Name column removed (Age, Sex, Zipcode, Disease are kept as they are)
• It does not work. See next.

Inference attack
• Published table: the microdata without the Name column
• An external database (a voter registration list):

Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

• Quasi-identifier (QI) attributes: Age, Sex, Zipcode
• An adversary can join the two tables on the QI attributes and recover each patient's disease

Generalization
• Transform each QI value into a less specific form
• The price is information loss

A generalized table
Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 25]  F    [30001, 35000]  gastritis
[21, 25]  F    [30001, 35000]  gastritis
[26, 60]  F    [35001, 60000]  flu
[26, 60]  F    [35001, 60000]  flu

k-anonymity
• The generalized table above is 2-anonymous: it has 5 QI groups, each containing at least 2 tuples
• Age, Sex, Zipcode are the quasi-identifier (QI) attributes; Disease is the sensitive attribute

Drawback of k-anonymity
• What is the disease of Linda?
• In the external database, Linda (21, F, 33000) can only fall into the QI group with Age [21, 25] and Zipcode [30001, 35000]; both tuples in that group are gastritis, so the 2-anonymous table still reveals her disease

A better criterion: l-diversity
• Each QI-group
  – has at least l different sensitive values
  – even the most frequent sensitive value does not have a lot of tuples

A 2-diverse table
Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu
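
The two criteria above can be checked mechanically. The following is a minimal Python sketch, not the authors' implementation: the function names are illustrative, each row is assumed to be an (age, sex, zipcode, disease) tuple with the generalized QI values kept as strings, and l-diversity is tested only in its simplest "distinct values" form (the second condition on the most frequent value is not checked).

```python
from collections import defaultdict


def qi_groups(rows):
    """Collect the sensitive values of each QI group (rows sharing all QI values)."""
    groups = defaultdict(list)
    for age, sex, zipcode, disease in rows:
        groups[(age, sex, zipcode)].append(disease)
    return groups


def is_k_anonymous(rows, k):
    """True if every QI group contains at least k tuples."""
    return all(len(diseases) >= k for diseases in qi_groups(rows).values())


def is_distinct_l_diverse(rows, l):
    """True if every QI group has at least l distinct sensitive values."""
    return all(len(set(diseases)) >= l for diseases in qi_groups(rows).values())


# The first six rows are identical in both tables on the slides.
shared = [
    ("[1, 5]", "M", "[10001, 15000]", "gastric ulcer"),
    ("[1, 5]", "M", "[10001, 15000]", "dyspepsia"),
    ("[6, 10]", "M", "[15001, 20000]", "pneumonia"),
    ("[6, 10]", "M", "[15001, 20000]", "bronchitis"),
    ("[11, 20]", "F", "[20001, 25000]", "flu"),
    ("[11, 20]", "F", "[20001, 25000]", "pneumonia"),
]
two_anonymous = shared + [
    ("[21, 25]", "F", "[30001, 35000]", "gastritis"),
    ("[21, 25]", "F", "[30001, 35000]", "gastritis"),
    ("[26, 60]", "F", "[35001, 60000]", "flu"),
    ("[26, 60]", "F", "[35001, 60000]", "flu"),
]
two_diverse = shared + [
    ("[21, 60]", "F", "[30001, 60000]", d)
    for d in ("gastritis", "gastritis", "flu", "flu")
]

print(is_k_anonymous(two_anonymous, 2))         # True
print(is_distinct_l_diverse(two_anonymous, 2))  # False: Linda's group is all gastritis
print(is_distinct_l_diverse(two_diverse, 2))    # True
```

The 2-anonymous table fails the diversity check because two of its QI groups hold a single disease, which is exactly the weakness the Linda example points at.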

Motivation 1: Personalization
• Andy does not want anyone to know that he had a stomach problem
• Sarah does not mind at all if others find out that she had flu
• The 2-diverse table above treats both of them in the same way

Motivation 2: Non-primary case

Microdata
Name   Age  Sex  Zipcode  Disease
Andy   4    M    12000    gastric ulcer
Andy   4    M    12000    dyspepsia
Ken    6    M    18000    pneumonia
Nash   9    M    19000    bronchitis
Alice  12   F    22000    flu
Betty  19   F    24000    pneumonia
Linda  21   F    33000    gastritis
Jane   25   F    34000    gastritis
Sarah  28   F    37000    flu
Mary   56   F    58000    flu

Motivation 2: Non-primary case (cont.)

2-diverse table
Age       Sex  Zipcode         Disease
4         M    12000           gastric ulcer
4         M    12000           dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu

An external database
Name   Age  Sex  Zipcode
Andy   4    M    12000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

Motivation 3: SA generalization
• How many female patients are there with age above 30?
• Estimate from the generalized table: 4 ∙ (60 – 30 + 1) / (60 – 21 + 1) ≈ 3
• Real answer: 1
• The generalized table here is the 2-diverse table shown earlier, whose last QI group has Age [21, 60] and contains 4 tuples

Motivation 3: SA generalization (cont.)
• Generalization of the sensitive attribute is beneficial in this case

A better generalized table
Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 30]  F    [30001, 40000]  gastritis
[21, 30]  F    [30001, 40000]  gastritis
[21, 30]  F    [30001, 40000]  flu
56        F    58000           respiratory infection

Personalized anonymity
• We propose
  – a mechanism to capture personalized privacy requirements
  – criteria for measuring the degree of security provided by a generalized table
  – an algorithm for generating publishable tables

Guarding node

Taxonomy of the sensitive attribute:
any illness
  digestive system problem
    stomach disease
      gastric ulcer
      dyspepsia
      gastritis
  respiratory system problem
    respiratory infection
      flu
      pneumonia
      bronchitis

• Andy does not want anyone to know that he had a stomach problem
• He can specify “stomach disease” as the guarding node for his tuple

Name  Age  Sex  Zipcode  Disease        guarding node
Andy  4    M    12000    gastric ulcer  stomach disease

• The data publisher should prevent an adversary from associating Andy with “stomach disease”

Guarding node (cont.)
• Sarah is willing to disclose her exact symptom
• She can specify Ø as the guarding node for her tuple

Name   Age  Sex  Zipcode  Disease  guarding node
Sarah  28   F    37000    flu      Ø

Guarding node (cont.)
• Bill does not have any special preference
• He can specify a guarding node that is the same as his sensitive value

Name  Age  Sex  Zipcode  Disease    guarding node
Bill  5    M    14000    dyspepsia  dyspepsia

A personalized approach

Name   Age  Sex  Zipcode  Disease        guarding node
Andy   4    M    12000    gastric ulcer  stomach disease
Bill   5    M    14000    dyspepsia      dyspepsia
Ken    6    M    18000    pneumonia      respiratory infection
Nash   9    M    19000    bronchitis     bronchitis
Alice  12   F    22000    flu            flu
Betty  19   F    24000    pneumonia      pneumonia
Linda  21   F    33000    gastritis      gastritis
Jane   25   F    34000    gastritis      Ø
Sarah  28   F    37000    flu            Ø
Mary   56   F    58000    flu            flu

Personalized anonymity
• A table satisfies personalized anonymity with a parameter p_breach
  – iff no adversary can breach the privacy requirement of any tuple with a probability above p_breach
• If p_breach = 0.3, then no adversary should have more than a 30% probability of finding out that:
  – Andy had a stomach disease
  – Bill had dyspepsia
  – etc.

Personalized anonymity (cont.)
• We need a method for calculating the breach probabilities

Age       Sex  Zipcode         Disease
[1, 10]   M    [10001, 20000]  gastric ulcer
[1, 10]   M    [10001, 20000]  dyspepsia
[1, 10]   M    [10001, 20000]  pneumonia
[1, 10]   M    [10001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
21        F    33000           stomach disease
25        F    34000           gastritis
28        F    37000           flu
56        F    58000           respiratory infection

• What is the probability that Andy had some stomach problem?

Combinatorial reconstruction
• Assumptions
  – the adversary has no prior knowledge about each individual
  – every individual involved in the microdata also appears in the external database

Combinatorial reconstruction (cont.)
• Andy does not want anyone to know that he had some stomach problem
• What is the probability that the adversary can find out that “Andy had a stomach disease”?
• The adversary has the external database (Andy, Bill, Ken, Nash, Mike, Alice, Betty, Linda, Jane, Sarah, Mary) and the generalized table above; the five males aged 1 to 10 (Andy, Bill, Ken, Nash, Mike) all match the first QI group, which has four tuples

Combinatorial reconstruction (cont.)
• Can each individual appear more than once?
  – No = the primary case
  – Yes = the non-primary case
• Some possible reconstructions assign the four tuples (gastric ulcer, dyspepsia, pneumonia, bronchitis) to the five candidates (Andy, Bill, Ken, Nash, Mike):
  – the primary case: each individual owns at most one tuple
  – the non-primary case: one individual may own several tuples

Breach probability (primary)
• There are 120 possible reconstructions in total
• Suppose Andy is associated with a stomach disease in n_b reconstructions
• The probability that the adversary associates Andy with some stomach problem is n_b / 120
• Andy is associated with
  – gastric ulcer in 24 reconstructions
  – dyspepsia in 24 reconstructions
  – gastritis in 0 reconstructions
• n_b = 48
• The breach probability for Andy’s tuple is 48 / 120 = 2 / 5

Breach probability (non-primary)
• There are 625 possible reconstructions in total
• Andy is associated with gastric ulcer or dyspepsia or gastritis in 225 reconstructions
• n_b = 225
• The breach probability for Andy’s tuple is 225 / 625 = 9 / 25

Breach probability: Formal results
• Stated for the external database and the personalized generalized table shown on the earlier slides

More in our paper
• An algorithm for computing generalized tables that
  – satisfy personalized anonymity with a predefined p_breach
  – reduce information loss by employing generalization on both the QI attributes and the sensitive attribute

Experiment settings 1
• Goal: To show that k-anonymity and l-diversity do not always provide sufficient privacy protection
• Real dataset with attributes Age, Education, Gender, Marital-status, Occupation, Income
• Cardinality = 100k
• Four settings:
  – Pri-leaf
  – Nonpri-leaf
  – Pri-mixed
  – Nonpri-mixed

[Figure: Degree of privacy protection (Pri-leaf), p_breach = 0.25 (k = 4, l = 4)]
[Figure: Degree of privacy protection (Nonpri-leaf), p_breach = 0.25 (k = 4, l = 4)]
[Figure: Degree of privacy protection (Pri-mixed), p_breach = 0.25 (k = 4, l = 4)]
[Figure: Degree of privacy protection (Nonpri-mixed), p_breach = 0.25 (k = 4, l = 4)]

Experiment settings 2
• Goal: To show that applying generalization on both the QI attributes and the sensitive attribute will lead to more effective data analysis

[Figure: Accuracy of analysis (no personalization)]
[Figure: Accuracy of analysis (with personalization)]

Conclusions
• k-anonymity and l-diversity are not sufficient for the non-primary case
• Guarding nodes allow individuals to describe their privacy requirements better
• Generalization on the sensitive attribute is beneficial

Thank you!
Datasets and implementation are available for download at http://www.cs.cityu.edu.hk/~taoyf
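
Backup: the counts on the two "Breach probability" slides (48 / 120 and 225 / 625) can be reproduced by brute-force enumeration. This is a minimal Python sketch under the slides' assumptions (no prior knowledge, five candidates for the four tuples of the first QI group); it is not the formal result from the paper, and the function and variable names are illustrative.

```python
from fractions import Fraction
from itertools import permutations, product

# Candidates from the external database that match the first QI group.
people = ["Andy", "Bill", "Ken", "Nash", "Mike"]
# Sensitive values of the four tuples in that group.
diseases = ["gastric ulcer", "dyspepsia", "pneumonia", "bronchitis"]
# Leaves under Andy's guarding node "stomach disease" in the taxonomy.
stomach = {"gastric ulcer", "dyspepsia", "gastritis"}


def breach_probability(person, guarded, primary):
    """Enumerate every reconstruction (one owner per tuple) and count breaches.

    primary=True:  each individual owns at most one tuple (ordered selections).
    primary=False: an individual may own several tuples (all owner combinations).
    Returns (n_b, total, n_b / total).
    """
    if primary:
        reconstructions = permutations(people, len(diseases))
    else:
        reconstructions = product(people, repeat=len(diseases))
    total = 0
    n_b = 0
    for owners in reconstructions:
        total += 1
        # A breach: the person owns at least one tuple under the guarding node.
        if any(o == person and d in guarded for o, d in zip(owners, diseases)):
            n_b += 1
    return n_b, total, Fraction(n_b, total)


print(breach_probability("Andy", stomach, primary=True))   # (48, 120, Fraction(2, 5))
print(breach_probability("Andy", stomach, primary=False))  # (225, 625, Fraction(9, 25))
```

Enumeration is only practical for tiny groups like this one; it serves as a check on the slide arithmetic rather than as a way to compute breach probabilities for a real table.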