Personalized Privacy Preservation
Xiaokui Xiao, Yufei Tao
City University of Hong Kong
Privacy preserving data publishing
Microdata
Name   Age  Sex  Zipcode  Disease
Andy   4    M    12000    gastric ulcer
Bill   5    M    14000    dyspepsia
Ken    6    M    18000    pneumonia
Nash   9    M    19000    bronchitis
Alice  12   F    22000    flu
Betty  19   F    24000    pneumonia
Linda  21   F    33000    gastritis
Jane   25   F    34000    gastritis
Sarah  28   F    37000    flu
Mary   56   F    58000    flu
• Purposes:
– Allow researchers to effectively study the correlation
between various attributes
– Protect the privacy of every patient
A naïve solution
Name   Age  Sex  Zipcode  Disease
Andy   4    M    12000    gastric ulcer
Bill   5    M    14000    dyspepsia
Ken    6    M    18000    pneumonia
Nash   9    M    19000    bronchitis
Alice  12   F    22000    flu
Betty  19   F    24000    pneumonia
Linda  21   F    33000    gastritis
Jane   25   F    34000    gastritis
Sarah  28   F    37000    flu
Mary   56   F    58000    flu
publish
Age  Sex  Zipcode  Disease
4    M    12000    gastric ulcer
5    M    14000    dyspepsia
6    M    18000    pneumonia
9    M    19000    bronchitis
12   F    22000    flu
19   F    24000    pneumonia
21   F    33000    gastritis
25   F    34000    gastritis
28   F    37000    flu
56   F    58000    flu
• It does not work. See next.
Inference attack
Published table
Age  Sex  Zipcode  Disease
4    M    12000    gastric ulcer
5    M    14000    dyspepsia
6    M    18000    pneumonia
9    M    19000    bronchitis
12   F    22000    flu
19   F    24000    pneumonia
21   F    33000    gastritis
25   F    34000    gastritis
28   F    37000    flu
56   F    58000    flu
An external database (a voter registration list)
Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000
Quasi-identifier (QI) attributes: Age, Sex, Zipcode
An adversary
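The attack on this slide can be sketched as a join on the QI attributes. This is only an illustration: the helper name `link` and the tuple layout are assumptions made for the example, not part of the slides.

```python
# Published table: names removed, QI attributes (age, sex, zipcode) kept exact.
published = [
    (4, "M", 12000, "gastric ulcer"), (5, "M", 14000, "dyspepsia"),
    (6, "M", 18000, "pneumonia"), (9, "M", 19000, "bronchitis"),
    (12, "F", 22000, "flu"), (19, "F", 24000, "pneumonia"),
    (21, "F", 33000, "gastritis"), (25, "F", 34000, "gastritis"),
    (28, "F", 37000, "flu"), (56, "F", 58000, "flu"),
]

# External database (voter registration list): names plus the same QI attributes.
voters = [
    ("Andy", 4, "M", 12000), ("Bill", 5, "M", 14000), ("Ken", 6, "M", 18000),
    ("Nash", 9, "M", 19000), ("Mike", 7, "M", 17000), ("Alice", 12, "F", 22000),
    ("Betty", 19, "F", 24000), ("Linda", 21, "F", 33000), ("Jane", 25, "F", 34000),
    ("Sarah", 28, "F", 37000), ("Mary", 56, "F", 58000),
]


def link(published_rows, voter_rows):
    """Hypothetical helper: join the two tables on (age, sex, zipcode)."""
    by_qi = {}
    for age, sex, zipcode, disease in published_rows:
        by_qi.setdefault((age, sex, zipcode), []).append(disease)
    # For every voter, list the diseases of the published rows with the same QI values.
    return {
        name: by_qi.get((age, sex, zipcode), [])
        for name, age, sex, zipcode in voter_rows
    }


inferred = link(published, voters)
for name, diseases in inferred.items():
    print(name, diseases)
```

Every voter whose QI values are unique in the published table is matched to exactly one disease; Mike, who is not in the microdata, matches nothing.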
Generalization
• Transform each QI value into a less specific form
A generalized table
Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 25]  F    [30001, 35000]  gastritis
[21, 25]  F    [30001, 35000]  gastritis
[26, 60]  F    [35001, 60000]  flu
[26, 60]  F    [35001, 60000]  flu
Information loss
An external database
Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000
k-anonymity
• The following table is 2-anonymous
Quasi-identifier (QI) attributes: Age, Sex, Zipcode
Sensitive attribute: Disease
5 QI groups
Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 25]  F    [30001, 35000]  gastritis
[21, 25]  F    [30001, 35000]  gastritis
[26, 60]  F    [35001, 60000]  flu
[26, 60]  F    [35001, 60000]  flu
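A minimal sketch of how the 2-anonymity claim on this slide could be checked: group the rows by their generalized QI values and take the smallest group size. Representing intervals as `(low, high)` tuples and the helper name `qi_group_sizes` are assumptions made for the example.

```python
from collections import Counter

# The generalized table from this slide; intervals are (low, high) tuples.
generalized = [
    ((1, 5), "M", (10001, 15000), "gastric ulcer"),
    ((1, 5), "M", (10001, 15000), "dyspepsia"),
    ((6, 10), "M", (15001, 20000), "pneumonia"),
    ((6, 10), "M", (15001, 20000), "bronchitis"),
    ((11, 20), "F", (20001, 25000), "flu"),
    ((11, 20), "F", (20001, 25000), "pneumonia"),
    ((21, 25), "F", (30001, 35000), "gastritis"),
    ((21, 25), "F", (30001, 35000), "gastritis"),
    ((26, 60), "F", (35001, 60000), "flu"),
    ((26, 60), "F", (35001, 60000), "flu"),
]


def qi_group_sizes(rows):
    """Hypothetical helper: count rows per combination of QI values."""
    return Counter(row[:3] for row in rows)


sizes = qi_group_sizes(generalized)
k = min(sizes.values())  # the table is k-anonymous for this k
print(f"{len(sizes)} QI groups, smallest group has {k} tuples")
```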
Drawback of k-anonymity
• What is the disease of Linda?
A 2-anonymous table
Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 25]  F    [30001, 35000]  gastritis
[21, 25]  F    [30001, 35000]  gastritis
[26, 60]  F    [35001, 60000]  flu
[26, 60]  F    [35001, 60000]  flu
An external database
Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000
A better criterion: l-diversity
• Each QI group
– has at least l different sensitive values
– even its most frequent sensitive value covers only a limited share of the group's tuples
A 2-diverse table
Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu
An external database
Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Alice  12   F    22000
Mike   7    M    17000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000
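The difference between the 2-anonymous table on the previous slide and the 2-diverse table on this one can be sketched as follows. The code measures, per QI group, the number of distinct sensitive values and the share of the most frequent one; the helper name `diversity` and the dictionary layout are assumptions, and this is a simplified reading of l-diversity rather than its full definition.

```python
from collections import Counter

# Sensitive values per QI group (keys are just labels for the groups).
shared_groups = {
    "[1, 5] M": ["gastric ulcer", "dyspepsia"],
    "[6, 10] M": ["pneumonia", "bronchitis"],
    "[11, 20] F": ["flu", "pneumonia"],
}
two_anonymous = {
    **shared_groups,
    "[21, 25] F": ["gastritis", "gastritis"],  # Linda and Jane
    "[26, 60] F": ["flu", "flu"],
}
two_diverse = {
    **shared_groups,
    "[21, 60] F": ["gastritis", "gastritis", "flu", "flu"],
}


def diversity(groups):
    """Hypothetical helper.

    Returns (fewest distinct sensitive values in any group,
             largest share held by a single value in any group).
    """
    fewest = min(len(set(values)) for values in groups.values())
    largest_share = max(
        max(Counter(values).values()) / len(values) for values in groups.values()
    )
    return fewest, largest_share


print(diversity(two_anonymous))
print(diversity(two_diverse))
```

In the 2-anonymous table, Linda's group contains only gastritis, so her disease is revealed even though her tuple is not singled out; in the 2-diverse table no value covers more than half of any group.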
Motivation 1: Personalization
• Andy does not want anyone to know that he had a stomach problem
• Sarah does not mind at all if others find out that she had flu
A 2-diverse table
Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu
An external database
Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000
Motivation 2: Non-primary case
Microdata
Name   Age  Sex  Zipcode  Disease
Andy   4    M    12000    gastric ulcer
Andy   4    M    12000    dyspepsia
Ken    6    M    18000    pneumonia
Nash   9    M    19000    bronchitis
Alice  12   F    22000    flu
Betty  19   F    24000    pneumonia
Linda  21   F    33000    gastritis
Jane   25   F    34000    gastritis
Sarah  28   F    37000    flu
Mary   56   F    58000    flu
Motivation 2: Non-primary case (cont.)
2-diverse table
Age       Sex  Zipcode         Disease
4         M    12000           gastric ulcer
4         M    12000           dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu
An external database
Name   Age  Sex  Zipcode
Andy   4    M    12000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000
Motivation 3: SA generalization
• How many female patients are there with age above 30?
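• Estimate from the generalized table, assuming ages are spread uniformly over [21, 60]: 4 ∙ (60 – 30 + 1) / (60 – 21 + 1) = 3.1 ≈ 3
• Real answer: 1 (only Mary, aged 56)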
A generalized table
Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu
An external database
Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000
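The arithmetic on this slide can be reproduced with a small sketch. It assumes that ages are spread uniformly inside a generalized interval and, like the slide's formula, counts the query range as 30 to 60 inclusive. The function name `estimate_count` is hypothetical.

```python
def estimate_count(group_size, lo, hi, q_lo, q_hi):
    """Estimate how many tuples of a QI group fall in the query range [q_lo, q_hi].

    Assumes the group's ages are uniformly distributed over the integers lo..hi.
    """
    overlap = max(0, min(hi, q_hi) - max(lo, q_lo) + 1)
    return group_size * overlap / (hi - lo + 1)


# The four female tuples generalized to age [21, 60].
estimate = estimate_count(group_size=4, lo=21, hi=60, q_lo=30, q_hi=60)

# The same four patients in the microdata: Linda, Jane, Sarah, Mary.
true_ages = [21, 25, 28, 56]
actual = sum(1 for age in true_ages if age > 30)

print(f"estimated: {estimate:.1f}, actual: {actual}")
```

The wide interval [21, 60] is what inflates the estimate: three of the four patients are under 30, but the generalized table no longer shows that.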
Motivation 3: SA generalization (cont.)
• Generalization of the sensitive attribute is beneficial in this case
A better generalized table
Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 30]  F    [30001, 40000]  gastritis
[21, 30]  F    [30001, 40000]  gastritis
[21, 30]  F    [30001, 40000]  flu
56        F    58000           respiratory infection
An external database
Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000
Personalized anonymity
• We propose
– a mechanism to capture personalized privacy
requirements
– criteria for measuring the degree of security
provided by a generalized table
– an algorithm for generating publishable tables
Guarding node
any illness
  digestive system problem
    stomach disease
      gastric ulcer
      dyspepsia
      gastritis
  respiratory system problem
    respiratory infection
      flu
      pneumonia
      bronchitis
• Andy does not want anyone to know that he had a stomach problem
• He can specify “stomach disease” as the guarding node for his tuple
Name  Age  Sex  Zipcode  Disease        guarding node
Andy  4    M    12000    gastric ulcer  stomach disease
• The data publisher should prevent an adversary from associating
Andy with “stomach disease”
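A guarding node can be read as "protect every value in this subtree". The sketch below encodes the disease hierarchy as a child-to-parent map and tests subtree membership. The hierarchy is transcribed from the slide's diagram as best it can be recovered, and the names `PARENT` and `in_subtree` are assumptions made for the example.

```python
# Child -> parent links of the disease taxonomy (the root "any illness" has no parent).
PARENT = {
    "digestive system problem": "any illness",
    "respiratory system problem": "any illness",
    "stomach disease": "digestive system problem",
    "respiratory infection": "respiratory system problem",
    "gastric ulcer": "stomach disease",
    "dyspepsia": "stomach disease",
    "gastritis": "stomach disease",
    "flu": "respiratory infection",
    "pneumonia": "respiratory infection",
    "bronchitis": "respiratory infection",
}


def in_subtree(value, guarding_node):
    """Return True if `value` is the guarding node itself or one of its descendants."""
    node = value
    while node is not None:
        if node == guarding_node:
            return True
        node = PARENT.get(node)  # None once we step past the root
    return False


# Andy's guarding node is "stomach disease": any value below it is protected.
for disease in ["gastric ulcer", "dyspepsia", "gastritis", "flu"]:
    print(disease, in_subtree(disease, "stomach disease"))
```

With this reading, associating Andy with dyspepsia would violate his requirement just as much as associating him with gastric ulcer, because both lie under "stomach disease".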
Guarding node
any illness
  digestive system problem
    stomach disease
      gastric ulcer
      dyspepsia
      gastritis
  respiratory system problem
    respiratory infection
      flu
      pneumonia
      bronchitis
• Sarah is willing to disclose her exact symptom
• She can specify Ø as the guarding node for her tuple
Name   Age  Sex  Zipcode  Disease  guarding node
Sarah  28   F    37000    flu      Ø
Guarding node
any illness
  digestive system problem
    stomach disease
      gastric ulcer
      dyspepsia
      gastritis
  respiratory system problem
    respiratory infection
      flu
      pneumonia
      bronchitis
• Bill does not have any special preference
• He can set the guarding node of his tuple to his sensitive value itself
Name  Age  Sex  Zipcode  Disease    guarding node
Bill  5    M    14000    dyspepsia  dyspepsia
A personalized approach
any illness
  digestive system problem
    stomach disease
      gastric ulcer
      dyspepsia
      gastritis
  respiratory system problem
    respiratory infection
      flu
      pneumonia
      bronchitis
Name   Age  Sex  Zipcode  Disease        guarding node
Andy   4    M    12000    gastric ulcer  stomach disease
Bill   5    M    14000    dyspepsia      dyspepsia
Ken    6    M    18000    pneumonia      respiratory infection
Nash   9    M    19000    bronchitis     bronchitis
Alice  12   F    22000    flu            flu
Betty  19   F    24000    pneumonia      pneumonia
Linda  21   F    33000    gastritis      gastritis
Jane   25   F    34000    gastritis      Ø
Sarah  28   F    37000    flu            Ø
Mary   56   F    58000    flu            flu
Personalized anonymity
Name   Age  Sex  Zipcode  Disease        guarding node
Andy   4    M    12000    gastric ulcer  stomach disease
Bill   5    M    14000    dyspepsia      dyspepsia
Ken    6    M    18000    pneumonia      respiratory infection
Nash   9    M    19000    bronchitis     bronchitis
Alice  12   F    22000    flu            flu
Betty  19   F    24000    pneumonia      pneumonia
Linda  21   F    33000    gastritis      gastritis
Jane   25   F    34000    gastritis      Ø
Sarah  28   F    37000    flu            Ø
Mary   56   F    58000    flu            flu
• A table satisfies personalized anonymity with a parameter p_breach
– iff no adversary can breach the privacy requirement of any tuple with a probability above p_breach
• If p_breach = 0.3, then no adversary should have more than a 30% probability of finding out that:
– Andy had a stomach disease
– Bill had dyspepsia
– etc.
Personalized anonymity
• Personalized anonymity with respect to a predefined parameter p_breach
– an adversary can breach the privacy requirement of any tuple with a probability of at most p_breach
• We need a method for calculating the breach probabilities
Age       Sex  Zipcode         Disease
[1, 10]   M    [10001, 20000]  gastric ulcer
[1, 10]   M    [10001, 20000]  dyspepsia
[1, 10]   M    [10001, 20000]  pneumonia
[1, 10]   M    [10001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
21        F    33000           stomach disease
25        F    34000           gastritis
28        F    37000           flu
56        F    58000           respiratory infection
What is the probability that Andy had some stomach problem?
Combinatorial reconstruction
• Assumptions
– the adversary has no prior knowledge about each
individual
– every individual involved in the microdata also appears
in the external database
Combinatorial reconstruction
• Andy does not want anyone to know that he had some
stomach problem
• What is the probability that the adversary can find out that
“Andy had a stomach disease”?
Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000
Age       Sex  Zipcode         Disease
[1, 10]   M    [10001, 20000]  gastric ulcer
[1, 10]   M    [10001, 20000]  dyspepsia
[1, 10]   M    [10001, 20000]  pneumonia
[1, 10]   M    [10001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
21        F    33000           stomach disease
25        F    34000           gastritis
28        F    37000           flu
56        F    58000           respiratory infection
Combinatorial reconstruction (cont.)
• Can each individual appear more than once?
– No = the primary case
– Yes = the non-primary case
• Some possible reconstructions (figure), shown for the primary case and for the non-primary case:
– candidates: Andy, Bill, Ken, Nash, Mike
– sensitive values: gastric ulcer, dyspepsia, pneumonia, bronchitis
Breach probability (primary)
Candidates: Andy, Bill, Ken, Nash, Mike
Sensitive values: gastric ulcer, dyspepsia, pneumonia, bronchitis
any illness
  digestive system problem
    stomach disease
      gastric ulcer
      dyspepsia
      gastritis
  respiratory system problem
    respiratory infection
      flu
      pneumonia
      bronchitis
• 120 possible reconstructions in total
• If Andy is associated with a stomach disease in n_b reconstructions, the probability that the adversary associates Andy with some stomach problem is n_b / 120
• Andy is associated with
– gastric ulcer in 24 reconstructions
– dyspepsia in 24 reconstructions
– gastritis in 0 reconstructions
• n_b = 48
• The breach probability for Andy’s tuple is 48 / 120 = 2 / 5
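The counts on this slide can be checked by brute force. The sketch below enumerates every way of assigning the four tuples of the first QI group to distinct candidates (the primary case, where each individual owns at most one tuple). Variable names are assumptions made for the example; this is an enumeration for illustration, not the closed-form result from the paper.

```python
from fractions import Fraction
from itertools import permutations

candidates = ["Andy", "Bill", "Ken", "Nash", "Mike"]
diseases = ["gastric ulcer", "dyspepsia", "pneumonia", "bronchitis"]
# Leaves under Andy's guarding node "stomach disease".
stomach = {"gastric ulcer", "dyspepsia", "gastritis"}

# Primary case: the owners of the four tuples are four distinct candidates.
reconstructions = list(permutations(candidates, len(diseases)))

# n_b: reconstructions in which Andy owns a tuple whose disease is a stomach disease.
n_b = sum(
    any(owner == "Andy" and disease in stomach
        for owner, disease in zip(owners, diseases))
    for owners in reconstructions
)

breach = Fraction(n_b, len(reconstructions))
print(len(reconstructions), n_b, breach)
```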
Breach probability (non-primary)
Candidates: Andy, Bill, Ken, Nash, Mike
Sensitive values: gastric ulcer, dyspepsia, pneumonia, bronchitis
any illness
  digestive system problem
    stomach disease
      gastric ulcer
      dyspepsia
      gastritis
  respiratory system problem
    respiratory infection
      flu
      pneumonia
      bronchitis
• 625 possible reconstructions in total
• Andy is associated with gastric ulcer, dyspepsia, or gastritis in 225 reconstructions
• n_b = 225
• The breach probability for Andy’s tuple is 225 / 625 = 9 / 25
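The non-primary counts can be checked the same way. Here each of the four tuples may belong to any of the five candidates independently, so one person can own several tuples. As before, the variable names are assumptions and the enumeration is only an illustrative check.

```python
from fractions import Fraction
from itertools import product

candidates = ["Andy", "Bill", "Ken", "Nash", "Mike"]
diseases = ["gastric ulcer", "dyspepsia", "pneumonia", "bronchitis"]
# Leaves under Andy's guarding node "stomach disease".
stomach = {"gastric ulcer", "dyspepsia", "gastritis"}

# Non-primary case: every tuple independently picks one of the five candidates.
reconstructions = list(product(candidates, repeat=len(diseases)))

# n_b: reconstructions in which Andy owns at least one stomach-disease tuple.
n_b = sum(
    any(owner == "Andy" and disease in stomach
        for owner, disease in zip(owners, diseases))
    for owners in reconstructions
)

breach = Fraction(n_b, len(reconstructions))
print(len(reconstructions), n_b, breach)
```

The same count follows from the complement: Andy owns neither stomach-disease tuple in 4 ∙ 4 ∙ 5 ∙ 5 = 400 reconstructions, leaving 625 – 400 = 225.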
Breach probability: Formal results
Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000
Age       Sex  Zipcode         Disease
[1, 10]   M    [10001, 20000]  gastric ulcer
[1, 10]   M    [10001, 20000]  dyspepsia
[1, 10]   M    [10001, 20000]  pneumonia
[1, 10]   M    [10001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
21        F    33000           stomach disease
25        F    34000           gastritis
28        F    37000           flu
56        F    58000           respiratory infection
More in our paper
• An algorithm for computing generalized tables that
– satisfy personalized anonymity with a predefined p_breach
– reduce information loss by generalizing both the QI attributes and the sensitive attribute
Experiment settings 1
• Goal: To show that k-anonymity and l-diversity do not
always provide sufficient privacy protection
• Real dataset
Age, Education, Gender, Marital-status, Occupation, Income
• Pri-leaf
• Nonpri-leaf
• Pri-mixed
• Nonpri-mixed
• Cardinality = 100k
Degree of privacy protection (Pri-leaf): p_breach = 0.25 (k = 4, l = 4)
Degree of privacy protection (Nonpri-leaf): p_breach = 0.25 (k = 4, l = 4)
Degree of privacy protection (Pri-mixed): p_breach = 0.25 (k = 4, l = 4)
Degree of privacy protection (Nonpri-mixed): p_breach = 0.25 (k = 4, l = 4)
Experiment settings 2
• Goal: To show that applying generalization on
both the QI attributes and the sensitive attribute
will lead to more effective data analysis
Accuracy of analysis (no personalization)
Accuracy of analysis (with personalization)
Conclusions
• k-anonymity and l-diversity are not
sufficient for the Non-primary case
• Guarding nodes allow individuals to
describe their privacy requirements better
• Generalization on the sensitive attribute is
beneficial
Thank you!
Datasets and implementation are
available for download at
http://www.cs.cityu.edu.hk/~taoyf