
Differentially Private Data Release for Data Mining
Noman Mohammed, Concordia University, Montreal, QC, Canada
Rui Chen, Concordia University, Montreal, QC, Canada
Benjamin C.M. Fung, Concordia University, Montreal, QC, Canada
Philip S. Yu, University of Illinois at Chicago, IL, USA
Outline
Overview
Differential privacy
Related Work
Our Algorithm
Experimental results
Conclusion
Overview
Privacy model
Anonymization algorithm
Data utility
Contributions
Proposed an anonymization algorithm that provides a differential privacy guarantee
Generalization-based algorithm for differentially private data release
The proposed algorithm can handle both categorical and numerical attributes
Preserves information for classification analysis
Differential Privacy [DMNS06]
D and D' are neighbors if they differ on at most one record.
A non-interactive privacy mechanism A gives ε-differential privacy if, for all neighbors D and D' and for any possible sanitized database D*,
$\Pr_A[A(D) = D^*] \le \exp(\varepsilon) \times \Pr_A[A(D') = D^*]$
Laplace Mechanism [DMNS06]
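The sensitivity of a function f is the largest change in its output between neighbors D and D':
$\Delta f = \max_{D, D'} \lVert f(D) - f(D') \rVert_1$
For a counting query f: Δf = 1
For example, for a single counting query Q over a dataset D, returning Q(D) + Laplace(1/ε) maintains ε-differential privacy.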
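The counting-query example can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the names `laplace_noise` and `noisy_count` are hypothetical, and the noise is sampled with the standard library only, using the fact that the difference of two independent exponential draws with the same mean follows a Laplace distribution.

```python
import random

def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(true_count, epsilon, sensitivity=1.0, rng=random):
    """Return true_count + Laplace(sensitivity / epsilon)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical true count of 100; the seed is fixed only for repeatability.
rng = random.Random(0)
samples = [noisy_count(100, epsilon=1.0, rng=rng) for _ in range(20000)]
mean_answer = sum(samples) / len(samples)
mean_abs_error = sum(abs(s - 100) for s in samples) / len(samples)
```

With ε = 1 the noisy answers centre on the true count and are off by about 1/ε on average; a smaller ε widens the noise in proportion.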
Exponential Mechanism [MT07]
Given a utility function u : (D × T) → R for a database instance D, the mechanism A,
A(D, u) = return t with probability proportional to
$\exp\left(\frac{\varepsilon \, u(D, t)}{2 \Delta u}\right)$
gives ε-differential privacy.
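A small Python sketch of this selection rule may help. It assumes a finite candidate set T and a utility function already evaluated on D. The function name `exponential_mechanism`, the candidate names, and the scores are all hypothetical and chosen only for illustration.

```python
import math
import random

def exponential_mechanism(candidates, utility, epsilon, sensitivity, rng=random):
    """Pick one candidate with probability proportional to
    exp(epsilon * utility(t) / (2 * sensitivity))."""
    scores = [utility(t) for t in candidates]
    top = max(scores)  # subtracting the max leaves the distribution unchanged
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]

# Hypothetical candidates and scores, for illustration only.
toy_scores = {"Job": 6.0, "Age": 5.0, "Education": 2.0}
rng = random.Random(1)  # fixed seed only for repeatability
picks = [exponential_mechanism(list(toy_scores), toy_scores.get, 1.0, 1.0, rng)
         for _ in range(5000)]
share_job = picks.count("Job") / len(picks)
```

The highest-scoring candidate is chosen most often but not always. Subtracting the maximum score before exponentiating only guards against overflow for large ε × u.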
Composition properties
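Sequential composition: mechanisms that each give ε_i-differential privacy on the same data together give $\sum_i \varepsilon_i$-differential privacy
Parallel composition: mechanisms that each give ε_i-differential privacy on disjoint subsets of the data together give $\max_i(\varepsilon_i)$-differential privacy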
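The two composition rules amount to simple budget arithmetic, sketched below. The helper names and the per-step budgets are hypothetical; the only assumption is that each step is itself ε_i-differentially private.

```python
def sequential_epsilon(epsilons):
    """Total budget when every mechanism sees the same records."""
    return sum(epsilons)

def parallel_epsilon(epsilons):
    """Total budget when each mechanism sees a disjoint subset of the records."""
    return max(epsilons)

# Hypothetical per-step budgets.
step_budgets = [0.25, 0.25, 0.5]
total_same_data = sequential_epsilon(step_budgets)
total_disjoint = parallel_epsilon(step_budgets)
```

An algorithm that runs several private steps over the same records therefore has to split its overall ε across them, while steps applied to disjoint partitions can each use the full amount.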
Two Frameworks
Interactive: multiple questions are asked and answered adaptively by the anonymizer
Non-interactive: the data is anonymized and released by the anonymizer
Related Work
- A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical privacy: The SuLQ framework. In PODS, 2005.
- A. Friedman and A. Schuster. Data mining with differential privacy. In SIGKDD, 2010.
Is it possible to release data for classification analysis?
Why a Non-interactive Framework?
Disadvantages of the interactive approach:
- The database can answer only a limited number of queries
- This is a big problem if there are many data miners
- Provides less flexibility for data analysis
Non-interactive Framework
Each released count receives Laplace noise; even an empty cell becomes 0 + Lap(1/ε).
For high-dimensional data, the noise is too big.
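A rough back-of-the-envelope check shows why the noise dominates. The sketch below assumes every cell of the full cross-product of attribute values gets its own Lap(1/ε) draw and uses the fact that the expected absolute value of Lap(b) is b. The attribute domain sizes are hypothetical; the record count is the one quoted for the Adult data set in the experiments.

```python
import math

def cell_count(domain_sizes):
    """Number of cells in the full cross-product of attribute values."""
    return math.prod(domain_sizes)

def expected_total_noise(domain_sizes, epsilon):
    """Expected sum of |noise| over all cells, using E|Lap(b)| = b with b = 1/epsilon."""
    return cell_count(domain_sizes) / epsilon

n_records = 45222                        # Adult record count from the experiments
low_dim = [2, 5, 10]                     # hypothetical domain sizes, 3 attributes
high_dim = [2, 5, 10, 16, 7, 14, 6, 42]  # hypothetical domain sizes, 8 attributes

noise_low = expected_total_noise(low_dim, epsilon=1.0)
noise_high = expected_total_noise(high_dim, epsilon=1.0)
```

With three small attributes the total noise is tiny next to the number of records. With eight attributes the table has tens of millions of cells, most of them empty, and the accumulated noise exceeds the real counts by orders of magnitude.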
Anonymization Algorithm
Partitioning of the records by successive specializations:
Job            Age       Class   Count
Any_Job        [18-65)   4Y4N    8
Artist         [18-65)   2Y2N    4
Professional   [18-65)   2Y2N    4
Artist         [18-40)   2Y2N    4
Artist         [40-65)   0Y0N    0
Professional   [18-40)   2Y1N    3
Professional   [40-65)   0Y1N    1
Taxonomy tree for Job: Any_Job → Professional (Engineer, Lawyer), Artist (Dancer, Writer)
Taxonomy tree for Age: [18-65) → [18-40) ([18-30), [30-40)), [40-65)
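The partitioning in the figure can be reproduced with a short sketch. The raw records below are hypothetical: the slide shows only generalized values and counts, so the eight rows were invented to match the counts 8, then 4 and 4, then 4, 0, 3, and 1. The taxonomy (Engineer and Lawyer under Professional, Dancer and Writer under Artist, Age split at 40) is read from the figure labels, and the function names are illustrative.

```python
from collections import Counter

# Taxonomy read from the figure: each leaf job maps to its parent category.
JOB_PARENT = {"Engineer": "Professional", "Lawyer": "Professional",
              "Dancer": "Artist", "Writer": "Artist"}

def age_bucket(age, split=40, low=18, high=65):
    """Map a raw age to the interval [low-split) or [split-high)."""
    return f"[{low}-{split})" if age < split else f"[{split}-{high})"

# Hypothetical raw records (job, age, class), invented to match the slide's counts.
RECORDS = [
    ("Engineer", 34, "Y"), ("Lawyer", 50, "N"),
    ("Engineer", 38, "N"), ("Lawyer", 33, "Y"),
    ("Dancer", 20, "Y"), ("Writer", 37, "N"),
    ("Writer", 32, "Y"), ("Dancer", 25, "N"),
]

def partition(records):
    """Group records by generalized (job, age interval) and count classes."""
    groups = {}
    for job, age, label in records:
        key = (JOB_PARENT[job], age_bucket(age))
        groups.setdefault(key, Counter())[label] += 1
    return groups

groups = partition(RECORDS)
```

Only non-empty partitions appear in `groups`; the empty Artist, [40-65) partition from the figure (0Y0N, count 0) would have to be added explicitly before release.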
Candidate Selection
We favor the specialization with the maximum Score value.
First utility function: Δu = [value lost in extraction]
Second utility function: Δu = 1
Split Value
The split value of a categorical attribute is determined according to the taxonomy tree of the attribute.
How do we determine the split value for a numerical attribute?
Age   Class
60    Y
30    N
25    Y
40    N
25    Y
40    N
45    N
25    Y
[Figure: Age axis from 18 to 65 with the distinct values 25, 30, 40, 45, and 60 marked]
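One way to compare candidate split values on the eight example records is to score each boundary between distinct ages. The sketch below uses information gain as the score, which is an assumption suggested by the InfoGain label in the experiments; it is a non-private calculation, and the deck does not spell out how the split value is then chosen privately. The function names are illustrative.

```python
import math
from collections import Counter

# (Age, Class) records from the slide.
RECORDS = [(60, "Y"), (30, "N"), (25, "Y"), (40, "N"),
           (25, "Y"), (40, "N"), (45, "N"), (25, "Y")]

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def info_gain(records, split):
    """Gain from splitting into age < split and age >= split."""
    left = [cls for age, cls in records if age < split]
    right = [cls for age, cls in records if age >= split]
    everything = left + right
    n = len(everything)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(everything) - remainder

# Splitting at the smallest age would leave one side empty, so skip it.
candidates = sorted({age for age, _ in RECORDS})[1:]
gains = {v: info_gain(RECORDS, v) for v in candidates}
best_split = max(gains, key=gains.get)
```

On these records the boundary at 30 separates the three 25-year-old Y records from the rest and scores highest, while a split at 45 leaves both sides evenly mixed and gains nothing.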
Anonymization Algorithm
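[Figure: algorithm pseudocode, not recoverable; its steps are annotated with the running times O(A_pr × |D| log |D|), O(|candidates|), O(|D|), O(|D| log |D|), and O(1)]
Overall: O((A_pr + h) × |D| log |D|)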
Experimental Evaluation
- Adult: census data (from the UCI repository)
- 6 continuous attributes
- 8 categorical attributes
- 45,222 census records
Data Utility for Max
[Figure: average accuracy (%) versus number of specializations (4, 7, 10, 13, 16) for ε = 0.1, 0.25, 0.5, and 1; reference lines BA = 85.3% and LA = 75.5%]
Data Utility for InfoGain
[Figure: average accuracy (%) versus number of specializations (4, 7, 10, 13, 16) for ε = 0.1, 0.25, 0.5, and 1; reference lines BA = 85.3% and LA = 75.5%]
Comparison
[Figure: average accuracy (%) versus privacy budget (0.75, 1, 2, 3, 4) for DiffP-C4.5, DiffGen (h=15), and TDS (k=5); reference lines BA = 85.3% and LA = 75.5%]
Scalability
[Figure: time (seconds) versus number of records (200 to 1000, in thousands) for ε = 1 and h = 15; series: Reading, Anonymization, Writing, Total]
Conclusions
Differentially Private Data Release
- Generalization-based differentially private algorithm
- Provides better utility than existing techniques
Thank You Very Much

Q&A