Publishing Set-Valued Data via Differential Privacy
Rui Chen, Concordia University
Noman Mohammed, Concordia University
Benjamin C. M. Fung, Concordia University
Bipin C. Desai, Concordia University
Li Xiong, Emory University
VLDB 2011
Outline
Introduction
Preliminaries
Sanitization algorithm
Experimental results
Conclusions
Introduction
• The problem: non-interactive set-valued data publication under differential privacy
• Typical set-valued data: transaction data, web search queries
Introduction
• Set-valued data refers to data in which each record owner is associated with a set of items drawn from an item universe.

TID | Items
t1  | I1, I2, I3, I4
t2  | I2, I4
t3  | I2
t4  | I1, I2
t5  | I2
t6  | I1
t7  | I1, I2, I3, I4
t8  | I2, I3, I4
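For concreteness, the example database above can be written down directly as Python data. This is only an illustrative sketch; the variable name D and the item strings are ours, not from the paper:

```python
# Illustrative representation of the example set-valued database.
# Each record is the set of items associated with one record owner.
D = [
    {"I1", "I2", "I3", "I4"},  # t1
    {"I2", "I4"},              # t2
    {"I2"},                    # t3
    {"I1", "I2"},              # t4
    {"I2"},                    # t5
    {"I1"},                    # t6
    {"I1", "I2", "I3", "I4"},  # t7
    {"I2", "I3", "I4"},        # t8
]
```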
Introduction
• Existing works [1, 2, 3, 4, 5, 6, 7] on publishing set-valued data are based on partition-based privacy models [8].
• They provide insufficient privacy protection and are vulnerable to background knowledge:
– Composition attack [8]
– deFinetti attack [9]
– Foreground knowledge attack [10]
Introduction
• Differential privacy is independent of an adversary's background knowledge and computational power (with exceptions [11]).
• The outcome of any analysis should not overly depend on a single data record.
• Existing differentially private data publishing approaches do not provide adequate utility or scalability for our problem.
Introduction
• Problems of data-independent publishing approaches:
Universe I = {I1, I2, I3}; possible itemsets: {I1}, {I2}, {I3}, {I1, I2}, {I1, I3}, {I2, I3}, {I1, I2, I3}
• Scalability: O(2^n), where n = |I| is the size of the item universe
• Utility: noise accumulates exponentially
Outline
Introduction
Preliminaries
Sanitization algorithm
Experimental results
Conclusions
Preliminaries
• Context-free taxonomy tree
• Each internal node represents the set of its leaf items, not necessarily a semantic generalization of them
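A context-free taxonomy tree can be sketched as a simple parent-to-children map. The tree below (fan-out 2 over the universe {I1, ..., I4} used in the running example) and all names are illustrative assumptions, not the paper's code:

```python
# A context-free taxonomy tree: each internal node is just the set of the
# leaf items below it, with no semantic meaning attached (illustrative sketch).
taxonomy = {
    "I{1,2,3,4}": ["I{1,2}", "I{3,4}"],   # root covers the whole universe
    "I{1,2}": ["I1", "I2"],               # internal node = {I1, I2}
    "I{3,4}": ["I3", "I4"],               # internal node = {I3, I4}
}

def leaves(node):
    """Return the set of leaf items represented by a taxonomy node."""
    children = taxonomy.get(node)
    if children is None:                  # a leaf item represents itself
        return {node}
    return set().union(*(leaves(c) for c in children))
```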
Preliminaries
• Differential privacy [12]
• D and D' are neighbours if they differ in at most one record.
A non-interactive privacy mechanism A gives ε-differential privacy if, for all neighbours D and D' and for any possible sanitized database D* ∈ Range(A),
Pr_A[A(D) = D*] ≤ exp(ε) × Pr_A[A(D') = D*],
where the probability is taken over the randomness of A.
Preliminaries
• Laplace mechanism [12]
The global sensitivity of a function f is the maximum change in f between any two neighbouring databases; the Laplace mechanism returns f(D) + Laplace(GS(f)/ε).
For example, for a single counting query Q over a dataset D, returning Q(D) + Laplace(1/ε) gives ε-differential privacy.
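A minimal sketch of the Laplace mechanism for a single counting query, as in the example above; the function names are ours and the code is illustrative rather than the paper's implementation:

```python
import numpy as np

def counting_query(D, itemset):
    """Q(D): the number of records in D that contain every item in `itemset`."""
    return sum(1 for t in D if itemset <= t)

def noisy_count(D, itemset, eps, rng=np.random.default_rng()):
    """Laplace mechanism for a counting query (global sensitivity 1):
    return Q(D) + Laplace(1/eps), which gives eps-differential privacy."""
    return counting_query(D, itemset) + rng.laplace(loc=0.0, scale=1.0 / eps)

# Example with the dataset sketched earlier: the true count of {"I2", "I4"} is 4,
# so noisy_count(D, {"I2", "I4"}, eps=1.0) returns 4 plus Laplace(1) noise.
```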
Preliminaries
• Exponential mechanism [13]
Given a utility function q : (D × R) → ℝ for a database instance D, the mechanism A defined by
A(D, q) = { return r with probability ∝ exp(ε · q(D, r) / (2 Δq)) }
gives ε-differential privacy, where Δq is the sensitivity of q.
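A minimal sketch of the exponential mechanism over a finite candidate set; the function and parameter names are ours:

```python
import numpy as np

def exponential_mechanism(D, candidates, quality, sensitivity, eps,
                          rng=np.random.default_rng()):
    """Return a candidate r sampled with probability proportional to
    exp(eps * quality(D, r) / (2 * sensitivity)); gives eps-differential privacy.
    `candidates` is a list, `quality` is the utility function q, `sensitivity` its Δq."""
    scores = np.array([quality(D, r) for r in candidates], dtype=float)
    # Shift by the maximum score for numerical stability; this does not change
    # the normalized sampling probabilities.
    weights = np.exp(eps * (scores - scores.max()) / (2.0 * sensitivity))
    probs = weights / weights.sum()
    return candidates[rng.choice(len(candidates), p=probs)]
```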
Preliminaries
• Composition properties [14]
• Sequential composition: a sequence of computations over the same data, each giving ε_i-differential privacy, gives (Σ_i ε_i)-differential privacy.
• Parallel composition: computations on disjoint subsets of the data, each giving ε_i-differential privacy, give max(ε_i)-differential privacy.
• For example, two computations over the same records, each with budget ε/2, consume ε in total, while computations on disjoint partitions can each use the full ε.
Preliminaries
• Utility metrics
For a given itemset I' ⊆ I, a counting query Q over a dataset D is defined as Q(D) = |{ t ∈ D : I' ⊆ t }|.
A privacy mechanism A is (α, δ)-useful if, with probability at least 1 − δ, for every counting query Q and every dataset D, |Q(D*) − Q(D)| ≤ α, where D* = A(D). [15]
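As a rough illustration of (α, δ)-usefulness for a single counting query under the Laplace mechanism, one can estimate empirically how often the noisy answer stays within α of the true count (for Laplace noise the error distribution does not depend on the data). The function name and the numbers in the comment are ours:

```python
import numpy as np

def empirical_usefulness(eps, alpha, trials=100_000, rng=np.random.default_rng()):
    """Estimate Pr[|Q(D*) - Q(D)| <= alpha] for one counting query answered with
    Laplace(1/eps) noise; (alpha, delta)-usefulness requires this probability to be
    at least 1 - delta for every counting query and every dataset."""
    errors = np.abs(rng.laplace(scale=1.0 / eps, size=trials))
    return float(np.mean(errors <= alpha))

# With eps = 1.0 and alpha = 3, the estimate is about 0.95
# (exactly 1 - exp(-eps * alpha) for a single Laplace-noised count).
```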
Outline
Introduction
Preliminaries
Sanitization algorithm
Experimental results
Conclusions
Sanitization Algorithm
• Top-down partitioning
TID | Items
t1  | I1, I2, I3, I4
t2  | I2, I4
t3  | I2
t4  | I1, I2
t5  | I2
t6  | I1
t7  | I1, I2, I3, I4
t8  | I2, I3, I4
• Generalize all records to a single partition
• Keep partitioning non-empty partitions until leaf partitions are reached
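A rough sketch of the top-down partitioning idea, reusing the taxonomy, leaves and D sketches from earlier. It only follows non-empty partitions and omits the noisy counts, thresholds, budget allocation and empty-partition handling discussed on the following slides, so it is not the paper's algorithm and all names are ours:

```python
def expand(cut, records):
    """Split a partition along one non-leaf node of its hierarchy cut.
    Invariant from the generalization step: every record in the partition
    intersects every node of the cut."""
    non_leaf = [u for u in cut if u in taxonomy]
    if not non_leaf:                                   # leaf partition: nothing to split
        return None
    u = non_leaf[0]
    rest = [v for v in cut if v != u]
    groups = {}
    for t in records:
        key = frozenset(c for c in taxonomy[u] if t & leaves(c))
        groups.setdefault(key, []).append(t)
    # Each sub-partition replaces u by the subset of u's children it actually uses.
    return [(rest + sorted(key), ts) for key, ts in groups.items()]

def top_down_partition(records, root="I{1,2,3,4}"):
    """Generalize all records to a single root partition, then keep splitting
    non-empty partitions until every remaining partition is a leaf partition."""
    stack, leaf_partitions = [([root], list(records))], []
    while stack:
        cut, ts = stack.pop()
        subs = expand(cut, ts)
        if subs is None:
            leaf_partitions.append((cut, len(ts)))     # true sizes; noise is added elsewhere
        else:
            stack.extend(subs)
    return leaf_partitions
```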
Sanitization Algorithm
• Privacy budget allocation
• We reserve B/2 for obtaining noisy sizes of leaf partitions and the remaining B/2 for guiding the partitioning.
• Assign less budget to more general partitions and more budget to more specific partitions.
Sanitization Algorithm
• Privacy budget allocation
A hierarchy cut needs at most Σ_{u_i ∈ cut} |internal(u_i)| partition operations to reach leaf partitions, where internal(u_i) is the set of internal nodes of the subtree rooted at u_i.
Example: the cut {I_{1,2}, I_{3,4}} needs at most two partition operations to reach leaf partitions.
Sanitization Algorithm
• Privacy budget allocation: example over the taxonomy of {I1, I2, I3, I4}
• First partition operation (the root cut needs at most 3 operations): (B/2)/3 = B/6
• Second partition operation (at most 2 operations remain): (B/2 − B/6)/2 = B/6
• Leaf partition: the unused B/6 is added to the reserved B/2, giving B/6 + B/2 = 2B/3 for its noisy count
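A small worked sketch of the allocation above, under our reading of the slide: each partition operation takes the remaining partitioning budget divided by the maximum number of operations still needed, and whatever is left at a leaf is added to the reserved B/2 for its noisy count. The function name is ours:

```python
def allocate(B, ops_remaining_per_step):
    """Split the partitioning half of the budget adaptively along one
    root-to-leaf partitioning path; return the per-operation budgets and
    the budget available for the leaf's noisy count."""
    partitioning_budget = B / 2.0                 # the other B/2 is reserved for leaf counts
    per_operation = []
    for ops_remaining in ops_remaining_per_step:  # e.g. [3, 2] for the slide's example path
        share = partitioning_budget / ops_remaining
        per_operation.append(share)
        partitioning_budget -= share
    leaf_budget = partitioning_budget + B / 2.0   # unused budget is given to the leaf count
    return per_operation, leaf_budget

# allocate(1.0, [3, 2]) -> ([1/6, 1/6], 2/3), i.e. B/6, B/6 and 2B/3 as on the slide.
```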
Sanitization Algorithm
• Sub-partition generation
• For a non-leaf partition, we need to consider all possible sub-partitions to satisfy differential privacy.
• Efficient implementation: separately handling empty and non-empty partitions (inspired by [16]).
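A hedged sketch of one way to avoid materializing the (possibly very many) empty sub-partitions, in the spirit of [16]: since an empty sub-partition's noisy count is pure Laplace noise, one can sample how many of them would exceed the expansion threshold and draw only those counts from the conditional distribution. The function and parameter names are ours, not the paper's:

```python
import numpy as np

def simulate_empty_subpartitions(num_empty, eps, threshold, rng=np.random.default_rng()):
    """Return noisy counts for only those empty sub-partitions whose
    Laplace(1/eps) noise would exceed `threshold` (threshold >= 0)."""
    b = 1.0 / eps                               # Laplace scale for a unit-sensitivity count
    p_pass = 0.5 * np.exp(-threshold / b)       # Pr[Laplace(0, b) > threshold]
    k = rng.binomial(num_empty, p_pass)         # how many empty sub-partitions pass
    # Conditioned on exceeding the threshold, the noise is threshold + Exponential(b).
    return threshold + rng.exponential(scale=b, size=k)
```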
Outline
Introduction
Preliminaries
Sanitization algorithm
Experimental results
Conclusions
Experiments
• Two real-life set-valued datasets are used.
• MSNBC is publicly available at the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/index.html).
• STM is provided by the Société de transport de Montréal (STM) (http://www.stm.info).
Experiments
• Average relative error vs. privacy budget
[Figures: average relative error for privacy budgets B = 0.5, 0.75 and 1.0]
Experiments
• Utility for frequent itemset mining
[Figures: frequent itemset mining utility for privacy budgets B = 0.5, 0.75 and 1.0]
Experiments
• Scalability: O(|D| × |I|)
[Figures: runtime vs. |D| and runtime vs. |I|]
Outline
Introduction
Preliminaries
Sanitization algorithm
Experimental results
Conclusions
Conclusions
• Differential privacy can be successfully applied to non-interactive set-valued data publishing with guaranteed utility.
• Differential privacy can be achieved by data-dependent solutions with improved efficiency and accuracy.
• The general idea of data-dependent solutions applies to other types of data, for example, relational data [17] and trajectory data [18].
References
[1] J. Cao, P. Karras, C. Raissi, and K.-L. Tan. ρ–uncertainty inference proof transaction anonymization. In VLDB, pp. 1033–1044, 2010.
[2] G. Ghinita, Y. Tao, and P. Kalnis. On the anonymization of sparse high-dimensional data. In ICDE, pp. 715–724, 2008.
[3] Y. He and J. F. Naughton. Anonymization of set-valued data via top-down, local generalization. In VLDB, pp. 934–945, 2009.
[4] M. Terrovitis, N. Mamoulis, and P. Kalnis. Privacy-preserving anonymization of set-valued data. In VLDB, pp. 115–125, 2008.
[5] M. Terrovitis, N. Mamoulis, and P. Kalnis. Local and global recoding methods for anonymizing set-valued data. VLDBJ, 20(1):83–106, 2011.
[6] Y. Xu, B. C. M. Fung, K. Wang, A. W. C. Fu, and J. Pei. Publishing sensitive transactions for itemset utility. In ICDM, pp. 1109–1114, 2008.
[7] Y. Xu, K. Wang, A. W. C. Fu, and P. S. Yu. Anonymizing transaction databases for publication. In SIGKDD, pp. 767–775, 2008.
References
[8] S. R. Ganta, S. P. Kasiviswanathan, and A. Smith. Composition attacks and auxiliary information in data privacy. In SIGKDD, pp. 265-273, 2008.
[9] D. Kifer. Attacks on privacy and deFinetti’s theorem. In SIGMOD, pp. 127–138, 2009.
[10] R. C. W. Wong, A. Fu, K. Wang, P. S. Yu, and J. Pei. Can the utility of anonymized data be used for privacy breaches? ACM Transactions on Knowledge Discovery from Data, to appear.
[11] D. Kifer and A. Machanavajjhala. No free lunch in data privacy. In SIGMOD, 2011.
[12] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pp. 265–284, 2006.
[13] F. McSherry and K. Talwar. Mechanism design via differential privacy. In FOCS, pp. 94–103, 2007.
[14] F. McSherry. Privacy integrated queries: An extensible platform for privacy-preserving data analysis. In SIGMOD, pp. 19–30, 2009.
[15] A. Blum, K. Ligett, and A. Roth. A learning theory approach to non-interactive database privacy. In STOC, pp. 609–618, 2008.
References
[16] G. Cormode, M. Procopiuc, D. Srivastava, and T. T. L. Tran. Differentially private publication of sparse data. CoRR, 2011.
[17] N. Mohammed, R. Chen, B. C. M. Fung, and P. S. Yu. Differentially private data release for data mining. In SIGKDD, 2011.
[18] R. Chen, B. C. M. Fung, and B. C. Desai. Differentially private trajectory data publication. ICDE, under review, 2012.
Lower Bound Results
• In the interactive setting, only a limited number of queries can be answered; otherwise, an adversary could reconstruct almost the entire original database with high precision.
• In the non-interactive setting, one can only guarantee the utility of restricted classes of queries.
Threshold Selection
• We design the threshold as a function of the standard deviation of the noise and the height of a partition's hierarchy cut:
θ = √2 · C · height(p.cut) / p.ε̃, where p.ε̃ is the privacy budget assigned to partition p and C is a constant.
Relative error
• (α, δ)-usefulness is effective for giving an overall estimate of utility, but fails to produce intuitive experimental results.
• We experimentally measure the utility of sanitized data for counting queries by relative error:
relative error = |Q(D*) − Q(D)| / max{Q(D), s},
where s is the sanity bound.
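A one-line sketch of the relative error measure with the sanity bound; the function name is ours:

```python
def relative_error(true_count, noisy_count, sanity_bound):
    """|Q(D*) - Q(D)| / max{Q(D), s}, where s is the sanity bound."""
    return abs(noisy_count - true_count) / max(true_count, sanity_bound)
```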
Experiments
• Average relative error vs. taxonomy tree fan-out
[Figures: average relative error for privacy budgets B = 0.5, 0.75 and 1.0]