

Publishing Set-Valued Data via Differential Privacy

Rui Chen, Concordia University
Noman Mohammed, Concordia University
Benjamin C. M. Fung, Concordia University
Bipin C. Desai, Concordia University
Li Xiong, Emory University

VLDB 2011

1

Outline

► Introduction
• Preliminaries
• Sanitization algorithm
• Experimental results
• Conclusions

2

Introduction

• The problem: non-interactive set-valued data publication under differential privacy

• Typical set-valued data: transaction data, web search queries

3

Introduction

• Set-valued data refers to data in which each record owner is associated with a set of items drawn from an item universe.

TID   Items
t1    I1, I2, I3, I4
t2    I2, I4
t3    I2
t4    I1, I2
t5    I2
t6    I1
t7    I1, I2, I3, I4
t8    I2, I3, I4

4

Introduction

• Existing works [1, 2, 3, 4, 5, 6, 7] on publishing set-valued data are based on partition-based privacy models [8].

• They provide insufficient privacy protection.

– Composition attack [8]

– deFinetti attack [9]

– Foreground knowledge attack [10]

• They are vulnerable to background knowledge.

5

Introduction

• Differential privacy is independent of an adversary's background knowledge and computational power (with exceptions [11]).

• The outcome of any analysis should not overly depend on a single data record.

• Existing differentially private data publishing approaches are not adequate in terms of both utility and scalability for our problem.

6

Introduction

• Problems of data-independent publishing approaches:

Universe I = {I1, I2, I3}

All possible (non-empty) itemsets over I: {I1}, {I2}, {I3}, {I1, I2}, {I1, I3}, {I2, I3}, {I1, I2, I3}

• Scalability: O(2^|I|)

• Utility: noise accumulates exponentially

7

Outline

• Introduction
► Preliminaries
• Sanitization algorithm
• Experimental results
• Conclusions

8

Preliminaries

• Context-free taxonomy tree

• Each internal node represents the set of its leaf items, not necessarily a semantic generalization

9

Preliminaries

• Differential privacy [12]

• D and D' are neighbors if they differ in at most one record.

A non-interactive privacy mechanism A gives ε-differential privacy if, for all neighbors D and D' and for any possible sanitized database D* ∈ Range(A),

Pr_A[A(D) = D*] ≤ exp(ε) × Pr_A[A(D') = D*]

10

Preliminaries

• Laplace mechanism [12]

Global sensitivity: GS(f) = max over all neighbors D, D' of |f(D) − f(D')|. Returning f(D) + Laplace(GS(f)/ε) gives ε-differential privacy.

For example, for a single counting query Q over a dataset D, GS(Q) = 1, so returning Q(D) + Laplace(1/ε) gives ε-differential privacy.

11
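To make this concrete, here is a minimal Python sketch of a Laplace-noised counting query; the dataset, predicate, and ε value are hypothetical placeholders:

```python
import numpy as np

def noisy_count(dataset, predicate, epsilon):
    # A counting query has global sensitivity 1: adding or removing one record
    # changes the count by at most 1, so Laplace(1/epsilon) noise suffices
    # for epsilon-differential privacy.
    true_count = sum(1 for record in dataset if predicate(record))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Example: how many records contain both I1 and I2?
D = [{"I1", "I2", "I3", "I4"}, {"I2", "I4"}, {"I2"}, {"I1", "I2"}]
print(noisy_count(D, lambda t: {"I1", "I2"} <= t, epsilon=1.0))
```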

Preliminaries

• Exponential mechanism [13]

Given a utility function q : (D × R) → R for a database instance D, the mechanism

A(D, q) = { return r with probability ∝ exp(ε × q(D, r) / (2Δq)) }

gives ε-differential privacy, where Δq is the sensitivity of q.

12
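A minimal Python sketch of the exponential mechanism over a finite candidate set; the candidates, utility function, and parameter values below are hypothetical:

```python
import numpy as np

def exponential_mechanism(dataset, candidates, quality, epsilon, sensitivity):
    # Return candidate r with probability proportional to
    # exp(epsilon * q(D, r) / (2 * sensitivity of q)).
    scores = np.array([quality(dataset, r) for r in candidates], dtype=float)
    weights = np.exp(epsilon * scores / (2.0 * sensitivity))
    probs = weights / weights.sum()
    return candidates[np.random.choice(len(candidates), p=probs)]

# Example: privately pick a frequent single item (item frequency has sensitivity 1).
D = [{"I1", "I2"}, {"I2"}, {"I2", "I3"}, {"I1"}]
items = ["I1", "I2", "I3"]
freq = lambda data, item: sum(1 for t in data if item in t)
print(exponential_mechanism(D, items, freq, epsilon=1.0, sensitivity=1.0))
```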

Preliminaries

• Composition properties [14]

– Sequential composition: a sequence of computations over the same data, each giving ε_i–differential privacy, gives (∑_i ε_i)–differential privacy.

– Parallel composition: computations with budgets ε_i applied to disjoint subsets of the data give max(ε_i)–differential privacy.

13
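A tiny worked sketch of how the two properties are used to account for the privacy budget (the ε values are hypothetical):

```python
# Sequential composition: computations over the SAME records add up.
eps_sequence = [0.2, 0.3, 0.5]        # three analyses on the same data
total_sequential = sum(eps_sequence)  # 1.0-differential privacy overall

# Parallel composition: computations over DISJOINT subsets take the maximum.
eps_disjoint = [0.2, 0.3, 0.5]        # one analysis per disjoint partition
total_parallel = max(eps_disjoint)    # 0.5-differential privacy overall

print(total_sequential, total_parallel)
```

Parallel composition is what allows a partitioning algorithm to charge the same budget once across all disjoint sibling partitions at a given step.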

Preliminaries

• Utility metrics

For a given itemset I' ⊆ I, a counting query Q over a dataset D is defined as

Q(D) = |{ t ∈ D : I' ⊆ t }|

A privacy mechanism A is (α, δ)-useful if, with probability at least 1 − δ, for every counting query Q and every dataset D, |Q(D*) − Q(D)| ≤ α, where D* = A(D). [15]

14
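A minimal Python sketch of the counting query Q defined above; the dataset and itemset are hypothetical:

```python
def counting_query(dataset, itemset):
    # Q(D) = |{ t in D : itemset is a subset of t }|
    target = set(itemset)
    return sum(1 for t in dataset if target <= set(t))

D = [{"I1", "I2", "I3", "I4"}, {"I2", "I4"}, {"I2"}, {"I1", "I2"}]
print(counting_query(D, {"I2", "I4"}))  # -> 2
```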

Outline

• Introduction
• Preliminaries
► Sanitization algorithm
• Experimental results
• Conclusions

15

Sanitization Algorithm

• Top-down partitioning

TID   Items
t1    I1, I2, I3, I4
t2    I2, I4
t3    I2
t4    I1, I2
t5    I2
t6    I1
t7    I1, I2, I3, I4
t8    I2, I3, I4

• Generalize all records to a single partition

• Keep partitioning non-empty partitions until leaf partitions are reached

16
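A heavily simplified Python sketch of the top-down idea (not the paper's exact algorithm): all records start generalized to the taxonomy root, each retained partition is refined one taxonomy level at a time, and partitions whose noisy count falls below a threshold are pruned. The ancestor_at helper, the uniform per-level budget split, and the fixed threshold are assumptions for illustration, and the sketch omits the handling of empty sub-partitions (see the sub-partition generation slide), which the full algorithm needs in order to satisfy differential privacy:

```python
import numpy as np

def top_down_sketch(records, ancestor_at, height, epsilon, threshold):
    # records: iterable of frozensets of leaf items.
    # ancestor_at(item, level): hypothetical helper returning the item's taxonomy
    #   ancestor at `level` (at level == height it returns the item itself).
    eps_count = epsilon / 2.0             # reserved for noisy leaf-partition sizes
    eps_split = (epsilon / 2.0) / height  # simplification: uniform budget per level
    partitions = {frozenset({"ROOT"}): list(records)}
    for level in range(1, height + 1):
        refined = {}
        for recs in partitions.values():
            buckets = {}
            for t in recs:                # regroup records by their generalization
                key = frozenset(ancestor_at(item, level) for item in t)
                buckets.setdefault(key, []).append(t)
            for key, sub in buckets.items():
                # Keep only sub-partitions that still look non-empty after noise.
                if len(sub) + np.random.laplace(0.0, 1.0 / eps_split) > threshold:
                    refined[key] = sub
        partitions = refined
    # Release noisy sizes of the surviving leaf partitions.
    return {key: len(sub) + np.random.laplace(0.0, 1.0 / eps_count)
            for key, sub in partitions.items()}
```

In this simplified accounting, each level costs eps_split once across all disjoint sibling partitions by parallel composition, and the levels plus the final counts compose sequentially to the total ε.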

Sanitization Algorithm

• Privacy budget allocation

• We reserve B/2 of the total privacy budget B to obtain noisy sizes of leaf partitions and use the remaining B/2 to guide the partitioning.

• Assign less budget to more general partitions and more budget to more specific partitions.

17

Sanitization Algorithm

• Privacy budget allocation

A hierarchy cut needs at most

∑_{u_i ∈ cut} |InternalNodes(u_i, H)|

partition operations to reach leaf partitions.

Example: the cut {I{1,2}, I{3,4}} needs at most two partition operations to reach leaf partitions.

18
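A small sketch of this bound, assuming the taxonomy H is given as a dict mapping each node to its children (leaves map to empty lists) and InternalNodes(u, H) is read as the internal nodes of the subtree rooted at u:

```python
def internal_nodes(node, children):
    # Number of internal nodes in the subtree rooted at `node`.
    if not children[node]:
        return 0
    return 1 + sum(internal_nodes(c, children) for c in children[node])

def max_partition_operations(cut, children):
    # Upper bound on partition operations needed to reach leaf partitions.
    return sum(internal_nodes(u, children) for u in cut)

# Hypothetical taxonomy over {I1, I2, I3, I4}.
children = {"I{1,2,3,4}": ["I{1,2}", "I{3,4}"],
            "I{1,2}": ["I1", "I2"], "I{3,4}": ["I3", "I4"],
            "I1": [], "I2": [], "I3": [], "I4": []}
print(max_partition_operations(["I{1,2}", "I{3,4}"], children))  # -> 2
```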

Sanitization Algorithm

• Privacy budget allocation

• We reserve B/2 of the total privacy budget B to obtain noisy sizes of leaf partitions and use the remaining B/2 to guide the partitioning.

• Assign less budget to more general partitions and more budget to more specific partitions.

Example:
• B/2 / 3 = B/6
• (B/2 − B/6) / 2 = B/6
• B/6 + B/2 = 2B/3

19
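One way to read the arithmetic above, shown as a hedged Python sketch (an illustration, not the paper's exact adaptive scheme): the guiding half of the budget is divided evenly over the partition operations a partition may still need, and any guiding budget left unused when a partition reaches leaf level early is folded into the reserved half used for its noisy count:

```python
B = 1.0            # total privacy budget
guide = B / 2.0    # half of B guides the partitioning
reserve = B / 2.0  # half of B is reserved for noisy leaf-partition sizes

ops_remaining = 3
eps1 = guide / ops_remaining          # B/2/3 = B/6
guide -= eps1

ops_remaining = 2
eps2 = guide / ops_remaining          # (B/2 - B/6)/2 = B/6
guide -= eps2

# If this partition reaches leaf level after two operations, the unused
# guiding budget is added to the reserved half for its noisy count.
leaf_count_budget = reserve + guide   # B/6 + B/2 = 2B/3
print(eps1, eps2, leaf_count_budget)  # ~0.167 ~0.167 ~0.667
```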

Sanitization Algorithm

• Sub-partition generation

• For a non-leaf partition, we need to consider all possible sub-partitions to satisfy differential privacy.

• Efficient implementation: separately handling empty and non-empty partitions (inspired by [16]).

20

Outline

• Introduction
• Preliminaries
• Sanitization algorithm
► Experimental results
• Conclusions

21

Experiments

• Two real-life set-valued datasets are used.

• MSNBC is publicly available at the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/index.html).

• STM is provided by the Société de transport de Montréal (STM) (http://www.stm.info).

22

Experiments

• Average relative error vs. privacy budget

[Figures: results shown for B = 0.5, B = 0.75, and B = 1.0]

23

Experiments

• Utility for frequent itemset mining

[Figures: results shown for B = 0.5, B = 0.75, and B = 1.0]

24

Experiments

• Scalability: O(|D| × |I|)

[Figures: runtime vs. |D| and runtime vs. |I|]

25

Outline

• Introduction
• Preliminaries
• Sanitization algorithm
• Experimental results
► Conclusions

26

Conclusions

• Differential privacy can be successfully applied to non-interactive set-valued data publishing with guaranteed utility.

• Differential privacy can be achieved by data-dependent solutions with improved efficiency and accuracy.

• The general idea of data-dependent solutions applies to other types of data, for example, relational data [17] and trajectory data [18].

27

References

[1] J. Cao, P. Karras, C. Raissi, and K.-L. Tan. ρ-uncertainty: inference-proof transaction anonymization. In VLDB, pp. 1033–1044, 2010.

[2] G. Ghinita, Y. Tao, and P. Kalnis. On the anonymization of sparse high-dimensional data. In ICDE, pp. 715–724, 2008.

[3] Y. He and J. F. Naughton. Anonymization of set-valued data via top-down, local generalization. In VLDB, pp. 934–945, 2009.

[4] M. Terrovitis, N. Mamoulis, and P. Kalnis. Privacy-preserving anonymization of set-valued data. In VLDB, pp. 115–125, 2008.

[5] M. Terrovitis, N. Mamoulis, and P. Kalnis. Local and global recoding methods for anonymizing set-valued data. VLDBJ, 20(1):83–106, 2011.

[6] Y. Xu, B. C. M. Fung, K. Wang, A. W. C. Fu, and J. Pei. Publishing sensitive transactions for itemset utility. In ICDM, pp. 1109–1114, 2008.

[7] Y. Xu, K. Wang, A. W. C. Fu, and P. S. Yu. Anonymizing transaction databases for publication. In SIGKDD, pp. 767–775, 2008.

28

References

[8] S. R. Ganta, S. P. Kasiviswanathan, and A. Smith. Composition attacks and auxiliary information in data privacy. In SIGKDD, pp. 265-273, 2008.

[9] D. Kifer. Attacks on privacy and deFinetti’s theorem. In SIGMOD, pp. 127–138, 2009.

[10] R. C. W. Wong, A. W. C. Fu, K. Wang, P. S. Yu, and J. Pei. Can the utility of anonymized data be used for privacy breaches? ACM Transactions on Knowledge Discovery from Data, to appear.

[11] D. Kifer and A. Machanavajjhala. No free lunch in data privacy. In SIGMOD, 2011.

[12] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pp. 265–284, 2006.

[13] F. McSherry and K. Talwar. Mechanism design via differential privacy. In FOCS, pp. 94–103, 2007.

[14] F. McSherry. Privacy integrated queries: An extensible platform for privacy-preserving data analysis. In SIGMOD, pp. 19–30, 2009.

[15] A. Blum, K. Ligett, and A. Roth. A learning theory approach to non-interactive database privacy. In STOC, pp. 609–618, 2008.

29

References

[16] G. Cormode, M. Procopiuc, D. Srivastava, and T. T. L. Tran. Differentially private publication of sparse data. CoRR, 2011.

[17] N. Mohammed, R. Chen, B. C. M. Fung, and P. S. Yu. Differentially private data release for data mining. In SIGKDD, 2011.

[18] R. Chen, B. C. M. Fung, and B. C. Desai. Differentially private trajectory data publication. ICDE, under review, 2012.

30

Thank you!

Q & A

31

Backup Slides

32

Lower Bound Results

• In the interactive setting, only a limited number of queries can be answered; otherwise, an adversary would be able to precisely reconstruct almost the entire original database.

• In the non-interactive setting, one can only guarantee the utility of restricted classes of queries.

33


Threshold Selection

• We design the threshold as a function of the standard deviation of the Laplace noise and the height of a partition's hierarchy cut:

threshold = √2 × C × height(p.cut) / p.ε̃

where p.ε̃ is the privacy budget available to partition p and C is a constant.

36

Relative error

• (α, δ)-usefulness gives an overall estimate of utility, but it does not translate into intuitive experimental results.

• We therefore experimentally measure the utility of sanitized data for counting queries by the relative error:

relative error = |Q(D*) − Q(D)| / max{ Q(D), s }

where s is a sanity bound.

37
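A minimal Python sketch of this error measure; the numbers and the sanity bound are hypothetical:

```python
def relative_error(noisy_answer, true_answer, sanity_bound):
    # |Q(D*) - Q(D)| / max{Q(D), s}: the sanity bound s prevents queries with
    # very small true counts from dominating the measured error.
    return abs(noisy_answer - true_answer) / max(true_answer, sanity_bound)

print(relative_error(noisy_answer=48.0, true_answer=50.0, sanity_bound=5.0))  # 0.04
```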

Experiments

• Average relative error vs. taxonomy tree fan-out

[Figures: results shown for B = 0.5, B = 0.75, and B = 1.0]

38
