Slides - Data Mining and Security Lab @ McGill

Differentially Private Transit Data Publication: A Case Study on the Montreal Transportation System
Rui Chen, Concordia University
Benjamin C. M. Fung, Concordia University
Bipin C. Desai, Concordia University
Nériah M. Sossou, Société de transport de Montréal
Outline

Introduction

Related Work

Preliminaries

Sanitization Algorithm

Experimental Results

Conclusion
The STM Story

The Société de transport de Montréal (STM) is the public transit agency in the Montreal area.

The smart card automated fare collection system generates and collects a huge volume of transit data every day.

Transit data needs to be shared for many reasons.
Transit Data

Transit data, a kind of sequential data, consists of sequences
of time-ordered locations.
A station in the STM network
Privacy Threats
Alice visited L4 and then L1
Alice also visited L2 …
Differential Privacy [1]
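A mechanism M gives ε-differential privacy if, for any two datasets D and D' differing in at most one record and any possible output D*:
Pr[M(D) = D*] ≤ exp(ε) × Pr[M(D') = D*]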
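As a minimal illustration of this inequality (not part of the original slides), the sketch below checks it exactly for randomized response on a single bit. The choice of mechanism and the function names are assumptions made for illustration only.

```python
import math

def randomized_response_probs(true_bit: int, epsilon: float) -> dict:
    """Output distribution of randomized response on one bit.

    The true bit is reported with probability e^eps / (1 + e^eps)
    and flipped otherwise.
    """
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return {true_bit: p_truth, 1 - true_bit: 1.0 - p_truth}

def worst_case_ratio(epsilon: float) -> float:
    """Largest ratio Pr[M(D) = o] / Pr[M(D') = o] over both outputs o,
    where D and D' are the two neighbouring one-bit datasets."""
    probs_d = randomized_response_probs(0, epsilon)
    probs_d_prime = randomized_response_probs(1, epsilon)
    return max(probs_d[o] / probs_d_prime[o] for o in (0, 1))

epsilon = 1.0
ratio = worst_case_ratio(epsilon)
# The ratio should not exceed exp(epsilon), up to floating-point error.
print(ratio, math.exp(epsilon))
```

For this toy mechanism the bound is tight: the worst-case ratio equals exp(ε).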
Technical Challenges

Suppose there are 1,000 stations in the STM network

Suppose the maximum number of stations visited by a
passenger is 20

Traditional differentially private mechanisms are data-independent:
$\sum_{i=1}^{20} 1000^i \approx 10^{60}$ sequences to consider!
Computationally infeasible
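The size of this output domain can be checked with a few lines of arithmetic. The sketch below simply counts all non-empty sequences of up to 20 locations drawn from 1,000 stations; the function name is hypothetical.

```python
def count_candidate_sequences(num_stations: int, max_length: int) -> int:
    """Number of non-empty location sequences of length 1..max_length."""
    return sum(num_stations ** i for i in range(1, max_length + 1))

total = count_candidate_sequences(1000, 20)
# Python integers are arbitrary precision, so the count is exact.
print(len(str(total)), "digits")
```

The total is slightly above 10^60, which is why a data-independent mechanism that considers every possible sequence is impractical here.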
Contributions

The first practical solution for publishing real-life sequential
data under differential privacy

A study of the real-life transit data sharing scenario at
the STM

The use of a hybrid-granularity prefix tree for data-dependent publication and an efficient implementation based on a statistical process

Enforcement of two sets of consistency constraints

Seamless extension to trajectory data
Related Work

Abul et al. [2] achieve (k, δ)-anonymity by space translation.

Terrovitis and Mamoulis [3] limit an adversary’s confidence in inferring the presence of a location by global suppression.

Yarovoy et al. [4] k-anonymize a moving object database (MOD) by considering timestamps as the quasi-identifiers.

Chen et al. [5] achieve the (K, C)_L-privacy model by local suppression.
Is it possible to employ a much stronger privacy model
while achieving desirable utility?
Laplace Mechanism [1]
Noise is drawn from the Laplace distribution Lap(Δf/ε), whose density peaks at ε/(2 Δf).
ε: privacy parameter (privacy budget)
Δf: global sensitivity (the maximum change of f due to the change of a single record).
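A minimal sketch of the Laplace mechanism follows, assuming a numeric query answer, a known global sensitivity, and noise with scale Δf/ε. The function names are hypothetical, and the sampler uses only the standard library (the difference of two exponential variables is Laplace-distributed).

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample from Lap(scale) as the difference of two independent
    exponential variables, each with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(true_value: float, sensitivity: float,
                      epsilon: float, rng: random.Random) -> float:
    """Return true_value plus noise drawn from Lap(sensitivity / epsilon)."""
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    return true_value + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
noisy_count = laplace_mechanism(true_value=120, sensitivity=1.0,
                                epsilon=0.5, rng=rng)
print(noisy_count)
```

A smaller ε or a larger Δf widens the noise distribution, trading utility for privacy.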
Composition Properties
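Sequential composition: mechanisms that each give ε_i-differential privacy, applied to the same data, together give (∑_i ε_i)-differential privacy.
Parallel composition: mechanisms that each give ε_i-differential privacy, applied to disjoint subsets of the data, together give max(ε_i)-differential privacy.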
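The two composition properties amount to simple budget accounting, sketched below with hypothetical helper names.

```python
def sequential_budget(epsilons):
    """Total privacy cost when every mechanism sees the same data."""
    return sum(epsilons)

def parallel_budget(epsilons):
    """Total privacy cost when each mechanism sees a disjoint subset."""
    return max(epsilons)

budgets = [0.25, 0.25, 0.5]
print(sequential_budget(budgets))  # 1.0
print(parallel_budget(budgets))    # 0.5
```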
Prefix Tree
A simple but effective way to explore the entire output domain
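As a rough sketch of the idea (without any noise, and not the authors' implementation), a prefix tree over location sequences can be built as follows. Each node stores how many sequences share the prefix that leads to it. The class name, function name, and toy sequences are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class PrefixNode:
    count: int = 0                                 # sequences with this prefix
    children: dict = field(default_factory=dict)   # location -> PrefixNode

def build_prefix_tree(sequences):
    """Build a prefix tree whose node counts are prefix frequencies."""
    root = PrefixNode(count=len(sequences))
    for seq in sequences:
        node = root
        for location in seq:
            node = node.children.setdefault(location, PrefixNode())
            node.count += 1
    return root

sequences = [["L1", "L2"], ["L1", "L2", "L3"], ["L1", "L3"], ["L2"]]
tree = build_prefix_tree(sequences)
print(tree.children["L1"].count)                 # 3
print(tree.children["L1"].children["L2"].count)  # 2
```

Only prefixes that actually occur in the data get a node, which is what makes the exploration data-dependent.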
Utility Requirements

Count query:
E.g., how many passengers have visited both Guy-Concordia
and McGill stations?

Frequent sequential pattern mining:
E.g., what are the most popular sequences of stations being
visited?
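The count query above can be sketched as a scan over the sequences. The trips below are made-up toy data, and the function name is hypothetical.

```python
def count_query(sequences, required_stations):
    """Number of sequences that visit every station in required_stations."""
    required = set(required_stations)
    return sum(1 for seq in sequences if required.issubset(seq))

# Toy data for illustration only, not taken from the STM datasets.
trips = [
    ["Guy-Concordia", "Peel", "McGill"],
    ["McGill", "Berri-UQAM"],
    ["Guy-Concordia", "McGill"],
    ["Peel"],
]
print(count_query(trips, ["Guy-Concordia", "McGill"]))  # 2
```

On sanitized data, the same query would be run against the published sequences and compared with the answer on the raw data.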
Sanitization Algorithm
Complexity:
Noisy Prefix Tree

Each level consists of two sub-levels with
different location granularities

Each level receives ε/h privacy budget
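The per-level budget can be illustrated with a small sketch. It assumes that a sequence contributes to at most one node per level, so the count sensitivity is 1 and Laplace noise with scale h/ε is added to each node count. It ignores the split into two sub-levels, which the slide does not detail. All names are hypothetical.

```python
import random

def level_budget(epsilon: float, height: int) -> float:
    """Privacy budget assigned to each of the h levels."""
    return epsilon / height

def noisy_level_counts(true_counts: dict, epsilon: float, height: int,
                       rng: random.Random) -> dict:
    """Add Laplace noise to the node counts of one level.

    Assumption: sensitivity 1, so the noise scale is 1 / (epsilon / h).
    """
    scale = 1.0 / level_budget(epsilon, height)
    return {
        node: count + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        for node, count in true_counts.items()
    }

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
print(noisy_level_counts({"L1": 3, "L2": 1}, epsilon=1.0, height=10, rng=rng))
```

Because levels are processed on the same data, their budgets add up by sequential composition to ε in total.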
Efficient Implementation

Separately handle empty and non-empty nodes
Hybrid-Granularity
Or
For an empty node on level i, we reduce noise by a factor of
Consistency Constraints

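For any root-to-leaf path p, the count of v_i is at most the count of v_{i+1}, where v_i is a child of v_{i+1}.

For each node v, the count of v is at least the sum of the counts of its children.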
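A small checker can make the two kinds of constraint concrete. It assumes they read as follows: a child's count never exceeds its parent's count, and the counts of a node's children never sum to more than the node's own count. The dictionary-based tree layout and the function name are hypothetical, and the sketch only tests consistency; it does not enforce it.

```python
def is_consistent(node: dict) -> bool:
    """Check both assumed constraints on a tree of the form
    {"count": number, "children": [child nodes]}."""
    children = node["children"]
    # Assumed constraint 1: no child count exceeds the parent count.
    if any(child["count"] > node["count"] for child in children):
        return False
    # Assumed constraint 2: children counts sum to at most the parent count.
    if sum(child["count"] for child in children) > node["count"]:
        return False
    return all(is_consistent(child) for child in children)

leaf_a = {"count": 6, "children": []}
leaf_b = {"count": 3, "children": []}
print(is_consistent({"count": 10, "children": [leaf_a, leaf_b]}))  # True
print(is_consistent({"count": 8, "children": [leaf_a, leaf_b]}))   # False
```

Noisy counts can violate either condition, which is why a post-processing step is needed after noise is added.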
Consistency Enforcement

Constraint Type Ⅰ [6]

Constraint Type Ⅱ
STM Datasets

Real-life STM datasets are used for evaluation:
Dataset   |D|       |L|   max|S|   avg|S|
Metro     847,668   68    90       4.21
Bus       778,724   944   121      5.67
[Figure: average relative error vs. ε]
[Figure: average relative error vs. h]
Utility vs. k
(M/B: Metro/Bus)
k                         100       150       200       250       300
TP (M/B), Simple          99/97     143/139   178/168   209/195   241/212
FP (FD) (M/B), Simple     1/3       7/11      22/32     41/55     59/88
TP (M/B), Hybrid          100/100   149/144   185/177   220/209   257/233
FP (FD) (M/B), Hybrid     0/0       1/6       15/23     30/41     43/67
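The TP and FP (FD) figures for top-k frequent pattern mining can be computed along the lines sketched below. The reading is an assumption: TP counts patterns that appear in both the true and the sanitized top-k lists, FP counts sanitized patterns that are not truly in the top k, and FD counts true top-k patterns that were dropped. When both lists contain k distinct patterns, FP and FD are both k minus TP, which is consistent with the table. The function name and toy patterns are hypothetical.

```python
def top_k_utility(true_top_k, sanitized_top_k):
    """Return (TP, FP, FD) for two top-k lists of sequential patterns."""
    true_set = set(true_top_k)
    sanitized_set = set(sanitized_top_k)
    tp = len(true_set & sanitized_set)   # true positives
    fp = len(sanitized_set - true_set)   # false positives
    fd = len(true_set - sanitized_set)   # false drops
    return tp, fp, fd

# Toy patterns (tuples of locations), for illustration only.
true_patterns = [("L1",), ("L2",), ("L1", "L2"), ("L3",)]
sanitized_patterns = [("L1",), ("L2",), ("L1", "L2"), ("L2", "L3")]
print(top_k_utility(true_patterns, sanitized_patterns))  # (3, 1, 1)
```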
Utility vs. ε
(M/B: Metro/Bus)
ε                         0.5       0.75      1.0       1.25      1.5
TP (M/B), Simple          227/194   239/206   241/212   243/216   248/224
FP (FD) (M/B), Simple     73/106    61/94     59/88     57/84     52/76
TP (M/B), Hybrid          244/215   253/224   257/233   259/238   261/242
FP (FD) (M/B), Hybrid     56/85     47/76     43/67     41/62     39/58
Utility vs. h
(M/B: Metro/Bus)
h                         6         8         10        12        14        16        18        20
TP (M/B), Simple          234/212   240/217   241/215   241/212   241/212   240/210   240/209   238/206
FP (FD) (M/B), Simple     66/88     60/83     58/85     59/88     59/88     60/90     60/91     62/94
TP (M/B), Hybrid          241/221   254/232   255/236   257/233   258/233   258/231   255/230   254/228
FP (FD) (M/B), Hybrid     59/79     46/68     45/64     43/67     42/67     42/69     45/70     46/72
Scalability
Conclusion

It is possible to publish useful transit data (sequential data)
under differential privacy.

Generally, a data-dependent solution outperforms a data-independent solution.

It is worth exploring the utility of released data for more
complex data analysis tasks.

It is important to educate transport service practitioners.
References

[1] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, 2006.
[2] O. Abul, F. Bonchi, and M. Nanni. Never walk alone: Uncertainty for anonymity in moving objects databases. In ICDE, 2008.
[3] M. Terrovitis and N. Mamoulis. Privacy preservation in the publication of trajectories. In MDM, 2008.
[4] R. Yarovoy, F. Bonchi, L. V. S. Lakshmanan, and W. H. Wang. Anonymizing moving objects: How to hide a MOB in a crowd? In EDBT, 2009.
[5] R. Chen, B. C. M. Fung, N. Mohammed, and B. C. Desai. Privacy-preserving trajectory data publishing by local suppression. Information Sciences, in press.
[6] M. Hay, V. Rastogi, G. Miklau, and D. Suciu. Boosting the accuracy of differentially private histograms through consistency. PVLDB, 2010.
Thank You Very Much
Q&A
Back-up Slides
Detailed Algorithm