poster - Data Mining and Security Lab @ McGill

Differentially Private Transit Data Publication: A Case
Study on the Montreal Transportation System
Rui Chen, Concordia University, ru_che@encs.concordia.ca
Benjamin C. M. Fung, Concordia University, fung@ciise.concordia.ca
Bipin C. Desai, Concordia University, BipinC.Desai@concordia.ca
Nériah M. Sossou, Société de transport de Montréal, Neriah.Sossou@stm.info
Introduction
With the deployment of smart card automated fare
collection systems, the Société de transport de
Montréal (STM, http://www.stm.info), the public transit
agency in the Montreal area, has been benefiting from the
huge volume of transit data, a kind of sequential data,
collected every day.
Transit data needs to be shared for diverse reasons,
such as profit sharing, business operations, and
marketing analysis. However, publishing raw transit
data may violate passengers’ privacy.
Table 1. Sample transit data. Li is a station in the transport system.
Rec. #   Path
t1       L1 → L2 → L3
t2       L1 → L2
t3       L3 → L2 → L1
t4       L1 → L2 → L4
t5       L1 → L2 → L3
t6       L3 → L2
t7       L1 → L2 → L4 → L1
t8       L3 → L1
Experimental Evaluation
We employ two real-life STM transit datasets: Metro
contains 847,668 records over 68 metro stations; Bus
contains 778,724 records over 944 bus stations.
Sanitization Algorithm
We develop a practical sequential data publishing
solution under differential privacy, supporting several
key data analysis tasks. Our solution is composed of
two steps: noisy prefix tree construction and private
release generation.
For count queries, our solution achieves small relative
errors (Figure 2); for frequent sequential pattern mining,
our solution maintains high true positives (TP) and low
false positives (FP) / false drops (FD) (Table 2).
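The TP, FP, and FD measures can be sketched as set comparisons between the top-k patterns mined from the original data and from the sanitized release. The definitions below are the usual ones and are an assumption here, since the poster does not spell them out; the function name `topk_utility` and the example patterns are hypothetical. Because both collections hold exactly k patterns, FP and FD coincide under this reading, which matches Table 2 reporting them as one number with TP + FP = k.

```python
def topk_utility(true_topk, noisy_topk):
    """Compare two top-k pattern collections (each pattern is a tuple of stations).

    Assumed definitions (not stated on the poster):
      TP: patterns in both collections
      FP: patterns found only on the sanitized data
      FD: patterns found only on the original data
    """
    true_set, noisy_set = set(true_topk), set(noisy_topk)
    tp = len(true_set & noisy_set)
    fp = len(noisy_set - true_set)
    fd = len(true_set - noisy_set)
    return tp, fp, fd


# Hypothetical top-3 patterns, for illustration only.
true_top3 = [("L1", "L2"), ("L2",), ("L1",)]
noisy_top3 = [("L1", "L2"), ("L2",), ("L3",)]
print(topk_utility(true_top3, noisy_top3))  # (2, 1, 1)
```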
Table 2. Utility for frequent sequential pattern mining on the top-k most
frequent patterns (k = 100, 150, 200, 250, 300)
Noisy prefix tree construction:
We represent a sequential dataset by a prefix tree and
adaptively identify prefixes with sufficiently large
supports. To reduce added noise, we introduce a
hybrid-granularity structure, which decides whether to
expand a prefix by first consulting the supports of more
general locations. To improve the scalability of our
solution, we design a statistical process under the Laplace
mechanism. An illustration is given in Figure 1.
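As a rough illustration of the basic idea only, the sketch below grows a prefix tree level by level over the Table 1 records, adds Laplace noise to each candidate prefix count, and expands a prefix only when its noisy count reaches a threshold. It is not the authors' algorithm: the hybrid-granularity structure and the statistical process are omitted, and the even budget split across levels, the fixed `threshold`, and all function and parameter names are assumptions made for this example.

```python
import random


def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_prefix_tree(sequences, universe, epsilon, height, threshold, seed=0):
    """Return {prefix: noisy count} for prefixes whose noisy count >= threshold.

    Assumptions (not from the poster): the privacy budget epsilon is split
    evenly over `height` levels, and `threshold` is a fixed cutoff.
    """
    rng = random.Random(seed)
    scale = height / epsilon      # Laplace scale for a per-level budget epsilon/height
    tree = {}
    frontier = [()]               # prefixes to expand at the current level
    for _ in range(height):
        next_frontier = []
        for prefix in frontier:
            n = len(prefix)
            # Every location is a candidate child, even with zero support,
            # so the set of noised counts does not depend on the data.
            for loc in universe:
                child = prefix + (loc,)
                true_count = sum(1 for s in sequences if tuple(s[:n + 1]) == child)
                noisy = true_count + laplace(scale, rng)
                if noisy >= threshold:
                    tree[child] = noisy
                    next_frontier.append(child)
        frontier = next_frontier
    return tree


# Records t1..t8 from Table 1.
table1 = [
    ["L1", "L2", "L3"], ["L1", "L2"], ["L3", "L2", "L1"], ["L1", "L2", "L4"],
    ["L1", "L2", "L3"], ["L3", "L2"], ["L1", "L2", "L4", "L1"], ["L3", "L1"],
]
tree = noisy_prefix_tree(table1, ["L1", "L2", "L3", "L4"],
                         epsilon=1.0, height=3, threshold=2.0)
for prefix, count in sorted(tree.items()):
    print(" → ".join(prefix), round(count, 2))
```

Looping over every location for every expanded prefix is what makes this naive version expensive on a universe of hundreds of stations, which is the scalability concern the statistical process is meant to address.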
Figure 1. Hybrid-granularity prefix tree for sanitization.
Figure 2. Average relative error vs. privacy budget under different
query sizes |Q|. Panels: (a) max|Q| = 3; (b) max|Q| = 6;
(c) max|Q| = 9; (d) max|Q| = 12.
k     TP (M/B)   FP (FD) (M/B)   TP (M/B)   FP (FD) (M/B)
      Simple     Simple          Hybrid     Hybrid
100   99/97      1/3             100/100    0/0
150   143/139    7/11            149/144    1/6
200   178/168    22/32           185/177    15/23
250   209/195    41/55           220/209    30/41
300   241/212    59/88           257/233    43/67
Our solution is efficient, with a runtime complexity of
O(|D|∙|L|), where |D| is the dataset size and |L| is the
universe size (i.e., the total number of stations).
Figure 3. Runtime vs. |D| and |L|
Private release generation:
We design mechanisms to remove inconsistencies in a
noisy prefix tree in order to improve data utility. More
specifically, we enforce two sets of consistency
constraints in a prefix tree:
1) The noisy count of a node is less than or equal to
that of its parent;
2) The noisy count of a node is greater than or equal to
the sum of the noisy counts of its children.
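One simple way to satisfy both constraints is sketched below. The poster does not describe the actual mechanisms, so the top-down proportional rescaling, the clamping of negative counts to zero, the function name `enforce_consistency`, and the example counts are all assumptions for illustration. With non-negative counts, enforcing constraint 2 at every node also gives constraint 1.

```python
def enforce_consistency(tree):
    """Return a copy of {prefix: noisy count} that satisfies both constraints.

    Illustrative approach only: clamp negative counts to zero, then walk the
    tree top-down and shrink the children of a node proportionally whenever
    their total exceeds the node's own count.
    """
    fixed = {prefix: max(count, 0.0) for prefix, count in tree.items()}
    # Shorter prefixes first, so a parent is final before its children are used.
    for parent in sorted(fixed, key=len):
        children = [p for p in fixed
                    if len(p) == len(parent) + 1 and p[:-1] == parent]
        total = sum(fixed[c] for c in children)
        if total > fixed[parent]:
            factor = fixed[parent] / total
            for c in children:
                fixed[c] *= factor
    return fixed


# Hypothetical noisy counts, for illustration only.
noisy = {
    ("L1",): 5.2,
    ("L1", "L2"): 5.9,
    ("L1", "L3"): -0.4,
    ("L1", "L2", "L3"): 2.1,
    ("L1", "L2", "L4"): 4.0,
}
fixed = enforce_consistency(noisy)
for prefix, count in sorted(fixed.items()):
    print(" → ".join(prefix), round(count, 2))
```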
Conclusion
We present a practical solution for sanitizing large-scale sequential data. Our solution has been tested on
real-life STM transit data and exhibits satisfactory
effectiveness and efficiency. We believe that our
solution could benefit many other sectors that are
facing the dilemma between the demands of
sequential data publishing and privacy protection.
Paper ID: 77