Differentially Private Transit Data Publication: A Case Study on the Montreal Transportation System

Rui Chen, Concordia University
Benjamin C. M. Fung, Concordia University
Bipin C. Desai, Concordia University
Nériah M. Sossou, Société de transport de Montréal

Outline
- Introduction
- Related Work
- Preliminaries
- Sanitization Algorithm
- Experimental Results
- Conclusion

The STM Story
- The Société de transport de Montréal (STM) is the public transit agency in the Montreal area.
- Its smart card automated fare collection system generates and collects a huge volume of transit data every day.
- Transit data needs to be shared for many reasons.

Transit Data
- Transit data, a kind of sequential data, consists of sequences of time-ordered locations.
- [Figure: sample transit data; callout: "A station in the STM network"]

Privacy Threats
- Alice visited L4 and then L1
- Alice also visited L2 …

Differential Privacy [1]
- A mechanism M is ε-differentially private if, for any two datasets D and D’ differing in at most one record and any possible output D*:
  $\Pr[M(D) = D^*] \le \exp(\varepsilon) \times \Pr[M(D') = D^*]$

Technical Challenges
- Suppose there are 1,000 stations in the STM network.
- Suppose the maximum number of stations visited by a passenger is 20.
- Traditional differentially private mechanisms are data-independent: $\sum_{i=1}^{20} 1000^i \approx 10^{60}$ sequences to consider!
- Computationally infeasible.

Contributions
- The first practical solution for publishing real-life sequential data under differential privacy
- A study of the real-life transit data sharing scenario at the STM
- The use of a hybrid-granularity prefix tree for data-dependent publication, and an efficient implementation based on a statistical process
- Enforcement of two sets of consistency constraints
- Seamless extension to trajectory data

Related Work
- Abul et al. [2] achieve (k, δ)-anonymity by space translation.
- Terrovitis and Mamoulis [3] limit an adversary’s confidence in inferring the presence of a location by global suppression.
- Yarovoy et al.
  [4] k-anonymize a moving object database (MOD) by treating timestamps as the quasi-identifiers.
- Chen et al. [5] achieve the (K, C)_L-privacy model by local suppression.
- Is it possible to employ a much stronger privacy model while achieving desirable utility?

Laplace Mechanism [1]
- The mechanism releases f(D) plus noise drawn from a Laplace distribution with scale Δf/ε, whose density is $p(x) = \frac{\varepsilon}{2\Delta f} \exp\left(-\frac{|x|\,\varepsilon}{\Delta f}\right)$.
- ε: privacy parameter (privacy budget)
- Δf: global sensitivity (i.e., the maximum change of f due to the change of a single record)

Composition Properties
- Sequential composition: mechanisms applied to the same data, each ε_i-differentially private, together give (∑_i ε_i)-differential privacy.
- Parallel composition: mechanisms applied to disjoint subsets of the data together give max(ε_i)-differential privacy.

Prefix Tree
- A simple but effective way to explore the entire output domain

Utility Requirements
- Count query: e.g., how many passengers have visited both Guy-Concordia and McGill stations?
- Frequent sequential pattern mining: e.g., what are the most popular sequences of stations visited?

Sanitization Algorithm
- Complexity: [formula missing in source]

Noisy Prefix Tree
- Each level consists of two sub-levels with different location granularities.
- Each level receives ε/h of the privacy budget, where h is the height of the tree.

Efficient Implementation
- Separately handle empty and non-empty nodes.

Hybrid-Granularity
- For an empty node on level i, we reduce noise by a factor of [factor missing in source].

Consistency Constraints
- For any root-to-leaf path p, [constraint missing in source], where v_i is a child of v_{i+1}.
- For each node v, [constraint missing in source].

Consistency Enforcement
- Constraint Type Ⅰ [6]
- Constraint Type Ⅱ

STM Datasets
Real-life STM datasets are used for evaluation:

Dataset   |D|       |L|   max|S|   avg|S|
Metro     847,668   68    90       4.21
Bus       778,724   944   121      5.67

Average Relative Error vs. ε
- [Figure: average relative error vs. ε (two slides)]

Average Relative Error vs. h
- [Figure: average relative error vs. h (two slides)]

Utility vs. k
(M/B: Metro/Bus; TP: true positives; FP (FD): false positives, i.e., false drops)

k     TP Simple (M/B)   FP (FD) Simple (M/B)   TP Hybrid (M/B)   FP (FD) Hybrid (M/B)
100   99/97             1/3                    100/100           0/0
150   143/139           7/11                   149/144           1/6
200   178/168           22/32                  185/177           15/23
250   209/195           41/55                  220/209           30/41
300   241/212           59/88                  257/233           43/67

Utility vs. ε

ε      TP Simple (M/B)   FP (FD) Simple (M/B)   TP Hybrid (M/B)   FP (FD) Hybrid (M/B)
0.5    227/194           73/106                 244/215           56/85
0.75   239/206           61/94                  253/224           47/76
1.0    241/212           59/88                  257/233           43/67
1.25   243/216           57/84                  259/238           41/62
1.5    248/224           52/76                  261/242           39/58

Utility vs. h

h    TP Simple (M/B)   FP (FD) Simple (M/B)   TP Hybrid (M/B)   FP (FD) Hybrid (M/B)
6    234/212           66/88                  241/221           59/79
8    240/217           60/83                  254/232           46/68
10   241/215           58/85                  255/236           45/64
12   241/212           59/88                  257/233           43/67
14   241/212           59/88                  258/233           42/67
16   240/210           60/90                  258/231           42/69
18   240/209           60/91                  255/230           45/70
20   238/206           62/94                  254/228           46/72

Scalability
- [Figure: scalability results]

Conclusion
- It is possible to publish useful transit data (sequential data) under differential privacy.
- Generally, a data-dependent solution outperforms a data-independent solution.
- It is worth exploring the utility of released data for more complex data analysis tasks.
- It is important to educate transport service practitioners.

References
[1] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, 2006.
[2] O. Abul, F. Bonchi, and M. Nanni. Never walk alone: Uncertainty for anonymity in moving objects databases. In ICDE, 2008.
[3] M. Terrovitis and N. Mamoulis. Privacy preservation in the publication of trajectories. In MDM, 2008.
[4] R. Yarovoy, F. Bonchi, L. V. S. Lakshmanan, and W. H. Wang. Anonymizing moving objects: How to hide a MOB in a crowd? In EDBT, 2009.
[5] R. Chen, B. C. M. Fung, N. Mohammed, and B. C. Desai. Privacy-preserving trajectory data publishing by local suppression. Information Sciences, in press.
[6] M. Hay, V. Rastogi, G.
    Miklau, and D. Suciu. Boosting the accuracy of differentially private histograms through consistency. PVLDB, 2010.

Thank You Very Much
Q&A

Back-up Slides

Detailed Algorithm
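As a rough illustration of the noisy prefix tree idea from the Sanitization Algorithm slides, here is a minimal Python sketch. It is a simplified variant written for this summary, not the authors' algorithm: it uses a single location granularity (no hybrid sub-levels), adds noise to every candidate child instead of handling empty nodes separately, and skips consistency enforcement. The names `laplace_noise`, `Node`, `build_noisy_prefix_tree`, and the `threshold` and `seed` parameters are all hypothetical. Each of the h levels is assumed to receive ε/h of the budget, and a count is assumed to have sensitivity 1, so the Laplace scale is h/ε.

```python
import random
from dataclasses import dataclass, field
from typing import Optional


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


@dataclass
class Node:
    prefix: tuple                   # locations on the path from the root
    noisy_count: Optional[float]    # noisy number of sequences with this prefix
    children: list = field(default_factory=list)


def build_noisy_prefix_tree(sequences, locations, epsilon, height,
                            threshold, seed=None):
    """Grow a prefix tree level by level, keeping children whose noisy
    count reaches `threshold` (an assumed pruning rule)."""
    rng = random.Random(seed)
    # Assumption: each level gets epsilon / height and counts have
    # sensitivity 1, so every count receives Laplace noise of this scale.
    scale = height / epsilon
    root = Node(prefix=(), noisy_count=None)  # no count is released for the root
    frontier = [(root, list(sequences))]
    for level in range(1, height + 1):
        next_frontier = []
        for node, members in frontier:
            # Partition the node's sequences by their level-th location.
            buckets = {loc: [] for loc in locations}
            for seq in members:
                if len(seq) >= level and seq[level - 1] in buckets:
                    buckets[seq[level - 1]].append(seq)
            # Every location is a candidate child, including empty ones,
            # so the tree shape depends only on noisy counts.
            for loc in locations:
                noisy = len(buckets[loc]) + laplace_noise(scale, rng)
                if noisy >= threshold:
                    child = Node(node.prefix + (loc,), noisy)
                    node.children.append(child)
                    next_frontier.append((child, buckets[loc]))
        frontier = next_frontier
    return root
```

Nodes on the same level hold disjoint sets of sequences, so parallel composition keeps the cost of a level at ε/h, and sequential composition across the h levels gives ε in total. Sampling Laplace noise with floating-point arithmetic as above is fine for illustration but is not considered safe for a production release.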