Differentially Private Transit Data Publication: A Case Study on the Montreal Transportation System

Rui Chen, Concordia University, ru_che@encs.concordia.ca
Benjamin C. M. Fung, Concordia University, fung@ciise.concordia.ca
Bipin C. Desai, Concordia University, BipinC.Desai@concordia.ca
Nériah M. Sossou, Société de transport de Montréal, Neriah.Sossou@stm.info

Introduction

With the deployment of smart card automated fare collection systems, the Société de transport de Montréal (STM, http://www.stm.info), the public transit agency in the Montreal area, collects a huge volume of transit data every day. Transit data is a kind of sequential data: each record is the ordered list of stations a passenger visited.

Table 1. Sample transit data. Li is a station in the transport system.

Rec. #   Path
t1       L1 → L2 → L3
t2       L1 → L2
t3       L3 → L2 → L1
t4       L1 → L2 → L4
t5       L1 → L2 → L3
t6       L3 → L2
t7       L1 → L2 → L4 → L1
t8       L3 → L1

Transit data needs to be shared for diverse reasons, such as profit sharing, business operations, and marketing analysis. However, publishing raw transit data may violate passengers' privacy.

Sanitization Algorithm

We develop a practical sequential data publishing solution under differential privacy that supports several key data analysis tasks. Our solution is composed of two steps: noisy prefix tree construction and private release generation.

Noisy prefix tree construction: We represent a sequential dataset by a prefix tree and adaptively identify prefixes with sufficiently large supports. To reduce added noise, we introduce a hybrid-granularity structure, which decides whether to expand a prefix by first consulting the supports of more general locations. To improve the scalability of our solution, we design a statistical process under the Laplace mechanism. An illustration is given in Figure 1.

Figure 1. Hybrid-granularity prefix tree for sanitization.

Private release generation: We design mechanisms to remove inconsistencies in a noisy prefix tree in order to improve data utility. More specifically, we enforce two sets of consistency constraints in a prefix tree:
1) The noisy count of a node is less than or equal to that of its parent;
2) The noisy count of a node is greater than or equal to the sum of the noisy counts of its children.

Experimental Evaluation

We employ two real-life STM transit datasets: Metro contains 847,668 records over 68 metro stations; Bus contains 778,724 records over 944 bus stations.

For count queries, our solution achieves small relative errors (Figure 2); for frequent sequential pattern mining, our solution maintains high true positives (TP) and low false positives (FP) / false drops (FD) (Table 2).

Figure 2. Average relative error vs. privacy budget under different query sizes |Q|. Panels: (a) max|Q| = 3, (b) max|Q| = 6, (c) max|Q| = 9, (d) max|Q| = 12.

Table 2. Utility for frequent sequential pattern mining on top k most frequent patterns.

k                       100       150       200       250       300
TP (M/B), Simple        99/97     143/139   178/168   209/195   241/212
FP (FD) (M/B), Simple   1/3       7/11      22/32     41/55     59/88
TP (M/B), Hybrid        100/100   149/144   185/177   220/209   257/233
FP (FD) (M/B), Hybrid   0/0       1/6       15/23     30/41     43/67

Our solution is efficient, with runtime complexity of O(|D|∙|L|), where |D| is the dataset size and |L| is the universe size (i.e., the total number of stations).

Figure 3. Runtime vs. |D| and |L|.

Conclusion

We present a practical solution for sanitizing large-scale sequential data. Our solution has been tested on real-life STM transit data and exhibits satisfactory effectiveness and efficiency. We believe that our solution could benefit many other sectors that face the dilemma between the demands of sequential data publishing and privacy protection.

Paper ID: 77
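
The two steps of the sanitization algorithm (noisy prefix tree construction, then consistency enforcement) can be illustrated with a minimal Python sketch on the sample paths of Table 1. This is not the authors' implementation: it uses a plain (non-hybrid) prefix tree, splits the privacy budget evenly across levels, and repairs inconsistencies by clamping and proportional rescaling, all of which are assumptions made for illustration. The names noisy_prefix_tree, enforce_consistency, threshold, and height are hypothetical.

```python
import random

# Sample paths from Table 1 (t1..t8).
PATHS = [
    ("L1", "L2", "L3"), ("L1", "L2"), ("L3", "L2", "L1"), ("L1", "L2", "L4"),
    ("L1", "L2", "L3"), ("L3", "L2"), ("L1", "L2", "L4", "L1"), ("L3", "L1"),
]
UNIVERSE = ["L1", "L2", "L3", "L4"]


def laplace(rng, scale):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_prefix_tree(paths, universe, epsilon, height, threshold, seed=None):
    """Map each retained prefix to its noisy support.

    Assumption: each level receives epsilon / height of the budget, and a
    record affects one prefix count per level, so the noise scale is
    height / epsilon. Every location in the universe is tried as a child,
    including those with zero true support.
    """
    rng = random.Random(seed)
    scale = height / epsilon
    tree = {}
    frontier = [()]  # the empty prefix stands for the root
    for _ in range(height):
        next_frontier = []
        for prefix in frontier:
            for loc in universe:
                child = prefix + (loc,)
                support = sum(1 for p in paths if p[:len(child)] == child)
                noisy = support + laplace(rng, scale)
                if noisy >= threshold:  # expand only well-supported prefixes
                    tree[child] = noisy
                    next_frontier.append(child)
        frontier = next_frontier
    return tree


def enforce_consistency(tree):
    """Top-down repair (an illustrative simplification): clamp counts to
    [0, parent count], then rescale siblings whose sum exceeds the parent."""
    children = {}
    for prefix in tree:
        children.setdefault(prefix[:-1], []).append(prefix)
    fixed = {}
    for parent in sorted(children, key=len):  # parents before their children
        cap = fixed.get(parent)  # None for first-level nodes (no stored root)
        counts = {k: max(tree[k], 0.0) for k in children[parent]}
        if cap is not None:
            counts = {k: min(c, cap) for k, c in counts.items()}
            total = sum(counts.values())
            if total > cap:
                counts = {k: c * cap / total for k, c in counts.items()}
        fixed.update(counts)
    return fixed


release = enforce_consistency(
    noisy_prefix_tree(PATHS, UNIVERSE, epsilon=1.0, height=4, threshold=1.0, seed=7)
)
for prefix, count in sorted(release.items()):
    print(" → ".join(prefix), round(count, 2))
```

Trying every location as a candidate child, rather than only the children that occur in the data, matters for privacy: otherwise the mere presence of a node would reveal that at least one record has that prefix. The cost is that each expanded prefix is scored against the whole universe.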