Mosaic: Quantifying Privacy Leakage in Mobile Networks Ning Xia (Northwestern University) Han Hee Song (Narus Inc.) Yong Liao (Narus Inc.) Marios Iliofotou (Narus Inc.) Antonio Nucci (Narus Inc.) Zhi-Li Zhang (University of Minnesota) Aleksandar Kuzmanovic (Northwestern University) Scenario Different information Different services IP1 … IPK ISP A IP1 … IPK CSP B Dynamic IP, CSP/ISP Different devices 2 Problem Other research work IP1 … IPK ISP A IP1 … IPK CSP B Input packet traces Tessellation Mosaic We are here! How much private information can be obtained and expanded about end users by monitoring network traffic? 3 Motivation I will know everything about everyone! IP1 … IPK ISP A IP1 … IPK CSP B Agencies Bad guys Mobile Traffic: • Relevant: more personal information • Challenging: frequent IP changes 4 Challenges • How to track users when they hop over different IPs? Sessions: Flows(5-tuple) are grouped into sessions IP1 time Traffic Markers: IP2 time IP3 time Identifiers in the traffic that can be used to differentiate users With Traffic Markers, it is possible to connect the users’ true identities to their sessions. 5 Datasets Dataset Source Description 3h-Dataset CSP-A Complete payload 9h-Dataset CSP-A Only HTTP headers Ground Truth Dataset CSP-B Payload & RADUIS info. • • • 3h-Dataset: main dataset for most experiments 9h-Dataset: for quantifying privacy leakage Ground Truth Dataset: for evaluation of session attribution • RADIUS: provide session owners 6 Methodology Overview … IPK ISP A IP1 … IPK CSP B IP1 Tessellation Via traffic markers Traffic attribution Via activity fingerprinting Mapping from sessions to users Network data analysis Mosaic construction Web crawling Combine information from both network data and OSN profiles to infer the user mosaic. 7 Traffic Attribution via Traffic Markers Traffic Markers: • Identifiers in the traffic to differentiate users • Key/value pairs from HTTP header • User IDs, device IDs or sessions IDs Domain Keywords Category Source osn1.com c_user=<OSN1_ID> OSN User ID Cookies osn2.com oauth_token=<OSN2_ID>-## OSN User ID HTTP header admob.com X-Admob-ISU Advertising HTTP header pandora.com user_id User ID Cookies google.com sid Session ID Cookies How can we select and evaluate traffic markers from network data? 8 Traffic Attribution via Traffic Markers OSN IDs as Anchors: • • The most popular user identifiers among all services Linked to user public profiles OSN Source Session Coverage OSN1 ID HTTP URL and cookies 1.3% OSN2 ID HTTP header 1.0% Top 2 OSN providers from North America Only 2.3% sessions contain OSN IDs OSN IDs can be used as anchors, but their coverage on sessions is too small 9 Traffic Attribution via Traffic Markers Block Generation: Group Sessions into Blocks OSN ID Other sessions? ≥δ Session interval δ • Depends on the CSP • δ=60 seconds in our study time IP1 IP Block IP IP 1 time Block • Session group on the same IP within a short period of time • Traffic markers shared by the same block 99K session blocks generated from the 12M sessions 10 Traffic Attribution via Traffic Markers Culling the Traffic Markers: OSN IDs are not enough • Uniqueness: Can the traffic marker differentiate between users? • Persistency: How long does a traffic marker remain the same? Uniqueness Persistency Uniqueness = 1 No two users will share the same google.com#sid value Score 1 0.98 0.96 0.94 0.92 craigslist.org craigslist.org #cl_b#cl_b google.com google.com #sid #sid mydas.mobi mydas.mobi #mac-id #mac-id mobclix.com mobclix.com #u #u pandora.com pandora.com #user_id #user_id mobclix.com #uid #uid OSN1 ID admob.com #isu mobclix.com 0.9 Traffic markers Traffic markers Persistency ~= 1 The value of Google.com#sid remains the same for the same user nearly all the observation duration We pick 625 traffic markers with uniqueness = 1, persistency > 11 Traffic Attribution via Traffic Markers Traffic Attribution: Connecting the Dots Tessellation User Ti ( ) IP 1 Same OSN ID IP 2 Same traffic marker IP 3 Traffic markers are the key in attributing sessions to the same user over different IP addresses 12 Traffic Attribution via Activity Fingerprinting • What if a session block has no traffic markers? Assumption (Activity Fingerprinting): • Users can be identified from the DNS names of their favorite services DNS names: Service classes Service providers • Extracted 54,000 distinct DNS names • Classified into 21 classes Search bing, google, yahoo Chat skype, mtalk.googl.com Dating plentyoffish, date E-commerce amazon, ebay Activity Fingerprinting: Email google, hotmail, yahoo • Favorite (top-k) DNS names as the user’s “fingerprint” News msnbc, ew, cnn Picture Flickr, picasa … … 13 Traffic Attribution via Activity Fingerprinting Y(Fi) • Fi : Top k DNS names from user as “activity fingerprint” • : Uniqueness of the fingerprint 1 0.98 0.96 0.94 0.92 0.9 Y-axis: closer to 1, more distinct the fingerprint is k=4 k=5 k=6 k=7 k=8 0 0.2 0.4 0.6 0.8 Normalizedfingerprint DNS namesIDs Normalized 1 X-axis: normalized by the total number of DNS names Mobile users can be identified by the DNS names from their preferred services 14 Traffic Attribution Evaluation Correct (Not complete) Not correct Session Ri RADIUS user (Ground Truth) Ti Tessellation user (Correct?) Ti Tj Ri identified sessions/users Coverage = ----------------------total sessions/users Rj correctly identified Accuracy on sessions/users = ------------------Covered Set total identified sessions/users 15 Traffic Attribution Evaluation Coverage Evaluation Results 15.70% User Session 43.20% 2.40% OSN ID extraction Accuracy on Covered Set • Via traffic markers 49.80% 78.60% Via activity fingerprinting 100% 99.30% 96.40% User 100% 94.50% 92.50% Session OSN ID extraction 69.00% Via traffic markers Via activity fingerprinting 16 Construction of User Mosaic • Mosaic of Real User Sub-classes: Residence, coordinates, city, state, and etc. Least gain Most gain MOSAIC with 12 information classes(tesserae): • Information (Education, affiliation and etc.) from OSN profiles • Information (Locations, devices and etc.) from users’ network data 17 Quantifying Privacy Leakage • Leakage from OSN profiles vs. from Network Data 15000 12000 # of Users OSN profiles provide static user information (education, interests) Both public OSN profiles & activity analysis Public OSN profiles only Activity analysis only 9000 Analysis on network data provides real-time activities and locations 6000 3000 0 News_info. Content_exch. Entertainment Affiliation E-commerce Education Art_culture Location Social_actvty Association Demographics Device_info. Information from both sides can corroborate to each other Information from OSN profiles and network data can complete and corroborate each other 18 Preventing User Privacy Leakage Protect traffic markers • Traffic markers (OSN IDs and etc.) should be limited and encrypted Restrict 3rd parties • Third party applications/developers should be strongly regulated Protect user profiles • OSN public profiles should be carefully obfuscated 19 Conclusions • Prevalence in the use of OSNs leaves users’ true identities available in the network • Tracking techniques used by mobile apps and services make traffic attribution easier • Sessions can be labeled with network users’ true identities, even without any identity leaks • Various types of information can be gleaned to paint rich digital Mosaic about users 20 Mosaic: Quantifying Privacy Leakage in Mobile Network Thanks! 21