ReCon: Revealing and Controlling PII Leaks in Mobile Network Systems

David Choffnes, Northeastern University
Jingjing Ren, Northeastern University
Ashwin Rao, University of Helsinki
Martina Lindorfer, Vienna Univ. of Technology
Arnaud Legout, INRIA Sophia-Antipolis

Sponsored by:
DTL Workshop, Nov. 2015

Motivation
  - Mobile devices
    - Rich sensors
    - Ubiquitous connectivity
  - Key questions
    - What personal information is transmitted?
    - To whom does it go?
    - What can average users do about it?

How Frequently Is PII Leaked?
  [Figure: fraction of the top 100 apps leaking PII (y-axis, 0 to 0.6) on each platform (App Store, Google Play, WP Store), by PII category: Device Identifier (IMEI, Advertiser ID, MAC, etc.), User Identifier (email, name, gender, etc.), Contact Info, Location, Credential (username, password). Tested in September 2015.]
  - Basic tracking is common
  - Significant fraction of very personal information leaked across all platforms
  - PII leakage is pervasive!

How to Detect PII Leaks in Mobile?
  - At the OS
    - Information flow analysis (static/dynamic/hybrid)
    - OK solutions, but not perfect or easily deployable
  - In the network
    - Independent of OS, app store
    - Easy to detect if you know what PII to search for
    - What if you don't know the PII a priori?

ReCon: Automatically Identifying PII Leaks
  - Hypothesis: PII leaks have distinguishing characteristics
    - Is it just simple key/value pairs (e.g., "user=R3C0N")?
      - Nope, this leads to high false-positive/false-negative (FP/FN) rates
      - Need to learn the structure of PII leaks
  - Approach: Build ML classifiers to reliably detect leaks
    - Does not require knowing PII in advance
    - Resilient to changes in PII leak formats over time
  - We built ReCon
    - Machine learning to reveal PII leaks from mobile devices
    - Software middleboxes to intercept and control leaks
    - Works on all major platforms (iOS, Android, Windows Phone)

ReCon: Viewing detected leaks
  - PII Category: Device Identifiers, Contact Information, User Identifiers, Credentials
  - User Feedback: Correct, Incorrect, Not sure, Not about me

Where They Know You've Been
  - Location information is hard to digest using text alone
  - WTKYB shows just how pervasive location tracking is
  - Creepiness factor to help users care more about privacy(?)

Mitigating PII Leaks
  - ReCon gives users control over leaks
  - Example simple strategies
    - Block PII
    - Modify PII
      - Randomize identifiers
      - Coarsen locations
  - Advanced mitigation (under dev)
    - Mock user profiles
    - Provide k-anonymity

How does ReCon work?
  Key challenges for ML-based PII detection
  - Which classifier do we use?
    - C4.5 Decision Tree is the best trade-off between speed and accuracy
  - How do we train the classifier?
    - Use traces from real users and controlled experiments
    - Break flows into separate words that may indicate a leak
    - Feature selection for scalability
  - How well are we doing?
    - Controlled experiments
    - In the wild: Only the users themselves know for sure!
      - Crowdsourced reinforcement

Key Results: ReCon accuracy
  - How accurate is ReCon?
    - 99% overall accuracy from controlled experiments
    - FPR: 2.2%, FNR: 3.5%
  - Why?
    - Per-domain classifiers
    - Decision tree captures non-trivial cases

Key Results: ReCon Has Good Coverage
  - How does it compare to other solutions?
  [Figure: fraction of total leaks found (y-axis, 0% to 100%) by PII category (Device Identifier, User Identifier, Contact Info, Location) for FlowDroid (static IFA), Andrubis (dynamic IFA), AppAudit (hybrid IFA), and ReCon.]
  - ReCon finds significantly more PII than IFA solutions
  - ReCon successfully identifies missing leaks after retraining

Key Results: User study
  - IRB-approved user study
    - 24 iOS, 13 Android devices
    - 20/26 responses: system useful & behavior change
    - 165 cases of credential leaks, 94 verified
    - Average leaks: iOS > Android
  - Unexpected, suspicious leaks
    - Recipe/cooking app tracks location
    - Video/Game/News app leaks gender
    - And more… Check out http://recon.meddle.mobi

Summary
  - ReCon: Provides transparency/control over PII leaks
    - Relies only on access to network traffic (OS independent)
    - Machine learning to automatically identify PII leaks
    - Crowdsourced reinforcement with user feedback
  - Works today! Check out http://recon.meddle.mobi
  - Questions?
  David Choffnes, choffnes@ccs.neu.edu

Backups

Encryption and ReCon
  - What is your answer for increasing use of encryption?
  - ReCon needs access only to plaintext flows
    - mcTLS, BlindBox
    - Route to trusted middlebox that can do MITM
      - Works for most apps, but usually not logins
    - Haystack (on Android device)

Encryption: What is leaked?
  - Leaks over SSL (not much)
    - Send PII to trackers over SSL (100 apps/device)
      - 6 iOS
      - 2 Android
      - 1 Windows
  - Problem with SSL
    - Certificate pinning
    - Not working with VPN enabled
  - Obfuscation
    - Little evidence in controlled experiment using IFA

Other applications of ReCon
  - K-anonymity
  - Explicit sharing
    - Allow users to control how much is shared with third parties
  - Obfuscation
    - Retrain classifiers to identify obfuscated leaks
    - Use static/dynamic analysis tools that are resilient to evasion techniques

Deployment models
  - ReCon only needs access to network flows
  - VPN proxy (current deployment): tunnel to proxy server
    - Currently supported by all mobile OSes
    - Can run VMs anywhere in the world
  - Raspberry Pi
    - In home network
    - Enables HTTPS decryption with minimal additional risks
  - On device
    - Haystack (on Android)
  - In network
    - Awazza and other APN/middlebox deployment models

Methodology Details
  - Controlled experiments as ground truth
  - Text classification approaches
    - Problem: Given a network flow, does it contain PII or not?
    - Feature extraction: Bag-of-words model
      - Example.com /someevent?x=1&y=2 {"z":"xx@y"}
      - Words: someevent, x, 1, y, 2, z, xx@y
  - Per-domain classifiers (e.g., Google-Analytics)
    - Faster (compared to one-for-all)
    - More accurate
  - Library: Weka

Why Run ReCon?
  - User incentives
    - Control over data leaks!
    - Blocking unwanted content
    - k-anonymity for increased privacy
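
Methodology Sketch: Bag-of-Words Features

The feature extraction step on the Methodology Details slide (break a flow into words, then classify per domain) might be sketched as below. This is a minimal illustration in Python, not the ReCon implementation, which the slides say is built on Weka. The function names (flow_words, bag_of_words, group_by_domain), the delimiter set, and the example vocabulary are all assumptions made for this sketch. The host is assumed to be handled separately, since it only selects which per-domain classifier to use.

```python
import re
from collections import defaultdict

# Assumed delimiter set: URL, query-string and JSON punctuation plus whitespace.
# '@' and '.' are deliberately not delimiters, so a value like "xx@y" stays one word.
_DELIMS = re.compile(r'[\s/?&=:;,{}\[\]"]+')


def flow_words(text):
    """Split the path/query/body of a flow into candidate words."""
    return [w for w in _DELIMS.split(text) if w]


def bag_of_words(words, vocabulary):
    """Binary feature vector: 1 if the vocabulary word occurs in the flow, else 0."""
    present = set(words)
    return [1 if v in present else 0 for v in vocabulary]


def group_by_domain(flows):
    """Group (domain, text) pairs so that each domain can train its own classifier."""
    per_domain = defaultdict(list)
    for domain, text in flows:
        per_domain[domain.lower()].append(flow_words(text))
    return dict(per_domain)


# The example flow from the Methodology Details slide.
words = flow_words('/someevent?x=1&y=2 {"z":"xx@y"}')
print(words)  # ['someevent', 'x', '1', 'y', '2', 'z', 'xx@y']

# Hypothetical vocabulary, as might remain after feature selection.
print(bag_of_words(words, ["x", "z", "password"]))  # [1, 1, 0]
```

Each per-domain list of word vectors would then be fed to a decision-tree learner. The choice of delimiters matters: splitting on '@' or '.' would break apart email addresses, which are exactly the kind of value a leak detector needs to see whole.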