Slides - Data Transparency Lab Conference

ReCon: Revealing and Controlling PII Leaks in Mobile Network Systems
David Choffnes, Northeastern University
Jingjing Ren, Northeastern University
Ashwin Rao, University of Helsinki
Martina Lindorfer, Vienna Univ. of Technology
Arnaud Legout, INRIA Sophia-Antipolis
Sponsored by:
DTL Workshop, Nov. 2015
Motivation
Mobile devices
- Rich sensors
- Ubiquitous connectivity

Key questions
- What personal information is transmitted?
- To whom does it go?
- What can average users do about it?
How Frequently Is PII Leaked?
Fraction of top 100 apps leaking PII (tested in September 2015)
[Figure: bar chart comparing App Store, Google Play, and WP Store; y-axis from 0 to 0.6; PII categories: Device Identifier (IMEI, Advertiser ID, MAC, etc.), User Identifier (email, name, gender, etc.), Contact Info, Location, Credential (username, password)]
- Basic tracking is common
- Significant fraction of very personal information leaked across all platforms
- PII leakage is pervasive!
How to Detect PII Leaks in Mobile?
At the OS
- Information flow analysis (static/dynamic/hybrid)
- OK solutions, but not perfect or easily deployable

In the network
- Independent of OS, app store
- Easy to detect if you know what PII to search for
What if you don’t know the PII a priori?
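When the PII values are known in advance, in-network detection reduces to string matching on plaintext flows. The sketch below illustrates that baseline only; the function name, the sample values, and the choice to also check the URL-decoded payload are assumptions made for illustration, not ReCon's implementation.

```python
from urllib.parse import unquote_plus


def find_known_pii(payload: str, known_pii: dict) -> list:
    """Return the labels of known PII values that appear in a plaintext flow."""
    # Check both the raw and the URL-decoded payload, case-insensitively.
    haystacks = {payload.lower(), unquote_plus(payload).lower()}
    return sorted(
        label
        for label, value in known_pii.items()
        if value and any(value.lower() in h for h in haystacks)
    )


# Hypothetical values, for illustration only.
known = {"email": "alice@example.com", "imei": "490154203237518"}
flow = "GET /track?uid=490154203237518&e=alice%40example.com HTTP/1.1"
print(find_known_pii(flow, known))  # → ['email', 'imei']
```

This only works when the searched values are listed up front, which is the limitation the closing question on this slide points at.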
ReCon: Automatically Identifying PII Leaks
Hypothesis: PII leaks have distinguishing characteristics

Is it just simple key/value pairs (e.g., "user=R3C0N")?
- Nope, this leads to high FP/FN rates
- Need to learn the structure of PII leaks

Approach: Build ML classifiers to reliably detect leaks
- Does not require knowing PII in advance
- Resilient to changes in PII leak formats over time

We built ReCon
- Machine learning to reveal PII leaks from mobile devices
- Software middleboxes to intercept and control leaks
- Works on all major platforms (iOS, Android, Windows Phone)
ReCon: Viewing detected leaks
- PII Category
  - Device Identifiers
  - Contact Information
  - User Identifiers
  - Credentials
- User Feedback
  - Correct
  - Incorrect
  - Not sure
  - Not about me
Where They Know You’ve Been
- Location information is hard to digest using text alone
- WTKYB shows just how pervasive location tracking is
- Creepiness factor to help users care more about privacy(?)
Mitigating PII Leaks
- ReCon gives users control over leaks
- Example simple strategies
  - Block PII
  - Modify PII
  - Randomize identifiers
  - Coarsen locations
- Advanced mitigation (under dev)
  - Mock user profiles
  - Provide k-anonymity
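Two of the simple strategies, randomizing identifiers and coarsening locations, can be sketched in a few lines. This is an illustration under stated assumptions: the function names, the hex replacement format, and the one-decimal rounding are hypothetical choices, and the slides do not say how ReCon implements either strategy.

```python
import secrets


def randomize_identifier(identifier: str) -> str:
    """Replace an identifier with random hex digits of the same length."""
    # secrets.token_hex(n) returns 2*n hex characters, so generate enough and trim.
    return secrets.token_hex((len(identifier) + 1) // 2)[: len(identifier)]


def coarsen_location(lat: float, lon: float, decimals: int = 1) -> tuple:
    """Round coordinates; one decimal place is roughly 11 km of latitude."""
    return (round(lat, decimals), round(lon, decimals))


# Hypothetical values, for illustration only.
print(randomize_identifier("490154203237518"))  # random, different on every call
print(coarsen_location(42.3398, -71.0892))      # → (42.3, -71.1)
```

Keeping the replacement the same length as the original identifier is a design choice meant to reduce the chance that the receiving server rejects the request.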
How does ReCon work?
Key challenges for ML-based PII detection

Which classifier do we use?
- C4.5 Decision Tree is the best trade-off between speed and accuracy

How do we train the classifier?
- Use traces from real users and controlled experiments
- Break flows into separate words that may indicate a leak
- Feature selection for scalability

How well are we doing?
- Controlled experiments
- In the wild: Only the users themselves know for sure!
- Crowdsourced reinforcement
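The training steps above (words per flow, then feature selection) can be illustrated with a small sketch. The slides do not state which selection method is used; the document-frequency threshold, the function names, and the sample words below are assumptions for illustration.

```python
from collections import Counter


def select_features(flows, min_flows=2):
    """Keep words that occur in at least `min_flows` flows (document frequency)."""
    doc_freq = Counter(word for words in flows for word in set(words))
    return sorted(w for w, n in doc_freq.items() if n >= min_flows)


def to_vector(words, vocabulary):
    """Binary bag-of-words vector: 1 if the vocabulary word is present in the flow."""
    present = set(words)
    return [1 if w in present else 0 for w in vocabulary]


# Hypothetical tokenized flows, for illustration only.
flows = [
    ["track", "uid", "email"],
    ["track", "uid", "session"],
    ["ads", "uid"],
]
vocab = select_features(flows)
print(vocab)                             # → ['track', 'uid']
print(to_vector(["uid", "lat"], vocab))  # → [0, 1]
```

Vectors of this kind are what a decision-tree learner such as C4.5 would take as input; dropping rare words keeps the feature space small.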
Key Results: ReCon accuracy
How accurate is ReCon?
- 99% overall accuracy from controlled experiments
- FPR: 2.2%, FNR: 3.5%
- Why?
  - Per-domain classifiers
  - Decision tree captures non-trivial cases
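For reference, accuracy, false-positive rate (FPR), and false-negative rate (FNR) follow from confusion-matrix counts as sketched below. The counts are hypothetical and are not the study's data; they only show how the three figures are defined.

```python
def rates(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, FPR, and FNR from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,  # share of all flows labeled correctly
        "fpr": fp / (fp + tn),          # clean flows wrongly flagged as leaks
        "fnr": fn / (fn + tp),          # leaking flows that were missed
    }


# Hypothetical counts, for illustration only.
result = rates(tp=80, fp=10, tn=90, fn=20)
print(result)
```

Because accuracy is computed over all flows while FPR and FNR are computed per class, the three numbers can differ considerably when leaking flows are a small share of the traffic.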
Key Results: ReCon Has Good Coverage
How does it compare to other solutions?
[Figure: bar chart of the fraction of total leaks found (0% to 100%) by FlowDroid (static IFA), Andrubis (dynamic IFA), AppAudit (hybrid IFA), and ReCon, for Device Identifier, User Identifier, Contact Info, and Location]
- ReCon finds significantly more PII than IFA solutions
- ReCon successfully identifies missing leaks after retraining
Key Results: User study
IRB-approved user study
- 24 iOS, 13 Android devices
- 20/26 responses: system useful & behavior change
- 165 cases of credential leaks, 94 verified
- Average leaks: iOS > Android
- Unexpected, suspicious leaks
  - Recipe/cooking app tracks location
  - Video/Game/News app leaks gender
  - And more…
- Check out http://recon.meddle.mobi
Summary
ReCon: Provides transparency/control over PII leaks
- Relies only on access to network traffic (OS independent)
- Machine learning to automatically identify PII leaks
- Crowdsourced reinforcement with user feedback
- Works today! Check out http://recon.meddle.mobi
Questions?
David Choffnes
choffnes@ccs.neu.edu
Backups
Encryption and ReCon
What is your answer to the increasing use of encryption?
- ReCon needs access only to plaintext flows
- mcTLS, BlindBox
- Route to trusted middlebox that can do MITM
  - Works for most apps, but usually not logins
- Haystack (on Android device)
Encryption: What is leaked?
Leaks over SSL (not much)
- Send PII to trackers over SSL (100 apps/device)
  - 6 iOS
  - 2 Android
  - 1 Windows
- Problem with SSL
  - Certificate pinning
  - Not working with VPN enabled

Obfuscation
- Little evidence in controlled experiment using IFA
Other applications of ReCon
K-anonymity

Explicit sharing
- Allow users to control how much is shared with third parties

Obfuscation
- Retrain classifiers to identify obfuscated leaks
- Use static/dynamic analysis tools that are resilient to evasion techniques
Deployment models
ReCon only needs access to network flows
- VPN proxy (current deployment): tunnel to proxy server
  - Currently supported by all mobile OSes
  - Can run VMs anywhere in the world
- Raspberry Pi
  - In home network
  - Enables HTTPS decryption with minimal additional risks
- On device
  - Haystack on Android
- In network
  - Awazza and other APN/middlebox deployment models
Methodology Details
Controlled experiments as ground truth

Text classification approaches
- Problem: Given a network flow, does it contain PII or not?
- Feature extraction: Bag-of-words model
  - Example.com/someevent?x=1&y=2 {"z":"xx@y"}
  - Words: someevent, x, 1, y, 2, z, xx@y
- Per-domain classifiers (e.g., Google-Analytics)
  - Faster (compared to one-for-all)
  - More accurate
- Library: Weka
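The bag-of-words example above can be reproduced with a short tokenizer. This is a sketch under assumptions: the delimiter set and the function name are chosen for illustration, and the domain is left out of the input because it selects the per-domain classifier rather than contributing words.

```python
import re

# Delimiters assumed for this sketch: URL and JSON punctuation, quotes, whitespace.
_DELIMITERS = re.compile(r"""[/?&=:{}\[\],;"'“”\s]+""")


def extract_words(request: str) -> list:
    """Split the path, query string, and body of a flow into candidate words."""
    return [word for word in _DELIMITERS.split(request) if word]


print(extract_words('/someevent?x=1&y=2 {"z":"xx@y"}'))
# → ['someevent', 'x', '1', 'y', '2', 'z', 'xx@y']
```

The `@` character is deliberately not a delimiter here, so that values such as `xx@y` stay intact as single words, matching the word list on the slide.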
Why Run ReCon?
User incentives
- Control over data leaks!
- Blocking unwanted content
- k-anonymity for increased privacy