Unintended Consequences of Data Sharing Laws and Rules
Sam Weber, Software Engineering Institute, CMU

Thesis
• Laws and regulations concerning data sharing and privacy often have unintended consequences

Problem Space
• Three dimensions of the issue:
  – What is ethical?
  – What is legal?
    • Usually, laws are an attempt to codify ethical rules.
  – Utility
    • What are people trying to accomplish?
• The dimensions often conflict: “No useful database can ever be perfectly anonymous, and as the utility of data increases, the privacy decreases.”
  – Paul Ohm, “Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization”

Common Issues
• Sharing of cybersecurity data
  – Can defenders collect/share personal information about people “for good”?
• Health care data
  – Who has what rights over a patient’s medical data?
    • Patients, doctors and insurance companies all have different interests
  – What rules should protect people whose data is used for medical research?
• General
  – What is personally identifiable information? What is privacy?
  – What about data that has already “gone out”?
  – What about data that can be inferred from already-revealed information?
  – How to deal with changed laws/technology?
    • When laws or technology change, what happens to existing databases?
  – What happens in cases of international data sharing?

Unintended Consequences
• Three real-world situations:
  1. Medical research and skull-stripping
  2. Botnets and intrusion data
  3. Burglar/“Girls Around Me” applications

Medical Research
• Biomedical research is strongly dependent on access to health data
  – ex: brain scans for Alzheimer’s research
• Want to protect the privacy of people whose data is used
• Two general approaches:
  – Privacy/USA: generally industry-specific (HIPAA, Driver’s Privacy Protection Act, …)
    • HIPAA identifies Personally Identifying Information
  – Privacy/EU: global, Data Protection Directive
    • PII: “anything that can be used to identify you”
  – The first is often ineffective, the second unstable

Skull Stripping and Biomedical Research
• The HIPAA Privacy Rule permits use/disclosure of data stripped of patient identifiers without authorization
  – Informed consent is otherwise difficult to obtain/manage
• Problem: from an MRI of the head, one can reconstruct the face
• Solution: “skull stripping”/“defacing” algorithms
  – ex: Bischoff-Grethe et al., “A technique for the de-identification of structural brain MR images”
• Notice: NO real threat!
  – Probably entirely wasted effort

Botnets
• Situation: a researcher discovers a botnet C&C on a university machine
  – Allows the botnet to continue to run, but observes it
  – Discovers how the botnet works and finds ways to defeat it
• First question: is this ethical?
  – Con: the researcher-controlled machine is knowingly attacking innocent people
  – Pro: the researcher isn’t making the existing situation worse, and is in the long run making people more secure

Consequences
• Current botnet strategy
  – Take over a victim machine, then cause it to do something illegal; only do “real” activity if the illegal activity took place
  – Effectively disables defenders who are bound not to allow illegal activity
• Strategy if defenders aren’t allowed to violate victims’ privacy:
  – Bind all command-and-control activity to PII of victims
  – Inhibit all data sharing!
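The defacing approach from the skull-stripping slide can be sketched in miniature: keep only voxels inside a brain mask and wipe everything else, so facial surface features cannot be rebuilt from the volume. This is a toy illustration, not the Bischoff-Grethe et al. algorithm (which uses atlas registration to derive the mask); the `deface` function and the tiny nested-list "scan" are invented for the demo.

```python
# Toy "defacing" sketch: zero out voxels that lie outside a precomputed
# brain mask, so the facial surface cannot be reconstructed from the scan.
# Real tools derive the mask by atlas registration; here it is given.

def deface(volume, brain_mask, fill=0):
    """Return a copy of `volume` with voxels outside `brain_mask` set to
    `fill`. Both arguments are nested [z][y][x] lists of equal shape."""
    return [
        [
            [v if keep else fill for v, keep in zip(row, mask_row)]
            for row, mask_row in zip(plane, mask_plane)
        ]
        for plane, mask_plane in zip(volume, brain_mask)
    ]

# A 1x2x3 "scan": masked-out (face) voxels are wiped, brain voxels kept.
scan = [[[10, 20, 30], [40, 50, 60]]]
mask = [[[False, True, True], [True, True, False]]]
print(deface(scan, mask))  # [[[0, 20, 30], [40, 50, 0]]]
```

The shared data still supports volumetric analysis of the retained region, which is the point of the HIPAA-driven exercise: utility inside the mask, nothing identifying outside it.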
Burglary App
• Researchers asked an IRB for permission to write an app that:
  – Collected public photos from Flickr, Facebook and other places
  – Automatically located people who:
    • Had a discoverable home address
    • Lived close to the researchers
    • Were currently on vacation more than 1000 miles away
• The IRB granted complete permission (to the researchers’ surprise)
  – The information is already publicly available
  – Okay even to publicly release the application

“Girls Around Me” App
• Used information from foursquare, Facebook, etc. to:
  – Locate girls/boys currently physically close to the user
  – Display photos and bios of said girls/boys
• Is there any difference between the two applications?
  – We know that privacy attacks are currently being carried out, but unpublished; if research is prohibited, then defenders are hampered

Conclusions
• Need to consider the implications of policies
  – Threat models are useful:
    • What threats are you intending to counter?
    • How will attackers respond to the policy?
  – Need to consider utility/social good

BACKUP SLIDES

Dead Sea Scrolls
• Scrolls found in the 1940s
  – Access controlled by the owners; the majority were still unreleased by 1990
• A concordance was prepared in the 1950s:
  – Used for linguistic analysis
  – An alphabetical listing of the words in a document, along with the words immediately before and after each
• 1991: Wacholder and Abegg reconstructed the scrolls from the concordance

Legal/Ethical Issues
• Laws aren’t logical rules; they conflict, are ambiguous, and change over time
• Laws
  – Privacy/USA: generally industry-specific (HIPAA, Driver’s Privacy Protection Act, …)
    • HIPAA identifies Personally Identifying Information
  – Privacy/EU: global, Data Protection Directive
    • PII: “anything that can be used to identify you”
  – Network data: variety of wiretapping laws, etc.
• Even if an action is legal, it may not be ethical
  – Laws often lag technology
• Laws constrain big data solutions
  – Given data from multiple sources, what are the applicable laws? What happens when laws change? What exact purposes can the data be used for? What are the restrictions upon the analyses that can be performed?
  – Real issue: certain experiments can be conducted at some US universities but not others, because of different IRB rulings.

Data Creation/Storage
• What data do you store, and how?
  – Store anonymized data? Create synthetic data?
  – What about data gathered under different anonymization requirements?
• When laws/attacks change, how do you recover?
  – What meta-data is needed about the origin of the data?

Anonymization
• Long history of de-anonymization work
• Inherent tradeoff between usability and privacy/security
  – Attacker models are often unclear
• Potential solution: keep track of information already disclosed
  – Result: self-destructing databases
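The Dead Sea Scrolls incident in the backup slides is easy to demonstrate: a concordance that records each word with its immediate neighbors leaks enough structure to chain the original text back together. The sketch below is a deliberately tiny version, assuming every word occurs only once (the real 1991 reconstruction coped with far messier data); the function names are invented for this demo.

```python
# Toy reconstruction of a text from its concordance, in the spirit of
# Wacholder and Abegg's 1991 Dead Sea Scrolls reconstruction.
# Assumption (for simplicity): each word appears exactly once.

def make_concordance(text):
    """Alphabetical list of (word, previous word, next word) triples."""
    words = text.split()
    entries = []
    for i, w in enumerate(words):
        prev = words[i - 1] if i > 0 else None
        nxt = words[i + 1] if i < len(words) - 1 else None
        entries.append((w, prev, nxt))
    return sorted(entries, key=lambda t: t[0])

def reconstruct(concordance):
    """Chain the (word, prev, next) triples back into the original text."""
    by_word = {w: (p, n) for w, p, n in concordance}
    # The entry with no predecessor is the first word of the text.
    word = next(w for w, (p, n) in by_word.items() if p is None)
    out = [word]
    while by_word[word][1] is not None:   # follow the "next word" links
        word = by_word[word][1]
        out.append(word)
    return " ".join(out)

secret = "no useful database can ever be perfectly anonymous"
conc = make_concordance(secret)           # what the scholars released
print(reconstruct(conc))                  # what everyone else recovered
# -> no useful database can ever be perfectly anonymous
```

This is the deck's "data that can be inferred from already-revealed information" problem in eight lines: the owners released only a derived artifact for linguistic analysis, yet the derivation was invertible.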