Privacy and Data Mining: Friends or Foes?
Rakesh Agrawal, IBM Almaden Research Center

Theme
DILEMMA: Applications abound where data mining can do enormous good, but it is vulnerable to misuse in misguided hands.
GOAL: Understand the concerns with data mining and identify research directions that may address those concerns.
QUESTIONS:
- Perceived concerns with data mining
- How real are those concerns?
- What the data mining community is doing to address the concerns
- What more needs to be done

Panelists
- James Dempsey, Center for Democracy & Technology
- Daniel Gallington, Potomac Institute
- Lawrence Cox, National Center for Health Statistics
- Bhavani Thuraisingham, National Science Foundation
- Latanya Sweeney, Carnegie Mellon University
- Christopher Clifton, Purdue University
- Jeff Ullman, Stanford University

Plan
- Position statements: 6 minutes each
- Rejoinders: 2 minutes each
- Questions and observations from the floor
- Closing statements: 1 minute each

Privacy and Data Mining
Daniel J. Gallington
The Potomac Institute for Policy Studies
KDD 2003, August 25, 2003

New Information Technology and Privacy: Status of the Debate
• Demonization of science
• Technology development vs. the policy/legal "envelope"
• Rules vs. process
• Enablement vs. disablement
• Secrecy
• When the dust settles: what could work?

Data Mining and Privacy: Friends or Foes?
Dr. Bhavani Thuraisingham, The National Science Foundation, August 2003

Definitions
Data mining: the process of a user analyzing large amounts of data, using techniques from statistical reasoning and machine learning, and discovering information that was often previously unknown.
Data fusion: the process of associating records from two (or more) databases, e.g., medical records and grocery store purchases.
Privacy problem: a user U poses queries and deduces, from responses U is authorized to see, information about an individual or a group of individuals G that U is not authorized to see and that is deemed private by either G or some authority.

Some Data Mining Applications
Medical and healthcare
- Mining genetic and medical databases to find links between genetic composition and diseases
Security
- Analyzing travel records, spending patterns, and associations between people to identify potential terrorists
- Examining audit data to detect unauthorized network intrusions
- Mining credit card transactions, telephone calls, and other related data to detect fraud and identity theft
Marketing, sales, and finance
- Understanding the preferences of groups of consumers

Some Privacy Concerns
Medical and healthcare
- Employers, marketers, or others learning of private medical concerns
Security
- Allowing access to an individual's travel and spending data
- Allowing access to web surfing behavior
Marketing, sales, and finance
- Allowing access to an individual's purchases

Data Mining as a Threat to Privacy
Data mining gives us "facts" that are not obvious to human analysts of the data. Can general trends across individuals be determined without revealing information about individuals?
Possible threats: combining collections of data and inferring information that is private, e.g.:
- Disease information from prescription data
- Military action from pizza deliveries to the Pentagon
We need to protect the associations and correlations between data that are sensitive or private. (A small linking sketch follows below.)
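To make the inference threat concrete, here is a minimal sketch in Python (all records, names, and fields are hypothetical) of data fusion: joining prescription records to a customer list on shared quasi-identifiers yields a private fact, a person's likely disease, that neither collection reveals alone.

```python
# Minimal sketch (hypothetical data and field names): inferring a private
# fact by combining two collections, as in the "disease information from
# prescription data" example above.

# A pharmacy's name-free prescription records.
prescriptions = [
    {"zip": "37213", "birth": "9/1960", "sex": "F", "drug": "AZT"},
    {"zip": "15213", "birth": "3/1971", "sex": "M", "drug": "insulin"},
]

# A marketing list that attaches names to the same quasi-identifiers.
customers = [
    {"name": "Ann", "zip": "37213", "birth": "9/1960", "sex": "F"},
    {"name": "Bob", "zip": "15213", "birth": "3/1971", "sex": "M"},
]

# A drug strongly suggests a condition -- the inferred, private fact.
drug_to_condition = {"AZT": "HIV", "insulin": "diabetes"}

QUASI = ("zip", "birth", "sex")

def key(rec):
    # The shared attributes on which the two collections can be joined.
    return tuple(rec[f] for f in QUASI)

by_key = {key(c): c["name"] for c in customers}
for rx in prescriptions:
    name = by_key.get(key(rx))
    if name:
        print(name, "is likely being treated for", drug_to_condition[rx["drug"]])
```

Neither collection states "Ann has HIV"; the association between the two is what must be protected.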
Some Privacy Problems and Potential Solutions
Problem: privacy violations that result from data mining
- Potential solution: privacy-preserving data mining
Problem: privacy violations that result from the inference problem
- Inference is the process of deducing sensitive information from the legitimate responses received to user queries
- Potential solution: privacy constraint processing
Problem: privacy violations due to unencrypted data
- Potential solution: encryption at different levels
Problem: privacy violations due to poor system design
- Potential solution: develop a methodology for designing privacy-enhanced systems

Some Research Directions: Privacy-Preserving Data Mining
Prevent useful results from mining
- Introduce "cover stories" to give "false" results
- Make only a sample of the data available, so that an adversary is unable to come up with useful rules and predictive functions
Randomization
- Introduce random values into the data and/or results
- The challenge is to introduce random values without significantly affecting the data mining results
- Give ranges of values for results instead of exact values
Secure multi-party computation
- Each party knows only its own inputs; encryption techniques are used to compute the final results

Some Research Directions: Privacy Constraint Processing
Privacy constraint processing
- Based on prior research in security constraint processing
- Simple constraint: an attribute of a document is private
- Content-based constraint: if a document contains information about X, then it is private
- Association-based constraint: two or more documents taken together are private; individually each document is public
- Release constraint: after X is released, Y becomes private
Augment a database system with a privacy controller for constraint processing (a minimal controller sketch appears at the end of this part)

Some Research Directions: Encryption for Privacy
Encryption at various levels
- Encrypting the data as well as the results of data mining
- Encryption for multi-party computation
Encryption for untrusted third-party publishing
- The owner enforces privacy policies
- The publisher gives the user only those portions of the document he or she is authorized to access
- A combination of digital signatures and Merkle hashes ensures privacy

Some Research Directions: Methodology for Designing Privacy Systems
- Jointly develop privacy policies with policy specialists
- A specification language for privacy policies
- Generate privacy constraints from the policy and check the constraints for consistency
- Develop a privacy model
- A privacy architecture that identifies privacy-critical components
- Design and develop privacy enforcement algorithms
- Verification and validation
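To make the constraint types above concrete, here is a minimal sketch in Python of a privacy controller that screens each requested release. It is an invented toy, not a description of any deployed system: the attribute names, topics, and document identifiers are hypothetical, and a real controller would be integrated with the database query processor.

```python
# Minimal sketch (illustrative names) of a privacy controller enforcing
# the four constraint types: simple, content-based, association-based,
# and release constraints.

class PrivacyController:
    def __init__(self):
        self.private_attrs = {"ssn"}                    # simple constraints
        self.private_topics = {"psychiatric"}           # content-based
        self.assoc_sets = [{"doc_salary", "doc_name"}]  # private in combination
        self.release_rules = {"doc_x": "doc_y"}         # once X out, Y private
        self.released = set()
        self.now_private = set()

    def allow(self, doc_id, attrs, topics):
        # Simple constraint: a private attribute may never be returned.
        if attrs & self.private_attrs:
            return False
        # Content-based constraint: a document about a private topic is private.
        if topics & self.private_topics:
            return False
        # Release constraint: Y became private after X was released.
        if doc_id in self.now_private:
            return False
        # Association-based constraint: block a release that would complete
        # a set of documents that is private only in combination.
        for s in self.assoc_sets:
            if doc_id in s and s <= (self.released | {doc_id}):
                return False
        # Record the release and trigger any release constraints.
        self.released.add(doc_id)
        if doc_id in self.release_rules:
            self.now_private.add(self.release_rules[doc_id])
        return True

pc = PrivacyController()
print(pc.allow("doc_salary", set(), set()))  # True: public on its own
print(pc.allow("doc_name", set(), set()))    # False: association violated
print(pc.allow("doc_x", set(), set()))       # True, but doc_y becomes private
print(pc.allow("doc_y", set(), set()))       # False: release constraint
```

Note that the controller must remember release history; association-based and release constraints cannot be checked statelessly.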
Data Mining and Privacy: Friends or Foes?
- They are neither friends nor foes.
- We need advances in both data mining and privacy.
- We need to design flexible systems:
  - For some applications one may have to focus entirely on "pure" data mining, while for others there may be a need for "privacy-preserving" data mining.
  - We need flexible data mining techniques that can adapt to changing environments.
- Technologists, legal specialists, social scientists, policy makers, and privacy advocates MUST work together.

Some NSF Projects Addressing Privacy
Privacy-preserving data mining
- Distributed data mining techniques to replicate or approximate the results of centralized data mining, with quantifiable limits on the disclosure of data from each party (see the secure-sum sketch after this part)
Privacy for supply chain management
- Secure Supply-Chain Collaboration protocols to enable supply-chain partners to cooperatively achieve desired system-wide goals without revealing any private information, even though the jointly computed decisions may depend on the private information of all the parties
Privacy model
- A model for privacy based on a secure query protocol, encryption, and database organization, with little trust placed in the client or the server

Other Ideas and Directions? Please contact:
Dr. Bhavani Thuraisingham
The National Science Foundation
Suite 1115, 4201 Wilson Blvd, Arlington, VA 22230
Phone: 703-292-8930; Fax: 703-292-9037
Email: bthurais@nsf.gov
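As a flavor of how parties might jointly compute a data mining aggregate without disclosing their inputs, here is a minimal secure-sum sketch in Python. It only simulates the idea (the sites and counts are invented; real protocols pass the masked running total from party to party and must also address collusion), but sums like this are building blocks for distributed counts, averages, and association-rule support.

```python
# Minimal sketch (toy protocol) of distributed computation without
# disclosure: parties compute a global sum while each keeps its own
# input hidden behind a random mask.
import random

random.seed(3)
MOD = 2**32  # arithmetic is modular, so masked partial sums reveal nothing

def secure_sum(private_inputs):
    # The initiator starts the chain with a random mask.
    mask = random.randrange(MOD)
    running = mask
    for x in private_inputs:
        running = (running + x) % MOD  # each party adds its secret value
    return (running - mask) % MOD      # the initiator removes the mask

sites = [120, 75, 310]  # e.g., local support counts at three parties
print("global count:", secure_sum(sites))  # 505, with no site's count revealed
```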
Technologies for Privacy
Latanya Sweeney, Ph.D.
Assistant Professor of Computer Science, Technology and Policy
School of Computer Science, Carnegie Mellon University
latanya @ privacy.cs.cmu.edu
http://privacy.cs.cmu.edu/people/sweeney/index.html

Address 4 Questions
1. Concerns with data mining
2. How real are those concerns?
3. What the data mining community is doing to address those concerns
4. What more needs to be done
(L. Sweeney. Navigating Computer Science Research Through Waves of Privacy Concerns. 2003. http://privacy.cs.cmu.edu/index.html)

Address 4 Questions: the short answers
1. Concerns with data mining: the demand for person-specific data
2. How real are those concerns: an explosion in collected information; the individual bears the risks and harms
3. What the data mining community is doing: privacy-preserving data mining, which is too limited
4. What more needs to be done: construct technology with provable guarantees of privacy protection ("privacy technology")

Privacy Technology Center: Core People
Anastassia Ailamaki, Chris Atkeson, Guy Blelloch, Manuel Blum, Jamie Callan, Jamie Carbonell, Kathleen Carley, Robert Collins, Lorrie Cranor, Samuel Edoho-Eket, Maxine Eskenazi, Scott Fahlman, David Farber, David Garlan, Ralph Gross, Alex Hauptmann, Takeo Kanade, Bradley Malin, Bruce Maggs, Tom Mitchell, Norman Sadeh, William Scherlis, Jeff Schneider, Henry Schneiderman, Michael Shamos, Mel Siegel, Daniel Siewiorek, Asim Smailagic, Peter Steenkiste, Scott Stevens, Latanya Sweeney, Katia Sycara, Robert Thibedeau, Howard Wactlar, Alex Waibel

Emerging Technologies with Privacy Concerns
1. Face recognition; biometrics (DNA, fingerprints, iris, gait)
2. Video surveillance; ubiquitous networks (sensors)
3. Semantic web, "data mining," bio-terrorism surveillance
4. Professional assistants (email and scheduling); lifelog recording
5. E911 cell phones, IR tags, GPS
6. Personal robots, intelligent spaces, CareMedia
7. Peer-to-peer sharing, spam blockers, instant messaging
8. Tutoring systems, classroom recording, cheating detectors
9. DNA sequences, genomic data, pharmacogenomics

Ubiquitous Data Sharing: Benefits and Concerns
3. Semantic web, "data mining," bio-terrorism surveillance
Benefits:
- Counter-terrorism surveillance may improve safety.
- Bio-terrorism surveillance can save lives by early detection of a biological agent and of naturally occurring outbreaks.
- The semantic web enables more powerful computer uses.
Privacy concerns:
- Erosion of civil liberties
- Illegal search, as in law-enforcement "mining" cases
- Patient privacy may render healthcare less effective.
- Access to uncontrolled and unprecedented amounts of data
- Collected data can be used for other government purposes

1. Concerns with Data Mining
A. Video, wiretapping, and surveillance
B. Civil liberties, illegal search
C. Medical privacy
D. Employment, workplace privacy
E. Educational records privacy
F. Copyright law
Here "data mining" means ubiquitous data sharing: an increased demand for person-specific data to realize the potential benefits of algorithms.

Definition: Privacy
Privacy reflects the ability of a person, organization, government, or entity to control its own space, where the concept of space (or "privacy space") takes on different contexts:
• Physical space, against invasion
• Bodily space, medical consent
• Computer space, spam
• Web browsing space, Internet privacy

Definition: Data Privacy
When the privacy space refers to the fragments of data one leaves behind while moving through daily life, the notion of privacy is called data privacy.
• No control or ownership
• Historically dictated by policy and laws
• Today's technically empowered society overtaxes that past approach

Address 4 Questions: 2. How real are those concerns?

Exponential Growth in Data Collected
[Figure: two charts spanning 1983-2003, one showing growth in active web servers (millions) and one showing growth in available disk storage (MB per person); a timeline marks the first WWW conference in 1993.]

Linking to Re-identify Data
Medical data (ethnicity, visit date, diagnosis, procedure, medication, total charge) and a voter list (name, address, date registered, party affiliation, date last voted) share the attributes {ZIP, birth date, sex}, which link the two.
L. Sweeney. Weaving technology and policy together to maintain confidentiality. Journal of Law, Medicine and Ethics. 1997, 25:98-110.
{date of birth, gender, 5-digit ZIP} uniquely identifies 87.1% of the USA population. (A counting sketch of this style of uniqueness measurement appears at the end of this part.)

Address 4 Questions: 3. What the data mining community is doing

Data Privacy: De-identification and Privacy Rights

Address 4 Questions: 4. What more needs to be done

What More Needs to Be Done
Our approach: the Privacy Technology Center proactively constructs privacy technology with provable guarantees of privacy protection, while allowing society to collect and share person-specific information for many worthy purposes.
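The 87.1% figure comes from counting how many people are unique on the quasi-identifier in census-derived data. Here is a minimal sketch in Python of that style of measurement on a synthetic population (the population model and sizes are invented, so the resulting percentage is illustrative only).

```python
# Minimal sketch (synthetic population): estimate how much of a population
# falls in a bin of size 1 on the quasi-identifier {birth date, sex, ZIP}.
import random
from collections import Counter

random.seed(0)

def random_person():
    return (
        random.randint(1, 365 * 60),   # birth date, as a day index
        random.choice("MF"),           # sex
        random.randrange(200),         # one of 200 ZIP codes in the region
    )

population = [random_person() for _ in range(10_000)]
bins = Counter(population)             # binsize for each combination
unique = sum(1 for p in population if bins[p] == 1)
print(f"{unique / len(population):.1%} of this population is unique "
      "on (birth date, sex, ZIP)")
```

The real figure depends on how birth dates, sexes, and ZIP codes are actually distributed; the computation, counting binsizes of quasi-identifier combinations, is the same.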
Some Privacy Technology Solutions
- Face de-identification
- Self-controlling data
- Video abstraction
- CertBox ("privacy appliance")
- Reasonable cause ("selective revelation")
- Distributed surveillance
- Privacy and context awareness ("eWallet")
- Data valuation by simulation
- Roster collocation networks
- Video and sound opt-out
- Text anonymizer
- Privacy agent
- Blocking devices
- Point location query restriction

k-Same Face De-identification
Privacy compliance: no matter how good face recognition software may become, it will not be able to reliably re-identify k-Same'd faces.
Warranty: the resulting data remain useful for identifying suspicious behavior and basic characteristics. (A toy sketch of the k-Same averaging step appears at the end of this part.)
E. Newton, L. Sweeney, and B. Malin. Preserving Privacy by De-identifying Facial Images. Carnegie Mellon University, School of Computer Science, Technical Report CMU-CS-03-119. Pittsburgh: 2003. http://privacy.cs.cmu.edu/people/sweeney/video.html

[Figure: examples of k-Same-Pixel and k-Same-Eigen faces for k = 2, 3, 5, 10, 50, 100.]
[Figure: performance of the k-Same algorithm for varying k; percent correct recognition at top rank falls with k for both k-Same-Pixel and k-Same-Eigen, staying below the expected upper bound on recognition performance of 1/k.]

Some Attempts that Don't Work!
Single bar mask, T-mask, black blob, mouth only, grayscale/black & white, ordinal data, threshold pixelation, negative grayscale/black & white, random grayscale/black & white, "Mr. Potato Head."

Legal Flow of Medical Data for Surveillance
[Diagram: under HIPAA and public health law, hospitals, labs, and physician offices send data explicitly identified by name, etc. to public health; surveillance systems instead receive scientifically de-identified data ("no" risk) through a "privacy wall" generated in real time by a CertBox.]
Data are de-identified automatically by a tamper-resistant system specific to the data and the task, called a "CertBox."

Risk of Re-identification
[Diagram: a record "9/1960 F 37213" in a sample from the bio-surveillance datastream is traced back to patient "Ann."]
A re-identification results when a record in a sample from the bio-surveillance datastream can reasonably be related to the patient who is the subject of the record, in such a way that direct and rather specific communication with the patient is possible.

Measuring Identifiability
Identifiability estimates, in graduated sized groupings, the number of people to whom a released record is apt to refer. These groupings are called binsizes.
[Diagram: in a small population, a released record matching only Jim has a binsize of 1 (only one person is green with that head shape); a record matching both Gil and Hal has a binsize of 2 (two people are gray with that head shape).]

Risk Assessment Server
The Risk Assessment Server identifies which fields and/or records in the bio-surveillance datastream are vulnerable to known re-identification inference strategies. Its assessment engine draws on inferences, population models, and computation models. The output is a report on the identifiability of the bio-surveillance datastream (not just the sample) with respect to those inference strategies.
[Figure: binsize profile of databases; cumulative percentage of patients versus binsize.]
The Risk Assessment Server is licensed to Computer Information Technology Corp. (CIT). Diagram is courtesy of CIT. All rights reserved.
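As a toy illustration of the k-Same idea, the sketch below (Python with NumPy; the "faces" are invented 4-pixel vectors) replaces each face with the average of its k closest faces, so any released face corresponds to at least k originals. The published algorithm adds bookkeeping this sketch omits, e.g., removing selected faces so that each released surrogate stands for k distinct originals.

```python
# Minimal sketch (toy data) of the k-Same averaging step: each face is
# replaced by the mean of its k nearest faces, capping recognition at
# roughly 1/k.
import numpy as np

def k_same(faces, k):
    faces = np.asarray(faces, dtype=float)
    out = np.empty_like(faces)
    for i, f in enumerate(faces):
        # Distances from face i to every face (including itself).
        d = np.linalg.norm(faces - f, axis=1)
        nearest = np.argsort(d)[:k]           # indices of the k closest
        out[i] = faces[nearest].mean(axis=0)  # the released surrogate
    return out

# Six "faces" as 4-pixel vectors; with k=3 each output averages 3 inputs.
faces = np.array([[1, 2, 3, 4], [1, 2, 3, 5], [9, 8, 7, 6],
                  [9, 8, 7, 5], [2, 2, 3, 4], [8, 8, 7, 6]])
print(k_same(faces, k=3))
```

Averaging in pixel space corresponds to k-Same-Pixel; doing the same in an eigenface coordinate system gives k-Same-Eigen.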
CertBox Contains PrivaCert™
Raw data pass through PrivaCert™, a rule-based system custom to the data and the assessment, and emerge scientifically de-identified.

Reasonable Cause ("Selective Revelation")
The identifiability of the revealed data (Datafly; identifiability 0..1) escalates with detection status (0..1), starting from a gross overview at baseline:
- Normal operation: sufficiently anonymous
- Unusual activity: sufficiently de-identified
- Suspicious activity: identifiable
- Outbreak suspected: readily identifiable
- Outbreak detected: explicitly identified

Address 4 Questions: the short answers (recap)
1. Concerns with data mining: the demand for person-specific data
2. How real are those concerns: an explosion in collected information; the individual bears the risks and harms
3. What the data mining community is doing: privacy-preserving data mining, which is too limited
4. What more needs to be done: construct technology with provable guarantees of privacy protection ("privacy technology")

Perceived Concerns
• Data mining lets you find out about my private life.
  - I don't want you, my insurance company, or the government knowing everything.
• Data mining doesn't always get it right.
  - I don't want to be put in jail because data mining said so.
  - I don't want to be denied credit, a job, or insurance because data mining said so.

Perceived Concerns
• Data mining lets you find out about my private life.
  - Learned models allow conjectures.
  - Learning the model requires collecting data.
• Data mining doesn't always get it right.
  - Our legal system is supposed to ensure due process.
  - Data mining typically allows businesses to take risks they otherwise wouldn't.

Perceived Concerns
• Data mining lets you find out about my private life.
  - Privacy-preserving data mining.
• Data mining doesn't always get it right.
  - We know it; educate the user.
  - We're working on it.

Privacy-Preserving Data Mining: Data Perturbation
• Construct a data set with noise added.
  - It can be released without revealing private data.
• Miners are given the perturbed data set.
  - They reconstruct the distribution to improve results. (A minimal sketch follows this part.)
• Solutions are out there.
  - Decision trees, association rules.
• Debate: does it really preserve privacy?
  - Can we prove the impossibility of noise removal?

Privacy-Preserving Data Mining: Distributed Data Mining
• Data owners keep their data.
  - They collaborate to get data mining results.
• Encryption techniques preserve privacy.
  - Proofs that private data are not disclosed.
• Solutions exist for decision trees, association rules, and clustering.
  - Different solutions are needed depending on how the data is distributed and on the privacy constraints.

What Next?
• Data mining lets you find out about my private life.
  - Constraints that allow us to restrict what models can be learned.
• Data mining doesn't always get it right.
  - Educate the public about what data mining does (and doesn't do).
  - And, of course, more research.
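As a flavor of the data perturbation approach, here is a minimal sketch in Python (synthetic data; the noise level is illustrative). Each value is released only after zero-mean noise is added, yet aggregate statistics remain recoverable because the noise distribution is public. Published reconstruction methods for decision trees and association rules recover the full distribution, not just the moments shown here.

```python
# Minimal sketch (synthetic data) of perturbation: each person reports
# age + random noise; the analyst recovers aggregates because the noise
# distribution is known, but any single reported value stays inaccurate.
import random

random.seed(1)
NOISE_SD = 15.0

ages = [random.gauss(35, 10) for _ in range(100_000)]      # private values
reported = [a + random.gauss(0, NOISE_SD) for a in ages]   # released values

n = len(reported)
mean = sum(reported) / n
var = sum((r - mean) ** 2 for r in reported) / n

print(f"estimated mean age: {mean:.2f}")               # noise is zero-mean
print(f"estimated variance: {var - NOISE_SD**2:.1f}")  # subtract noise variance
# Each individual's reported age is off by about +/- 15 years, which is
# where the privacy protection (and the debate about its strength) lies.
```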
Some Thoughts About Privacy
Jeffrey D. Ullman
KDD, Aug. 25, 2003

Our Treatment of Privacy Is Pretty Weird
We allow spammers and cold-callers to intrude without mercy. Yet Amazon wouldn't tell me the status of my son's order. And Congress killed the only system that has a hope of protecting us against mass murder by terrorists.

TIA: City Walls of Today
5000 years ago, stone walls protected advanced civilizations from marauders. I doubt the first attempts were perfect (did they forget doors?), and there was a downside, e.g., restricted movement. Likewise, TIA may be the only way to keep terrorists at bay.

What the "Antis" Forget
There is a great difference between an inanimate machine knowing your secrets and a person knowing the same. Political solutions can control how and why information goes from the machine to trusted analysts who can act on the knowledge.

Analogy
Through 200 years of tradition, it has become safe to put M16s in the hands of soldiers, who do not use them to rob liquor stores. Likewise, we need a cadre of trusted analysts whose job is to protect, not to intrude on the innocent.

Technology Thoughts
TIA is not about machine learning -- we don't have positive examples. TIA is an advanced form of data mining, where long connections are sought in massive data, e.g., multiple connections between "Al Qaida" and "flight schools."

Technology Thoughts (2)
Possible boost: "locality-sensitive hashing" (Gionis, Indyk, and Motwani), a powerful technique for focusing on low-frequency, high-correlation events. It needs generalization to graphs that represent various forms of connection. (A minhash-style sketch follows.)
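As one concrete instance of locality-sensitive hashing, the sketch below (Python; the entities and connection sets are invented) uses minhashing: entities whose connection sets overlap heavily collide in many hash buckets, so highly correlated pairs surface without comparing all pairs. Generalizing this from sets to graphs of connections is the open problem noted above.

```python
# Minimal sketch (illustrative data) of LSH via minhashing: the chance
# that two sets share a min-hash value equals their Jaccard similarity,
# so correlated pairs collide often across independent hash functions.
import random
from collections import defaultdict

random.seed(2)
UNIVERSE = 10_000
NUM_HASHES = 20

# Each entity is represented by the set of things it is connected to.
entities = {
    "A": set(random.sample(range(UNIVERSE), 50)),
    "C": set(random.sample(range(UNIVERSE), 60)),
}
# "B" shares most of A's connections, so the pair (A, B) is correlated.
entities["B"] = set(list(entities["A"])[:40]) | set(random.sample(range(UNIVERSE), 10))

# One min-hash per salt: the minimum of a salted hash over the set,
# which simulates taking the first element under a random permutation.
salts = [random.getrandbits(32) for _ in range(NUM_HASHES)]
def signature(s):
    return tuple(min(hash((salt, x)) for x in s) for salt in salts)

buckets = defaultdict(list)
for name, conns in entities.items():
    for i, h in enumerate(signature(conns)):
        buckets[(i, h)].append(name)

collisions = defaultdict(int)
for names in buckets.values():
    for a in names:
        for b in names:
            if a < b:
                collisions[(a, b)] += 1
print(dict(collisions))  # (A, B) collides in many buckets; (A, C) in almost none
```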