Privacy and Data Mining:
Friends or Foes?
Rakesh Agrawal
IBM Almaden Research Center
Theme
DILEMMA
• Applications abound where data mining can do enormous
good, but it is vulnerable to misuse in misguided hands
GOAL
• Understand the concerns with data mining and identify
research directions that may address those concerns
QUESTIONS
• Perceived concerns with data mining
• How real are those concerns
• What the data mining community is doing to address the concerns
• What more needs to be done
Panelists

James Dempsey, Center for Democracy & Technology

Daniel Gallington, Potomac Institute

Lawrence Cox, National Center for Health Statistics

Bhavani Thuraisingham, National Science Foundation

Latanya Sweeney, Carnegie Mellon University

Christopher Clifton, Purdue University

Jeff Ullman, Stanford University
Plan

Position statements -- 6 minutes each

Rejoinders -- 2 minutes each

Questions and observations from the floor

Closing statements -- 1 minute each
The Potomac Institute for Policy
Studies
Privacy and Data Mining
KDD 2003
August 25, 2003
Daniel J. Gallington
New Information Technology and
Privacy: Status of the Debate
•Demonization of Science
•Technology development vs. policy/legal “envelope”
•Rules vs. Process
•Enablement vs. Disablement
•Secrecy
•When the dust settles: what could work?
Data Mining and Privacy:
Friends or Foes?
Dr. Bhavani Thuraisingham
The National Science Foundation
August 2003
Definitions
• Data Mining
- Data mining is the process of a user analyzing large amounts of
data using techniques from statistical reasoning and machine
learning and discovering information, often previously unknown
• Data fusion
- The process of associating records from two (or more)
databases, e.g., Medical Records and Grocery Store purchases
• Privacy Problem
- User U poses queries and deduces information from the
responses that U is authorized to see; U is not authorized to see
the deduced information about an individual or a group of
individuals G deemed private by either G or some authority
Some Data Mining Applications
• Medical and Healthcare
- Mining genetic and medical databases and finding links between
genetic composition and diseases
• Security
- Analyzing travel records, spending patterns, associations
between people and determining potential terrorists
- Examining audit data and determining unauthorized network
intrusions
- Mining credit card transactions, telephone calls and other
related data and detecting fraud and identity theft
• Marketing, Sales, and Finance
- Understanding preferences of groups of consumers
Some Privacy Concerns
• Medical and Healthcare
- Employers, marketers, or others knowing of private medical
concerns
• Security
- Allowing access to an individual’s travel and spending data
- Allowing access to web surfing behavior
• Marketing, Sales, and Finance
- Allowing access to an individual’s purchases
Data Mining as a Threat to Privacy
• Data mining gives us “facts” that are not obvious to human analysts
of the data
• Can general trends across individuals be determined without
revealing information about individuals?
• Possible threats:
- Combine collections of data and infer information that is private
  • Disease information from prescription data
  • Military action from pizza deliveries to the Pentagon
• Need to protect the associations and correlations between the data
that are sensitive or private
Some Privacy Problems and Potential Solutions
• Problem: Privacy violations that result from data mining
- Potential solution: Privacy-preserving data mining
• Problem: Privacy violations that result from the inference problem
- Inference is the process of deducing sensitive information from
the legitimate responses received to user queries
- Potential solution: Privacy Constraint Processing
• Problem: Privacy violations due to unencrypted data
- Potential solution: Encryption at different levels
• Problem: Privacy violations due to poor system design
- Potential solution: Develop a methodology for designing
privacy-enhanced systems
Some Research Directions:
Privacy Preserving Data Mining
• Prevent useful results from mining
- Introduce “cover stories” to give “false” results
- Only make a sample of data available so that an adversary is
unable to come up with useful rules and predictive functions
• Randomization
- Introduce random values into the data and/or results
- Challenge is to introduce random values without significantly
affecting the data mining results
- Give range of values for results instead of exact values
• Secure Multi-party Computation
- Each party knows its own inputs; encryption techniques used to
compute final results
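As one illustration of the randomization bullet, here is a minimal sketch of randomized response, a classic way to introduce random values into individual answers while keeping an aggregate estimable. The function names, the 0.75 truth probability, and the synthetic 30% population are assumptions for illustration, not part of the talk.

```python
import random
from typing import List


def randomized_response(truth: bool, rng: random.Random, p_truth: float = 0.75) -> bool:
    """Report the true answer with probability p_truth, otherwise a fair coin flip."""
    if rng.random() < p_truth:
        return truth
    return rng.random() < 0.5


def estimate_true_rate(reports: List[bool], p_truth: float = 0.75) -> float:
    """Invert the randomization: observed = p_truth * true + (1 - p_truth) * 0.5."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth


rng = random.Random(0)
true_answers = [i < 3000 for i in range(10000)]  # synthetic data: 30% "yes"
reports = [randomized_response(t, rng) for t in true_answers]
estimate = estimate_true_rate(reports)
print(round(estimate, 3))
```

No single report reveals an individual's true answer with certainty, yet the population rate can still be estimated; the trade-off named in the slide shows up as extra variance in that estimate.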
Some Research Directions:
Privacy Constraint Processing
• Privacy constraint processing
- Based on prior research in security constraint processing
- Simple constraint: an attribute of a document is private
- Content-based constraint: if a document contains information
about X, then it is private
- Association-based constraint: two or more documents taken
together are private; individually each document is public
- Release constraint: after X is released, Y becomes private
• Augment a database system with a privacy controller for constraint
processing
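The four constraint types can be sketched as a small privacy controller that sits in front of document release. This is a toy under stated assumptions: the class name, the document layout (`id`, `topics`, `attrs`), and the single shared release history are hypothetical, and a real controller would track releases per user.

```python
class PrivacyController:
    """Toy controller that checks the four constraint types before release."""

    def __init__(self, private_attrs, private_topics, private_groups, release_rules):
        self.private_attrs = set(private_attrs)                        # simple constraints
        self.private_topics = set(private_topics)                      # content-based
        self.private_groups = [frozenset(g) for g in private_groups]   # association-based
        self.release_rules = dict(release_rules)                       # release: X -> Y
        self.released = set()
        self.blocked = set()

    def request(self, doc):
        """Return the releasable attributes of doc, or None if release is denied."""
        doc_id = doc["id"]
        if doc_id in self.blocked:
            return None  # became private after an earlier release
        if self.private_topics & set(doc["topics"]):
            return None  # content-based constraint
        would_hold = self.released | {doc_id}
        if any(group <= would_hold for group in self.private_groups):
            return None  # association-based constraint
        self.released.add(doc_id)
        if doc_id in self.release_rules:
            self.blocked.add(self.release_rules[doc_id])
        # Simple constraint: strip private attributes from what is released.
        return {k: v for k, v in doc["attrs"].items() if k not in self.private_attrs}


controller = PrivacyController(
    private_attrs={"ssn"},
    private_topics={"diagnosis"},
    private_groups=[{"A", "B"}],
    release_rules={"D": "E"},
)
doc_a = {"id": "A", "topics": ["route"], "attrs": {"title": "Route", "ssn": "000"}}
doc_b = {"id": "B", "topics": ["schedule"], "attrs": {"title": "Schedule"}}
doc_c = {"id": "C", "topics": ["diagnosis"], "attrs": {"title": "Chart"}}
doc_d = {"id": "D", "topics": ["memo"], "attrs": {"title": "Memo"}}
doc_e = {"id": "E", "topics": ["memo"], "attrs": {"title": "Follow-up"}}

results = [controller.request(d) for d in (doc_a, doc_b, doc_c, doc_d, doc_e)]
print(results)
```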
Some Research Directions:
Encryption for Privacy
• Encryption at various levels
- Encrypting the data as well as the results of data mining
- Encryption for multi-party computation
• Encryption for untrusted third party publishing
- Owner enforces privacy policies
- Publisher gives the user only those portions of the document
he/she is authorized to access
- Combination of digital signatures and Merkle hash to ensure
privacy
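The Merkle hash idea can be sketched as follows: the owner hashes document portions into a tree and signs only the root, and the publisher hands the user one portion plus the sibling hashes needed to recompute that root. This is a simplified sketch, assuming SHA-256 and duplication of the last node on odd levels; the signature over the root is omitted, and all function names are illustrative.

```python
import hashlib


def sha(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_levels(leaves):
    """Return every level of the tree, from leaf hashes up to the root."""
    level = [sha(leaf) for leaf in leaves]
    levels = []
    while True:
        if len(level) > 1 and len(level) % 2 == 1:
            level = level + [level[-1]]  # pad odd levels by repeating the last hash
        levels.append(level)
        if len(level) == 1:
            break
        level = [sha(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return levels


def proof_for(levels, index):
    """Collect (sibling_hash, sibling_is_left) pairs from leaf to root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify(leaf, proof, root):
    """Recompute the root from one portion and its proof."""
    node = sha(leaf)
    for sibling, sibling_is_left in proof:
        node = sha(sibling + node) if sibling_is_left else sha(node + sibling)
    return node == root


portions = [b"intro", b"billing", b"schedule"]  # hypothetical document portions
levels = merkle_levels(portions)
root = levels[-1][0]            # the value the owner would sign
proof = proof_for(levels, 1)    # what the publisher sends along with portion 1
print(verify(b"billing", proof, root), verify(b"tampered", proof, root))
```

The user learns only hashes of the portions that were withheld, while still being able to check that the released portion belongs to the owner's document.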
Some Research Directions:
Methodology for Designing Privacy Systems
• Jointly develop privacy policies with policy specialists
• Specification language for privacy policies
• Generate privacy constraints from the policy and check for
consistency of constraints
• Develop a privacy model
• Privacy architecture that identifies privacy critical components
• Design and develop privacy enforcement algorithms
• Verification and validation
Data Mining and Privacy: Friends or Foes?
• They are neither friends nor foes
• Need advances in both data mining and privacy
• Need to design flexible systems
- For some applications one may have to focus entirely on “pure”
data mining while for some others there may be a need for
“privacy-preserving” data mining
- Need flexible data mining techniques that can adapt to
changing environments
• Technologists, legal specialists, social scientists, policy makers and
privacy advocates MUST work together
Some NSF Projects addressing Privacy
• Privacy-preserving data mining
- Distributed data mining techniques to replicate or approximate
the results of centralized data mining, with quantifiable limits on
the disclosure of data from each
• Privacy for Supply Chain Management
- Secure Supply-Chain Collaboration protocols to enable
supply-chain partners to cooperatively achieve desired system-wide
goals without revealing any private information, even though the
jointly-computed decisions may depend on the private
information of all the parties
• Privacy Model
- Model for privacy based on secure query protocol, encryption
and database organization with little trust on the client or server
Other Ideas and Directions?
• Please contact
- Dr. Bhavani Thuraisingham
The National Science Foundation
Suite 1115
4201 Wilson Blvd
Arlington, VA 22230
Phone: 703-292-8930
Fax 703-292-9037
email: bthurais@nsf.gov
Technologies for Privacy
Latanya Sweeney, Ph.D.
Assistant Professor of Computer Science, Technology and Policy
School of Computer Science
Carnegie Mellon University
latanya @ privacy.cs.cmu.edu
http://privacy.cs.cmu.edu/people/sweeney/index.html
Address 4 Questions
1. Concerns with data mining
2. How real are those concerns
3. What the data mining community is doing
to address those concerns
4. What more needs to be done
L. Sweeney. Navigating Computer Science Research Through Waves of
Privacy Concerns. 2003. http://privacy.cs.cmu.edu/index.html
Address 4 Questions
1. Concerns with data mining:
demand for person-specific data
2. How real are those concerns:
explosion in collected information
individual bears risks and harms
3. What the data mining community is doing:
privacy-preserving data mining too limited
4. What more needs to be done:
construct technology with provable guarantees
of privacy protection: privacy technology
Privacy Technology Center
Core People
Anastassia Ailamaki
Chris Atkeson
Guy Blelloch
Manuel Blum
Jamie Callan
Jaime Carbonell
Kathleen Carley
Robert Collins
Lorrie Cranor
Samuel Edoho-Eket
Maxine Eskenazi
Scott Fahlman
David Farber
David Garlan
Ralph Gross
Alex Hauptmann
Takeo Kanade
Bradley Malin
Bruce Maggs
Tom Mitchell
Norman Sadeh
William Scherlis
Jeff Schneider
Henry Schneiderman
Michael Shamos
Mel Siegel
Daniel Siewiorek
Asim Smailagic
Peter Steenkiste
Scott Stevens
Latanya Sweeney
Katia Sycara
Robert Thibadeau
Howard Wactlar
Alex Waibel
Emerging Technologies
with Privacy Concerns
1. Face recognition, Biometrics (DNA, fingerprints, iris, gait)
2. Video Surveillance, Ubiquitous Networks (Sensors)
3. Semantic Web, “Data Mining,” Bio-Terrorism Surveillance
4. Professional Assistants (email and scheduling),
Lifelog recording
5. E911 Cell Phones, IR Tags, GPS
6. Personal Robots, Intelligent Spaces, CareMedia
7. Peer to peer Sharing, Spam Blockers, Instant Messaging
8. Tutoring Systems, Classroom Recording,
Cheating Detectors
9. DNA sequences, Genomic data, Pharmaco-genomics
Ubiquitous Data Sharing
Benefits and Concerns
3. Semantic Web, “Data Mining,” Bio-Terrorism Surveillance
Benefits:
- Counter terrorism surveillance may improve safety.
- Bio-Terrorism surveillance can save lives by early
detection of a biological agent and naturally occurring
outbreaks.
- Semantic web enables more powerful computer uses
Privacy concerns:
- Erosion of civil liberties
- Illegal search from law-enforcement “mining” cases
- Patient privacy may render healthcare less effective.
- Access to uncontrolled and unprecedented amounts of data
- Collected data can be used for other government purposes
1. Concerns with Data Mining
A. Video, wiretapping and surveillance
B. Civil liberties, illegal search
C. Medical privacy
D. Employment, workplace privacy
E. Educational records privacy
F. Copyright law
“data mining”: ubiquitous data sharing, increased demand
for person-specific data to realize potential
benefits from algorithms
Definition. Privacy
Privacy reflects the ability of a person,
organization, government, or entity to control its
own space, where the concept of space (or
“privacy space”) takes on different contexts
•Physical space, against invasion
•Bodily space, medical consent
•Computer space, spam
•Web browsing space, Internet privacy
Definition. Data Privacy
When privacy space refers to the fragments of
data one leaves behind as a person moves
through daily life, the notion of privacy is called
data privacy.
• No control or ownership
• Historically dictated by policy and laws
• Today’s technically empowered society
overtaxes past approaches
Exponential Growth in Data Collected
[Figure: two charts plotted by year, 1983 to 2003. Growth in active web servers (servers, in millions; axis 0 to 35). Growth in available disk storage (GDSP, MB/person; axis 0 to 500). Year markers: 1991, 1993, 1996, 2001; label: “First WWW conference”.]
Linking to Re-identify Data
Medical Data: Ethnicity, Visit date, Diagnosis, Procedure, Medication, Total charge
Voter List: Name, Address, Date registered, Party affiliation, Date last voted
Shared by both: ZIP, Birth date, Sex
L. Sweeney. Weaving technology and policy together to maintain confidentiality. Journal of
Law, Medicine and Ethics. 1997, 25:98-110.
{date of birth, gender, 5-digit ZIP}
uniquely identifies 87.1% of the U.S. population.
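The linking step can be sketched as a join on the fields the two data sets share. The records below are invented for illustration, and the function and field names are assumptions; only the choice of linking fields (birth date, sex, ZIP) comes from the slide.

```python
QUASI_IDENTIFIERS = ("birth_date", "sex", "zip")

# Invented toy records: the medical data carries no names.
medical = [
    {"birth_date": "1960-09-14", "sex": "F", "zip": "37213", "diagnosis": "asthma"},
    {"birth_date": "1971-02-03", "sex": "M", "zip": "37213", "diagnosis": "fracture"},
]
voters = [
    {"name": "Ann Example", "birth_date": "1960-09-14", "sex": "F", "zip": "37213"},
    {"name": "Bob Sample", "birth_date": "1971-02-03", "sex": "M", "zip": "37213"},
    {"name": "Rob Sample", "birth_date": "1971-02-03", "sex": "M", "zip": "37213"},
]


def link(medical_rows, voter_rows):
    """Re-identify medical rows whose linking fields match exactly one voter."""
    index = {}
    for voter in voter_rows:
        key = tuple(voter[q] for q in QUASI_IDENTIFIERS)
        index.setdefault(key, []).append(voter)
    matches = []
    for row in medical_rows:
        candidates = index.get(tuple(row[q] for q in QUASI_IDENTIFIERS), [])
        if len(candidates) == 1:  # a unique match names the patient
            matches.append((candidates[0]["name"], row["diagnosis"]))
    return matches


matches = link(medical, voters)
print(matches)
```

The second medical row stays ambiguous because two voters share its linking fields, which is the intuition behind measuring how many people a released record could refer to.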
Data Privacy:
De-identification and Privacy Rights
What More Needs to Be Done
Our approach.
The Privacy Technology Center proactively
constructs privacy technology with provable
guarantees of privacy protection while
allowing society to collect and share person-specific
information for many worthy purposes.
Some Privacy Technology Solutions
- Face de-identification
- Self-controlling data
- Video abstraction
- CertBox (“privacy appliance”)
- Reasonable cause (“selective revelation”)
- Distributed surveillance
- Privacy and context awareness (“eWallet”)
- Data valuation by simulation
- Roster collocation networks
- Video and sound opt-out
- Text anonymizer
- Privacy agent
- Blocking devices
- Point location query restriction
k-Same Face De-identification
Privacy Compliance:
No matter how good face recognition software
may become, it will not be able to reliably
re-identify k-Same’d faces.
Warranty:
The resulting data remain useful for identifying
suspicious behavior and identifying basic
characteristics.
E. Newton, L. Sweeney, and B. Malin. Preserving Privacy by De-identifying Facial Images.
Carnegie Mellon University, School of Computer Science, Technical Report, CMU-CS-03-119.
Pittsburgh: 2003. http://privacy.cs.cmu.edu/people/sweeney/video.html
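A simplified sketch of the averaging idea behind k-Same, assuming faces are already aligned and flattened into numeric vectors: each face is replaced by the average of a group of at least k similar faces, so every released face corresponds to at least k originals. The function names and toy vectors are illustrative; the technical report cited above is the authority on the actual algorithm.

```python
from collections import Counter


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_same(faces, k):
    """Replace each face vector by the average of a group of >= k similar faces."""
    if len(faces) < k:
        raise ValueError("need at least k faces")
    remaining = list(range(len(faces)))
    result = [None] * len(faces)
    while remaining:
        if len(remaining) < 2 * k:
            group = remaining[:]  # too few left to form two groups: take them all
        else:
            seed = remaining[0]
            by_closeness = sorted(
                remaining, key=lambda i: squared_distance(faces[seed], faces[i])
            )
            group = by_closeness[:k]
        dims = len(faces[0])
        average = [sum(faces[i][d] for i in group) / len(group) for d in range(dims)]
        for i in group:
            result[i] = average
        remaining = [i for i in remaining if i not in group]
    return result


# Toy 2-dimensional "faces"; real inputs would be pixel or eigenface vectors.
faces = [[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11], [50, 50]]
deidentified = k_same(faces, k=3)
group_sizes = Counter(tuple(face) for face in deidentified)
print(sorted(group_sizes.values()))
```

Because each output vector is shared by at least k inputs, a recognizer matching an output back to one original can do no better than a 1-in-k guess within that group.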
Example of k-Same Faces for Varying k
[Figure: example faces de-identified with k-Same-Pixel and k-Same-Eigen for k = 2, 3, 5, 10, 50, 100.]
Performance of k-Same Algorithm
for varying values of k
[Figure: percent correct (top rank, 0 to 1) versus k (0 to 100) for Expected[k-Same], k-Same-Pixel, and k-Same-Eigen. Upper bound on recognition performance = 1/k.]
Some Attempts that Don’t Work!
• Single Bar Mask
• T-Mask
• Black Blob
• Mouth Only
  - Grayscale
  - Black & White
• Ordinal Data
• Threshold
• Pixelation
• Negative
  - Grayscale
  - Black & White
• Random
  - Grayscale
  - Black & White
• Mr. Potato Head
Legal Flow of Medical Data for Surveillance
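[Diagram: medical data originates at hospitals, labs, and physician offices. Labels on the flows: HIPAA; Public Health Law. Destinations: Public Health (data explicitly identified by name, etc.) and Surveillance Systems (scientifically de-identified data, “No” risk!).]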
De-identified Data through a “Privacy Wall”
Generated in Real-Time by a “CertBox”
[Diagram labels: Explicitly identified by name, etc.; Scientifically de-identified; Public Health.]
Data de-identified automatically by a
tamper-resistant system specific to the
data and the task. Called a “CertBox.”
Risk of Re-identification
[Diagram: the patient “Ann” (born 9/1960) appears in the data sent to Public Health as the record “9/1960 F 37213”.]
A re-identification results when a record in a sample from the
Bio-Surveillance Datastream can reasonably be related to the patient who
is the subject of the record in such a way that direct and rather
specific communication with the patient is possible.
Measuring Identifiability
[Diagram: a release matched against a population of named individuals (Jim, Ken, Len, Mel, Gil, Hal).]
Binsize of 1: only 1 person is green with that shape head.
Binsize of 2: 2 people are gray with that shape head.
Identifiability estimates, in graduated sized groupings,
the number of people to which a released record is apt
to refer. These groupings are called binsizes.
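A minimal sketch of how a binsize could be computed, assuming the population is available as records and the released fields are known. The field names (`color`, `head`) echo the slide's picture; the assignment of attributes to the named individuals is invented for illustration.

```python
from collections import Counter

# Invented attribute assignments for the people named on the slide.
population = [
    {"name": "Jim", "color": "green", "head": "round"},
    {"name": "Gil", "color": "gray", "head": "square"},
    {"name": "Hal", "color": "gray", "head": "square"},
    {"name": "Ken", "color": "gray", "head": "oval"},
    {"name": "Len", "color": "gray", "head": "oval"},
    {"name": "Mel", "color": "gray", "head": "oval"},
]
release = [
    {"color": "green", "head": "round"},
    {"color": "gray", "head": "square"},
]


def binsizes(released_rows, population_rows, fields):
    """For each released row, count the people in the population it could refer to."""
    def key(row):
        return tuple(row[f] for f in fields)

    counts = Counter(key(person) for person in population_rows)
    return [counts[key(row)] for row in released_rows]


sizes = binsizes(release, population, fields=("color", "head"))
print(sizes)
```

A binsize of 1 means the released record points at exactly one person; larger bins mean more people are indistinguishable on the released fields.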
Risk Assessment Server
[Diagram labels: Sample from Bio-Surveillance Datastream; Inferences; Population Models; Assessment Engine (computation models); Profile of Databases. Chart: cumulative percentage of patients (0% to 100%) versus binsize (1 to 81).]
The Risk Assessment Server identifies which fields and/or records in
the Bio-Surveillance Datastream are vulnerable to known
re-identification inference strategies. The output of the assessment server
is a report on the identifiability of the Bio-Surveillance Datastream
(not just the sample) with respect to those inference strategies.
The Risk Assessment Server is licensed to Computer Information Technology Corp.
(CIT). Diagram is courtesy of CIT. All rights reserved.
CertBox Contains PrivaCert™
[Diagram: raw data enters PrivaCert™, a rule-based system custom to the data assessment, and leaves scientifically de-identified.]
Reasonable Cause
(“Selective Revelation”)
Gross overview
Detection status (0..1) and the corresponding Datafly identifiability (0..1):
Normal operation: sufficiently anonymous
Unusual activity: sufficiently de-identified
Suspicious activity: identifiable
Outbreak suspected: readily identifiable
Outbreak detected: explicitly identified
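The pairing of detection status with permitted identifiability can be sketched as a lookup on a 0..1 score. The level names come from the slide; the equal-width thresholds and the function name are assumptions made only for illustration.

```python
# (detection status, permitted identifiability), ordered from least to most revealing.
LEVELS = [
    ("Normal operation", "Sufficiently anonymous"),
    ("Unusual activity", "Sufficiently de-identified"),
    ("Suspicious activity", "Identifiable"),
    ("Outbreak suspected", "Readily identifiable"),
    ("Outbreak detected", "Explicitly identified"),
]


def permitted_level(detection_status: float):
    """Map a detection score in [0, 1] to a level, using equal-width bands (assumed)."""
    if not 0.0 <= detection_status <= 1.0:
        raise ValueError("detection status must be between 0 and 1")
    index = min(int(detection_status * len(LEVELS)), len(LEVELS) - 1)
    return LEVELS[index]


for score in (0.0, 0.5, 1.0):
    print(score, permitted_level(score))
```

The point of the design is that more identifying data is revealed only as evidence of an outbreak accumulates, rather than by default.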
Address 4 Questions
1. Concerns with data mining:
demand for person-specific data
2. How real are those concerns:
explosion in collected information
individual bears risks and harms
3. What the data mining community is doing:
privacy-preserving data mining too limited
4. What more needs to be done:
construct technology with provable guarantees
of privacy protection: privacy technology
Perceived Concerns
• Data mining lets you find out about my
private life
– I don’t want (you, my insurance company, the
government) knowing everything
• Data mining doesn’t always get it right
– I don’t want to be put in jail because data
mining said so
– I don’t want to be denied a (credit, a job,
insurance) because data mining said so
Perceived Concerns
• Data mining lets you find out about my
private life
– Learned models allow conjectures
– Learning the model requires collecting data
• Data mining doesn’t always get it right
– Our legal system is supposed to ensure due
process
– Data mining typically allows businesses to
take risks they otherwise wouldn’t
Perceived Concerns
• Data mining lets you find out about my
private life
– Privacy-preserving data mining
• Data mining doesn’t always get it right
– We know it
• Educate the user
– We’re working on it
Privacy-Preserving Data Mining
Data Perturbation
• Construct a data set with noise added
– Can be released without revealing private
data
• Miners given the perturbed data set
– Reconstruct distribution to improve results
• Solutions out there
– Decision trees, association rules
• Debate: Does it really preserve privacy?
– Can we prove impossibility of noise removal?
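A minimal sketch of the perturbation idea, assuming additive uniform noise with a publicly known range: the miner never sees the original values but can still recover aggregate properties. This recovers only the mean and variance, a much weaker goal than reconstructing the full distribution; the data, the noise range, and the function name are synthetic assumptions.

```python
import random
import statistics


def perturb(values, rng, spread):
    """Release each value plus independent uniform noise on [-spread, spread]."""
    return [v + rng.uniform(-spread, spread) for v in values]


rng = random.Random(42)
ages = [rng.gauss(40, 10) for _ in range(20000)]  # synthetic private values
spread = 30
released = perturb(ages, rng, spread)

# The noise distribution is public, so its variance can be subtracted back out.
noise_variance = (2 * spread) ** 2 / 12
estimated_mean = statistics.mean(released)
estimated_variance = statistics.pvariance(released) - noise_variance

print(round(estimated_mean, 1), round(estimated_variance, 1))
```

The debate in the last bullet applies here too: whether noise of this kind can be filtered out of individual records is a separate question from whether the aggregates survive.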
Privacy-Preserving Data Mining
Distributed Data Mining
• Data owners keep their data
– Collaborate to get data mining results
• Encryption techniques to preserve privacy
– Proofs that private data is not disclosed
• Solutions for Decision Trees, Association
Rules, Clustering
– Different solutions needed depending on how
data is distributed, privacy constraints
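One building block often used in this setting is a secure sum, sketched here as a single simulated pass around a ring of sites. It assumes sites follow the protocol and do not collude; the modulus, the site counts, and the function name are hypothetical.

```python
import secrets


def secure_sum(local_counts, modulus):
    """Simulate a ring pass: site 0 masks its count, every other site adds its own."""
    mask = secrets.randbelow(modulus)           # known only to site 0
    running = (mask + local_counts[0]) % modulus
    messages = [running]                        # what each site passes to the next
    for count in local_counts[1:]:
        running = (running + count) % modulus
        messages.append(running)
    total = (running - mask) % modulus          # site 0 removes its mask at the end
    return total, messages


# Hypothetical local support counts for one itemset at three sites.
site_counts = [120, 45, 310]
total, messages = secure_sum(site_counts, modulus=10**6)
print(total)
```

Each message a site receives is offset by the random mask, so it reveals nothing about the counts of the sites before it as long as neighbours do not pool what they saw. The modulus must exceed the largest possible true total.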
What Next?
• Data mining lets you find out about my
private life
– Constraints that allow us to restrict what
models can be learned
• Data mining doesn’t always get it right
– Educate the public
• What data mining does (and doesn’t do)
– And of course, more research
Some Thoughts About Privacy
Jeffrey D. Ullman
KDD, Aug. 25, 2003
Our Treatment of Privacy is
Pretty Weird
We allow spammers and cold-callers to
intrude without mercy.
Yet Amazon wouldn’t tell me the status
of my son’s order.
And Congress killed the only system
that has a hope of protecting us against
mass murder by terrorists.
TIA: City Walls of Today
5000 years ago, stone walls protected
advanced civilizations from marauders.
I doubt the first attempts were perfect
(did they forget doors?), and there was
a downside, e.g., restricted movement.
Likewise, TIA may be the only way to
keep terrorists at bay.
What The “Antis” Forget
There is a great difference between an
inanimate machine knowing your
secrets and a person knowing the
same.
Political solutions can control how and
why information goes from the machine
to trusted analysts who can act on the
knowledge.
Analogy
From 200 years of tradition, it has
become safe to put M16s in the hands
of soldiers who do not use them to rob
liquor stores.
Likewise, we need a cadre of trusted
analysts whose job is to protect, not to
intrude on the innocent.
Technology Thoughts
TIA is not about machine learning; we don’t have positive examples.
TIA is an advanced form of data mining, where long connections are
sought in massive data.
• e.g., multiple connections between “Al
Qaida” and “flight schools.”
Technology Thoughts (2)
Possible boost: “Locality-Sensitive
Hashing” (Gionis, Indyk, & Motwani).
A powerful technique for focusing on
low-frequency, high-correlation events.
Needs generalization to graphs that
represent various forms of connection.
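To give a feel for locality-sensitive hashing, here is a sketch using MinHash with banding, a common LSH scheme for set similarity. It is a stand-in and not necessarily the construction in the cited work; the record names, item labels, and parameters are invented, and item sets must be non-empty.

```python
import hashlib
from collections import defaultdict


def minhash_signature(items, num_hashes):
    """One minimum per salted hash function; similar sets share many minima."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha256(f"{seed}:{item}".encode()).digest()[:8], "big")
            for item in items
        ))
    return signature


def candidate_pairs(records, bands=10, rows=2):
    """Records that agree on every row of at least one band become candidates."""
    buckets = defaultdict(set)
    for name, items in records.items():
        signature = minhash_signature(items, bands * rows)
        for b in range(bands):
            band = tuple(signature[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(name)
    pairs = set()
    for names in buckets.values():
        ordered = sorted(names)
        pairs.update((a, c) for i, a in enumerate(ordered) for c in ordered[i + 1:])
    return pairs


# Invented connection sets: x and y share all connections, z shares none.
records = {
    "x": {"account:1001", "phone:555-0101", "address:12"},
    "y": {"account:1001", "phone:555-0101", "address:12"},
    "z": {"account:2002", "phone:555-0199", "address:77"},
}
pairs = candidate_pairs(records)
print(pairs)
```

Only records that land in a shared bucket are ever compared, which is what makes the approach attractive for rare but strongly correlated items in massive data.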