Foundations of Privacy
Lecture 1
Lecturer: Moni Naor
What is Privacy?
Extremely overloaded term
Hard to define
“Privacy is a value so complex, so entangled in
competing and contradictory dimensions, so
engorged with various and distinct meanings,
that I sometimes despair whether it can be usefully
addressed at all.”
Robert C. Post, Three Concepts of Privacy,
89 Geo. L.J. 2087 (2001).
Privacy is like oxygen – you only feel it when it is gone
What is Privacy?
Extremely overloaded term
• “the right to be let alone”
  - Samuel D. Warren and Louis D. Brandeis, The Right to Privacy, Harv. L. Rev. (1890)
• “our concern over our accessibility to others: the extent to which we are known to others, the extent to which others have physical access to us, and the extent to which we are the subject of others’ attention.”
  - Ruth Gavison, “Privacy and the Limits of the Law,” Yale Law Journal (1980)
What is Privacy?
Extremely overloaded term
• Photojournalism
  – Louis Brandeis and Samuel Warren: The Right to Privacy, Harvard Law Rev. 1890
• Census data
  – Mandatory participation
  – Information must not reveal individual data
• Huge databases collected by companies
  – Data deluge
  – Example: “Ravkav”
• Public Surveillance
  – Cameras
  – RFIDs
• Social Networks
Official Description
The availability of fast and cheap computers coupled with
massive storage devices has enabled the collection and
mining of data on a scale previously unimaginable.
This opens the door to potential abuse regarding individuals'
information. There has been considerable research
exploring the tension between utility and privacy in this
context.
The goal is to explore techniques and issues related to data
privacy. In particular:
• Definitions of data privacy
• Techniques for achieving privacy
• Limitations on privacy in various settings.
• Privacy issues in specific settings
Planned Topics
Privacy of Data Analysis
• Differential Privacy
  – Definition and Properties
  – Statistical databases
  – Dynamic data
• Privacy of learning algorithms
• Privacy of genomic data
Interaction with cryptography
• SFE
• Voting
• Entropic Security
• Data Structures
• Everlasting Security
• Privacy Enhancing Tech.
  – Mix nets
Course Information
Foundations of Privacy - Spring 2010
Instructor: Moni Naor
Office: Ziskind 248, Phone: 3701, E-mail: moni.naor@
When: Mondays, 11:00-13:00 (2 points)
Where: Ziskind 1
• Course web page:
  www.wisdom.weizmann.ac.il/~naor/COURSE/foundations_of_privacy.html
• Prerequisites: familiarity with algorithms, data structures, probability theory, and linear
  algebra, at an undergraduate level; a basic course in computability is assumed.
• Requirements:
– Participation in discussion in class
• Best: read the papers ahead of time
– Homework: There will be several homework assignments
• Homework assignments should be turned in on time (usually two weeks after they are
given)!
– Class Project and presentation
– Exam: none planned
Projects
• Report on a paper
• Apply a notion studied to some known domain
• Checking the state of privacy in some setting
Cryptography and Privacy
Extremely relevant - but does not solve the privacy problem
Secure Function Evaluation (SFE)
• How to distributively compute a function f(X1, X2, …, Xn),
  – where Xj is known to party j
• E.g., the sum s = sum(X1, X2, …, Xn)
  – Parties should only learn the final output (s)
• Many results, depending on
  – Number of players
  – Means of communication
  – The power and model of the adversary
  – How the function is represented
We are more worried about what to compute than how to compute it.
Example: Securely Computing Sums
0 ≤ Xi ≤ P-1. Want to compute S = Σi Xi mod P
[Ring diagram: parties 1..5 hold inputs X1..X5 and pass partial sums Y1..Y5 around the ring; all arithmetic mod P]
Party 1 selects r ∈R [0..P-1] and sends Y1 = X1 + r
Party i receives Yi-1 and sends Yi = Yi-1 + Xi
Party 1 receives Yn and announces S = Σi Xi = Yn - r
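A minimal Python sketch of this ring protocol, for concreteness (the centralized loop, variable names, and example modulus are illustrative assumptions; a real run would use point-to-point secure channels between the parties):

```python
import random

def secure_sum(inputs, P):
    """Ring-protocol sketch: each X_i is in [0, P-1]; only the final sum mod P is revealed."""
    r = random.randrange(P)              # Party 1's blinding value, r chosen uniformly in [0..P-1]
    y = (inputs[0] + r) % P              # Party 1 sends Y_1 = X_1 + r (mod P)
    for x in inputs[1:]:                 # Party i adds its input: Y_i = Y_{i-1} + X_i (mod P)
        y = (y + x) % P
    return (y - r) % P                   # Party 1 announces S = Y_n - r = sum of the X_i (mod P)

# Usage: five parties with inputs 3, 1, 4, 1, 5 and modulus 101
print(secure_sum([3, 1, 4, 1, 5], 101))  # -> 14
```

Each intermediate Yi looks uniformly random to the party receiving it, since it is blinded by r; but as the next slide notes, an adversary controlling two players already breaks this.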
Is this Protocol Secure?
To talk rigorously about cryptographic security:
• Specify the Power of the Adversary
  – Access to the data/system (if it controls two players, the protocol above is insecure)
  – Computational power? (can be all-powerful here)
  – “Auxiliary” information?
• Define a Break of the System
  – What is compromise?
  – What is a “win” for the adversary?
The Simulation Paradigm
A protocol is considered secure if:
• For every adversary (of a certain type) there exists a simulator that outputs an indistinguishable “transcript”.
Examples:
• Encryption
• Zero-knowledge
• Secure function evaluation
Power of analogy
SFE: Simulating the ideal model
A protocol is considered secure if:
• For every adversary there exists a simulator
operating in the ``ideal” (trusted party) model that
outputs an indistinguishable transcript.
Breaking = distinguishing!
Major result: “Any function f that can be evaluated
using polynomial resources can be securely
evaluated using polynomial resources”
The Problem with SFE
SFE does not imply privacy:
• The problem is with ideal model
– E.g., s = sum(a, b)
– Each player learns only what can be deduced from s and her own input to f
– If s and a together determine b, so be it.
Need ways of talking about leakage even in the ideal
model
Statistical Data Analysis
Huge social benefits from analyzing large collections of data:
• Finding correlations
  – E.g. medical: genotype/phenotype correlations
• Providing better services
  – Improve web search results, fit ads to queries
• Publishing official statistics
  – Census, contingency tables
• Data mining
  – Clustering, learning association rules, decision trees, separators, principal component analysis
However: data contains confidential information.
WHAT ABOUT PRIVACY? Better privacy ⇒ better data.
Example of Utility
[Figure: John Snow’s map of cholera cases in the London 1854 epidemic, with cholera cases and the suspected pump marked]
Modern Privacy of Data Analysis
Is public analysis of private data a meaningful/achievable goal?
The holy grail:
Get the utility of statistical analysis while protecting the privacy of every individual participant.
Ideally: “privacy-preserving” sanitization allows reasonably accurate answers to meaningful questions.
Sanitization: Traditional View
[Diagram: Data → Curator/Sanitizer → sanitized output]
A trusted curator can access a DB of sensitive information and should publish a privacy-preserving sanitized version.
Traditional View: Interactive Model
[Diagram: queries 1, 2, … sent to the sanitizer, which consults the data and answers]
Multiple queries, chosen adaptively.
Sanitization: Traditional View
[Diagram: Data → Curator/Sanitizer → sanitized output]
How to sanitize? Anonymization?
Auxiliary Information
• Information from any source other than the statistical database:
  – Other databases, including old releases of this one
  – Newspapers
  – General comments from insiders
  – Government reports, census website
  – Inside information from a different organization
    • E.g., Google’s view, if the attacker/user is a Google employee
Linkage Attacks: Malicious Use of Aux Info
The Netflix Prize
• Netflix Recommends Movies to its Subscribers
– Seeks improved recommendation system
– Offered $1,000,000 for 10% improvement
• Not concerned here with how this is measured
– Published training data
Prize won in September 2009
“BellKor's Pragmatic Chaos team”
From the Netflix Prize Rules Page…
• “The training data set consists of more than 100 million ratings
from over 480 thousand randomly-chosen, anonymous
customers on nearly 18 thousand movie titles.”
• “The ratings are on a scale from 1 to 5 (integral) stars. To
protect customer privacy, all personal information
identifying individual customers has been removed and all
customer ids have been replaced by randomly-assigned
ids. The date of each rating and the title and year of release
for each movie are provided.”
Netflix Data Release [Narayanan-Shmatikov 2008]
• Ratings for a subset of movies and users
• Usernames replaced with random IDs
• Some additional perturbation
Credit: Arvind Narayanan via Adam Smith
A Source of Auxiliary Information
• Internet Movie Database (IMDb)
– Individuals may register for an account and rate movies
– Need not be anonymous
• Probably want to create some web presence
– Visible material includes ratings, dates, comments
Use Public Reviews from IMDb.com
[Diagram: anonymized Netflix data + public, incomplete IMDb data = identified Netflix data, linking users Alice, Bob, Charlie, Danielle, Erica, Frank]
Credit: Arvind Narayanan via Adam Smith
De-anonymizing the Netflix Dataset
Results
• “With 8 movie ratings (of which 2 may be completely wrong) and dates that may have a 3-day error, 96% of Netflix subscribers whose records have been released can be uniquely identified in the dataset.”
• “For 89%, 2 ratings and dates are enough to reduce the set of plausible records to 8 out of almost 500,000, which can then be inspected by a human for further deanonymization.”
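A toy sketch of the linkage idea behind such results (this is not the Narayanan-Shmatikov scoring function; the records, date tolerance, and tie-breaking are illustrative assumptions): match an auxiliary profile of (movie, rating, approximate date) observations against the anonymized records and look for a uniquely best-scoring record.

```python
from datetime import date

# Anonymized release: record id -> {movie: (rating, date)}
anonymized = {
    "u1017": {"Brazil": (5, date(2005, 3, 2)), "Heat": (3, date(2005, 6, 9))},
    "u2044": {"Brazil": (4, date(2005, 3, 4)), "Alien": (5, date(2005, 7, 1))},
}

# Auxiliary info (e.g., public IMDb reviews of a known person): movie -> (rating, approx. date)
aux = {"Brazil": (5, date(2005, 3, 1)), "Heat": (3, date(2005, 6, 10))}

def score(record, aux, day_tolerance=3):
    """Count auxiliary observations consistent with a record (same rating, date within tolerance)."""
    hits = 0
    for movie, (rating, d) in aux.items():
        if movie in record:
            r2, d2 = record[movie]
            if r2 == rating and abs((d2 - d).days) <= day_tolerance:
                hits += 1
    return hits

# Rank candidate records; a unique top score suggests a re-identification
ranked = sorted(anonymized.items(), key=lambda kv: score(kv[1], aux), reverse=True)
print(ranked[0][0])  # -> "u1017"
```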
Consequences?
– Learn about movies that IMDb users didn’t want to tell the world about...
  • Sexual orientation, religious beliefs
– Subject of lawsuits under the Video Privacy Protection Act 1988 (settled, March 2010)
Credit: Arvind Narayanan via Adam Smith
AOL Search History Release (2006)
• 650,000 users, 20 Million queries, 3 months
• AOL’s goal:
– provide real query logs from real users
• Privacy?
– “Identifying information” replaced with random identifiers
– But: different searches by the same user still linked
AOL Search History Release (2006)
Name: Thelma Arnold
Age: 62
Widow
Residence: Lilburn, GA
Other Successful Attacks
• Against anonymized HMO records [Sweeney 98]
  – Proposed k-anonymity
• Against k-anonymity [MGK06]
  – Proposed l-diversity
• Against l-diversity [XT07]
  – Proposed m-invariance
• Against all of the above [GKS08]
“Composition” Attacks [Ganta-Kasiviswanathan-Smith, KDD 2008]
[Diagram: individuals → curators (Hospital A, Hospital B) → statsA, statsB → attacker → sensitive information]
• Example: two hospitals serve overlapping populations
  – What if they independently release “anonymized” statistics?
• Composition attack: combine the independent releases
“Composition” Attacks [Ganta-Kasiviswanathan-Smith, KDD 2008]
[Diagram: individuals → curators (Hospital A, Hospital B) → statsA, statsB → attacker → sensitive information]
From statsA: “Adam has either diabetes or high blood pressure”
From statsB: “Adam has either diabetes or emphysema”
• Example: two hospitals serve overlapping populations
  – What if they independently release “anonymized” statistics?
• Composition attack: combine the independent releases
“Composition” Attacks [Ganta-Kasiviswanathan-Smith, KDD 2008]
• “IPUMS” census data set: 70,000 people, randomly split into 2 pieces with an overlap of 5,000.
• With a popular technique (k-anonymity, k=30) applied to each database, one can learn the “sensitive” variable for 40% of individuals.
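A minimal sketch of the intersection idea behind a composition attack (the records, grouping, and sensitive values are illustrative assumptions, not the IPUMS experiment): each generalized release leaves a small set of plausible sensitive values for a person in the overlap, and intersecting the sets narrows the answer.

```python
# Each "anonymized" release groups people by coarse quasi-identifiers (age range, ZIP)
# and publishes the set of sensitive values occurring in the group.
release_A = {("30-40", "02138"): {"diabetes", "high blood pressure"}}
release_B = {("30-40", "02138"): {"diabetes", "emphysema"}}

def plausible_values(release, quasi_id):
    """Sensitive values consistent with a target's quasi-identifiers in one release."""
    return release.get(quasi_id, set())

# Adam (age 35, ZIP 02138) appears in both hospitals' data.
adam = ("30-40", "02138")
combined = plausible_values(release_A, adam) & plausible_values(release_B, adam)
print(combined)  # -> {'diabetes'}: each release alone was ambiguous, the combination is not
```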
Analysis of Social Network Graphs
• “Friendship” Graph
– Nodes correspond to users
– Users may list others as “friend,” creating an edge
• Edges are annotated with directional information
• Hypothetical Research Question
– How frequently is the “friend” designation reciprocated?
Attack
• Replace node names/labels with random identifiers
• Permits analysis of the structure of the graph
• Privacy hope: randomized identifiers make it hard/impossible to identify nodes with specific individuals,
  – thereby protecting the privacy of who is connected to whom
• Disastrous! [Backstrom-Dwork-Kleinberg 07]
  – Vulnerable to active and passive attacks
Flavor of Active Attack
• Targets: “Steve” and “Jerry”
• Attack contacts: A and B
• Connections: the attack contacts are linked to the targets
• Finding A and B allows finding Steve and Jerry
[Diagram: nodes A, S, B, J with edges between attack contacts and targets]
Flavor of Active Attack
• Magic step: isolate lightly linked-in subgraphs from the rest of the graph
• The special structure of the subgraph permits finding A and B
[Diagram: nodes A, S, B, J]
Why Settle for Ad Hoc Notions of Privacy?
Dalenius, 1977:
• Anything that can be learned about a respondent from the
statistical database can be learned without access to the
database
– Captures possibility that “I” may be an extrovert
– The database doesn’t leak personal information
– Adversary is a user
Goldwasser-Micali 1982
• Analogous to Semantic Security for Crypto
– Anything that can be learned from the ciphertext can be learned
without the ciphertext
– Adversary is an eavesdropper
Computational Security of Encryption
Semantic Security
Whatever adversary A can compute on an encrypted string X ∈ {0,1}^n, so can an A’ that does not see the encryption of X, yet simulates A’s knowledge with respect to X.
A selects:
• a distribution Dn on {0,1}^n
• a relation R(X,Y), computable in probabilistic polynomial time
For every pptm A there is a pptm A’ so that for all pptm relations R, for X ∈R Dn:
  |Pr[R(X, A(E(X)))] - Pr[R(X, A’(·))]| is negligible
Outputs of A and A’ are indistinguishable even for a tester who knows X.
[Diagram: X ∈R Dn; A is given E(X) and outputs Y, A’ is given nothing and outputs Y; in both experiments the tester checks R(X,Y), and the two views are ≈ indistinguishable]
Making it Slightly less Vague
Cryptographic Rigor Applied to Privacy
• Define a Break of the System
– What is compromise
– What is a “win” for the adversary?
• Specify the Power of the Adversary
– Access to the data
– Computational power?
– “Auxiliary” information?
• Conservative/Paranoid by Nature
– Protect against all feasible attacks
In full generality: Dalenius’ goal is impossible
  – Database teaches that smoking causes cancer
  – I smoke in public
  – Access to the DB teaches that I am at increased risk for cancer
• But what about cases where there is significant knowledge about the database distribution?
Outline
• The Framework
• A General Impossibility Result
– Dalenius’ goal cannot be achieved in a very general
sense
• The Proof
– Simplified
– General case
Two Models
[Diagram: Database → San → sanitized database]
Non-Interactive: data are sanitized and released
Two Models
[Diagram: Database ↔ San ↔ queries and answers]
Interactive: multiple queries, adaptively chosen
Auxiliary Information
Common theme in many privacy horror stories:
• Not taking into account side information
  – Netflix challenge: not taking into account IMDb [Narayanan-Shmatikov]
    (the database: Netflix ratings; SAN(DB) = remove names; the auxiliary information: IMDb)
Not learning from DB
[Diagram: with access to the database, A interacts with San(DB) using auxiliary information; without access to the database, A’ uses only the auxiliary information]
There is some utility of DB that legitimate users should learn.
• Possible breach of privacy
• Goal: users learn the utility without the breach
Not learning from DB
[Diagram: with access to the database, A interacts with San(DB) using auxiliary information; without access to the database, A’ uses only the auxiliary information]
Want: anything that can be learned about an individual from the database can be learned without access to the database
• ∀D ∀A ∃A’ w.h.p. over DB ∈R D, ∀ auxiliary information z:
  |Pr[A(z) ↔ San(DB) wins] - Pr[A’(z) wins]| is small
Illustrative Example for Difficulty
Want: anything that can be learned about a respondent from the database can be learned without access to the database.
• More formally: ∀D ∀A ∃A’ w.h.p. over DB ∈R D, ∀ auxiliary information z:
  |Pr[A(z) ↔ San(DB) wins] - Pr[A’(z) wins]| is small
Example: suppose height of individual is sensitive information
– Average height in DB not known a priori
• Aux z = “Adam is 5 cm shorter than average in DB”
– A learns average height in DB, hence, also Adam’s height
– A’ does not
Defining “Win”: The Compromise Function
Notion of privacy compromise: the adversary produces a string y, and a decider C(DB, y) outputs 0/1: is y a privacy breach of DB?
[Diagram: Adv outputs y; C checks y against DB drawn from D and returns 0/1]
A privacy compromise should be non-trivial:
• It should not be possible to find a privacy breach from the auxiliary information alone
A privacy breach should exist:
• Given DB there should be a y that is a privacy breach
• It should be possible to find y efficiently
Basic Concepts
• Distribution on (Finite) Databases D
– Something about the database must be unknown
– Captures knowledge about the domain
• E.g., rows of database correspond to owners of 2 pets
• Privacy Mechanism San(D, DB)
– Can be interactive or non-interactive
– May have access to the distribution D
• Auxiliary Information Generator AuxGen(D, DB)
– Has access to the distribution and to DB
– Formalizes partial knowledge about DB
• Utility Vector w
– Answers to k questions about the DB
– (Most of) utility vector can be learned by user
– Utility: Must inherit sufficient min-entropy from source D
Impossibility Theorem: Informal
• For any* distribution D on databases DB
• For any* reasonable privacy compromise decider C
• Fix any useful* privacy mechanism San (useful: tells us information we did not know)
Then
• There is an auxiliary info generator AuxGen and an adversary A such that
• For all adversary simulators A’:
  [A(z) ↔ San(DB)] wins (finds a compromise), but [A’(z)] does not win,
  where z = AuxGen(DB)
Impossibility Theorem
Fix any useful* privacy mechanism San and any reasonable
privacy compromise decider C. Then
There is an auxiliary info generator AuxGen and an adversary A
such that for “all” distributions D and all adversary simulators A’
Pr[A(D, San(D,DB), AuxGen(D, DB)) wins]
  - Pr[A’(D, AuxGen(D, DB)) wins] ≥ Δ
for suitable, large, Δ.
The probability spaces are over the choice of DB ∈R D and the coin flips of San, AuxGen, A, and A’.
To completely specify: need assumption on the entropy of utility vector
W and how well SAN(W) behaves
Strategy
• The auxiliary info generator will provide a hint
that together with the utility vector w will yield the
privacy breach.
• Want AuxGen to work without knowing D, just DB
– Find privacy breach y and encode in z
– Make sure z alone does not give y. Only with w
• Complication: is the utility vector w
– Completely learned by the user?
– Or just an approximation?
Entropy of Random Sources
• Source:
  – A probability distribution X on {0,1}^n
  – Contains some “randomness”
• Measures of “randomness”:
  – Shannon entropy: H(X) = - Σ_x P_X(x) log P_X(x)
    • Represents how much we can compress X on average
    • But even a high-entropy source may have a point with probability 0.9
  – Min-entropy: H∞(X) = - log max_x P_X(x)
    • Represents the probability of the most likely value of X
Definition: X is a k-source if H∞(X) ≥ k, i.e. Pr[X=x] ≤ 2^-k for all x
Min-entropy
• Definition: X is a k-source if H∞(X) ≥ k, i.e. Pr[X=x] ≤ 2^-k for all x
• Examples:
  – Bit-fixing: some k coordinates of X uniform, the rest fixed
    • or even depending arbitrarily on the others
  – Unpredictable source: ∀ i∈[n], b1, …, bi-1 ∈ {0,1}:
      k/n ≤ Pr[Xi = 1 | X1 = b1, …, Xi-1 = bi-1] ≤ 1 - k/n
  – Flat k-source: uniform over a set S ⊆ {0,1}^n, |S| = 2^k
• Fact: every k-source is a convex combination of flat k-sources.
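A small sketch, for intuition, of how Shannon entropy and min-entropy can differ on the kind of source mentioned above (a source with one point of probability 0.9); the example distribution is an illustrative assumption.

```python
import math

def shannon_entropy(p):
    """H(X) = -sum_x p(x) log2 p(x)."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def min_entropy(p):
    """H_inf(X) = -log2 max_x p(x)."""
    return -math.log2(max(p))

# One heavy point of probability 0.9; the remaining 0.1 spread over 1000 values.
p = [0.9] + [0.1 / 1000] * 1000
print(shannon_entropy(p))  # ~1.47 bits
print(min_entropy(p))      # ~0.15 bits: the heavy point caps the min-entropy
```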
Min-Entropy and Statistical Distance
For a probability distribution X over {0,1}^n:
H∞(X) = - log max_x Pr[X = x]
Represents the probability of the most likely value of X.
X is a k-source if H∞(X) ≥ k.
Statistical distance:
Δ(X,Y) = Σ_a |Pr[X=a] - Pr[Y=a]|
We want distributions that are close to the uniform distribution:
Extractors
Universal procedure for “purifying” an imperfect source.
Definition: Ext: {0,1}^n × {0,1}^d → {0,1}^ℓ is a (k,ε)-extractor if for any k-source X:
  Δ(Ext(X, U_d), U_ℓ) ≤ ε
[Diagram: x drawn from a k-source of length n (at least 2^k strings in {0,1}^n) and a seed s of d random bits enter EXT, which outputs ℓ almost-uniform bits]
Strong extractors
Output looks random even after seeing the seed.
Definition: Ext is a (k,ε) strong extractor if Ext’(x,s) = s ◦ Ext(x,s) is a (k,ε)-extractor
• i.e. ∀ k-sources X, for a 1-ε’ fraction of seeds s ∈ {0,1}^d, Ext(X,s) is ε-close to U_ℓ
Extractors from Hash Functions
• Leftover Hash Lemma [ILL89]: universal (pairwise independent) hash functions yield strong extractors
  – output length: ℓ = k - 2 log(1/ε)
  – seed length: d = O(n)
  – Example: Ext(x,(a,b)) = first ℓ bits of a·x + b in GF[2^n]
• Almost pairwise independence:
  – seed length: d = O(log n + k)
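A toy sketch of the hash-based example above, instantiated over GF(2^8) so it fits on a slide (the field size, the irreducible polynomial x^8+x^4+x^3+x+1, and the parameter choices are illustrative assumptions; a real extractor works over GF(2^n) for large n):

```python
import random

IRRED = 0x11B  # x^8 + x^4 + x^3 + x + 1, irreducible over GF(2): defines GF(2^8)

def gf_mul(a, b):
    """Carry-less multiplication in GF(2^8), reduced modulo IRRED."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:      # reduce whenever degree reaches 8
            a ^= IRRED
        b >>= 1
    return result

def ext(x, seed, ell):
    """Ext(x, (a, b)) = first ell bits of a*x + b in GF(2^8): a pairwise-independent hash."""
    a, b = seed
    h = gf_mul(a, x) ^ b           # a*x + b over GF(2^8)
    return h >> (8 - ell)          # keep the first (most significant) ell bits

# Usage: extract 4 almost-uniform bits from one draw of an imperfect 8-bit source
seed = (random.randrange(256), random.randrange(256))   # public random seed (a, b)
x = 0b10110010                                          # a sample from the weak source
print(ext(x, seed, 4))
```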
Suppose w Learned Completely
AuxGen and A share a secret: w.
AuxGen(DB):
• Find a privacy breach y of DB, of length ℓ
• Find w from DB
• Choose s ∈R {0,1}^d and compute Ext(w,s)
• Set z = (s, Ext(w,s) ⊕ y)
[Diagram: San(DB) gives w to A; AuxGen(DB) gives z to A; A submits a candidate breach to the decider C, which outputs 0/1]
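A schematic sketch of the hint, assuming w is learned completely: AuxGen masks the breach y with Ext(w,s), so anyone who later learns w can unmask y, while (s, Ext(w,s) ⊕ y) alone is close to uniform. The bit lengths and the stand-in extractor below are illustrative assumptions, not the construction's actual parameters.

```python
import random

def toy_ext(w, s, ell):
    """Stand-in for a seeded extractor: mixes w with seed s and keeps ell bits (illustration only)."""
    return ((w ^ s) * 0x9E3779B1 >> (32 - ell)) & ((1 << ell) - 1)

def aux_gen(y, w, ell):
    """AuxGen(DB): choose seed s, publish z = (s, Ext(w,s) XOR y)."""
    s = random.getrandbits(32)
    return (s, toy_ext(w, s, ell) ^ y)

def adversary_A(z, w, ell):
    """A interacts with San(DB), learns w, and unmasks the breach: y = Ext(w,s) XOR mask."""
    s, masked = z
    return toy_ext(w, s, ell) ^ masked

# w: utility vector learned from San(DB); y: a privacy breach of DB (ell bits)
ell, w, y = 8, 0xC0FFEE, 0b10100110
z = aux_gen(y, w, ell)
assert adversary_A(z, w, ell) == y   # A recovers y exactly
# A', who never sees w, gets only (s, near-uniform bits) and learns essentially nothing about y.
```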
Suppose w Learned Completely
AuxGen and A share a secret: w.
z = (s, Ext(w,s) ⊕ y)
[Diagram: A receives w from San(DB) and z from AuxGen(DB), submits a breach to C, which outputs 1; A’ (simulating A) receives only z, submits its guess to C, which outputs 0]
Technical conditions: H∞(W | y) ≥ |y| and |y| “safe”
Why is it a compromise?
AuxGen and A share a secret: w, with z = (s, Ext(w,s) ⊕ y).
Why doesn’t A’ learn y:
• For each possible value of y, (s, Ext(w,s)) is ε-close to uniform
• Hence (s, Ext(w,s) ⊕ y) is ε-close to uniform
Need H∞(W) ≥ 3ℓ + O(1)
Technical conditions: H∞(W | y) ≥ |y| and |y| “safe”
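The one step being used here, spelled out in the notation of the statistical-distance definition above (a standard fact, included as a sketch of the reasoning):

```latex
% For any fixed string y of length \ell, the map t \mapsto t \oplus y is a bijection
% on \{0,1\}^{\ell}, so applying it to both distributions preserves statistical distance:
\Delta\big((S,\mathrm{Ext}(W,S)\oplus y),\,(U_d,U_\ell)\big)
  = \Delta\big((S,\mathrm{Ext}(W,S)),\,(U_d,U_\ell\oplus y)\big)
  = \Delta\big((S,\mathrm{Ext}(W,S)),\,(U_d,U_\ell)\big) \le \varepsilon.
% Hence the hint z = (s,\mathrm{Ext}(w,s)\oplus y) by itself gives A' essentially
% no advantage in producing the breach y.
```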
To complete the proof
• Handle the case where not all of w is retrieved