CS346: Advanced Databases
Data Security and Privacy
Alexandra I. Cristea
A.I.Cristea@warwick.ac.uk
Outline
Chapter: “Database Security” in Elmasri and Navathe (chapter 24, 6th Edition)
 Brief overview of database security
 More detailed study of database privacy
– Statistical databases, and differential privacy to protect data
– Data anonymization: k-anonymity and l-diversity
Why?
 A topical issue: privacy and security are big concerns
 Connections to computer security, statistics
Database Security and Privacy
Database Security and Privacy is a large and complex area, covering:
 Legal and ethical requirements for data privacy
– E.g. UK Data Protection Act (1998)
– Determines for how long data can be retained about people
 Government and organisation policy issues on data sharing
– E.g. when and how credit reports, medical data can be shared
 System-level issues for security management
– How is access to data controlled by the system?
– How to determine if access to data is permitted?
 Classification of data by security levels (secret, unclassified)
Threats to Databases
Databases face many threats which must be protected against
 Integrity: prevent improper modification of the data
– Caused by intention or accident, insider or outsider threat
– Erroneous data can lead to incorrect decisions, fraud, errors
 Availability: is the data available to users and programs?
– Denial of service attacks prevent access, cost revenue/reputation
 Confidentiality: protect data from unauthorized disclosure
– Leakage of information could violate law, customer confidence
These three are sometimes referred to as the CIA triad
Control measures
 Security is provided by control measures of various types
 Access control: affects who can access the data / the system
– Different users may have different levels of access
 Inference control: control what data implies about individuals
– Try to make it impossible to infer facts about individuals
 Flow control: prevent information from escaping the database
– E.g. can data be transferred to other applications?
– Control covert channels that can be used to leak data
 Data Encryption: store encrypted data in the database
– At the field level or at the whole file level
– Trade-off between security and ease of processing
Information Security and Information Privacy
 The dividing line between security and privacy is hard to draw
 Security: prevent unauthorised use of system and data
– E.g. access controls to lock out unauthorised users
– E.g. encryption to hide data from those without the key
 Privacy: control the use of data
– Ensure that private information does not emerge from queries
– Ability of individuals to control the use of their data
 We will focus on database privacy for the remainder
Statistical Databases and Privacy
 Statistical databases keep data on large groups
– E.g. national population data (Office of National Statistics)
 The raw data in statistical databases is confidential
– Detailed data about individuals, e.g. from a census
 Users are permitted to retrieve statistics from the data
– E.g. averages, sums, counts, maximum values etc.
 Providing security for statistical databases is a big challenge
– Many crafty ways to extract private information from them
– The database should prevent queries that leak information
Statistical Database challenges
 Very specific queries can refer to a single person
– E.g. SELECT AVG(Salary) FROM EMPLOYEE WHERE AGE=22 AND POSTCODE=‘W1A 1AA’ AND DNO=5 : 45,000
– SELECT COUNT(*) FROM EMPLOYEE WHERE AGE=22 AND POSTCODE=‘W1A 1AA’ AND DNO=5 AND SALARY>40000 : 1
 How would you detect and reject such queries?
 Can arrange pairs of queries where the difference between the answers is small
– SELECT COUNT(*) FROM EMPLOYEE WHERE AGE>=22 AND DNO=5 AND SALARY>40000 : 12
– SELECT COUNT(*) FROM EMPLOYEE WHERE AGE>=23 AND DNO=5 AND SALARY>40000 : 11
– Subtracting the two answers isolates a single employee, as the sketch below shows
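To make this differencing attack concrete, here is a minimal Python sketch (illustrative, not from the lecture), using the answers of the two COUNT queries above:

# Differencing attack sketch, with the answers returned above
count_age_ge_22 = 12  # COUNT(*) WHERE AGE >= 22 AND DNO = 5 AND SALARY > 40000
count_age_ge_23 = 11  # COUNT(*) WHERE AGE >= 23 AND DNO = 5 AND SALARY > 40000

# The two predicates differ only in whether 22-year-olds are included,
# so subtracting the answers isolates them:
num_targets = count_age_ge_22 - count_age_ge_23
print(num_targets)  # 1 -> a single 22-year-old in DNO 5 earns over 40,000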
Differential Privacy for Statistical Databases
 Principle: query answers reveal little about any individual
– Even if the adversary knows (almost) everything about everyone else!
 Thus, individuals should feel secure about contributing their data
– What is learnt about them is about the same either way
 Much work on providing differential privacy (DP)
– Simple recipe for some data types (e.g. numeric answers)
– Simple rules allow us to reason about composition of results
– More complex algorithms for arbitrary data
 Adopted and used by several organizations:
– US Census, Common Data Project, Facebook (?)
Differential Privacy Definition
The output distribution of a differentially private algorithm changes very little whether or not any individual’s data is included in the input
– so you should contribute your data
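Formally (the standard definition, which the slide states only informally): a randomized algorithm M is ε-differentially private if, for all pairs of inputs D, D' differing in one individual’s data, and for every set S of possible outputs,

\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]

A smaller ε forces the two output distributions to be closer, and hence gives stronger privacy.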
Laplace Mechanism
 The Laplace Mechanism adds random noise to query results
– Scaled to mask the contribution of any individual
– Add noise from a symmetric continuous distribution to the true answer
– The Laplace distribution is a symmetric exponential distribution
– Laplace provides DP for COUNT queries, as shifting the distribution changes the probability by at most a constant factor
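A minimal sketch of the Laplace Mechanism in Python, assuming the standard calibration (noise scale = sensitivity / ε); the numbers are illustrative, not from the module:

import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon):
    # Standard calibration: noise drawn from Laplace(0, sensitivity/epsilon)
    scale = sensitivity / epsilon
    return true_answer + np.random.laplace(loc=0.0, scale=scale)

# COUNT has sensitivity 1: adding/removing one person changes it by at most 1
noisy_count = laplace_mechanism(true_answer=12, sensitivity=1.0, epsilon=0.5)
print(noisy_count)  # e.g. 13.7 -- the exact count is masked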
Sensitivity of Numeric Functions
 For more complex functions, we need to calibrate the noise to the influence an individual can have on the output
– The (global) sensitivity of a function F is the maximum (absolute) change over all possible adjacent inputs:
  S(F) = max_{D, D' : |D − D'| = 1} |F(D) − F(D')|
– Intuition: S(F) characterizes the scale of the influence of one individual, and hence how much noise we must add
 S(F) is small for many common functions
– S(F) = 1 for COUNT
– S(F) = 2 for HISTOGRAM (e.g. counts broken down as Female / Male)
– Bounded for other functions (MEAN, covariance matrix…)
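Reusing the laplace_mechanism sketch from the previous slide: S(F) is exactly what sets the noise scale. The histogram counts here are invented for illustration:

# HISTOGRAM has S(F) = 2: moving one person between buckets
# changes two cells by 1 each
noisy_female = laplace_mechanism(true_answer=340, sensitivity=2.0, epsilon=0.5)
noisy_male   = laplace_mechanism(true_answer=310, sensitivity=2.0, epsilon=0.5)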
Data Anonymization
The idea of data anonymization is compelling, and has many applications
 For Data Sharing
– Give real(istic) data to others to study without compromising privacy of individuals in the data
– Allows third parties to try new analysis and mining techniques not thought of by the data owner
 For Data Retention and Usage
– Various requirements prevent companies from retaining customer information indefinitely
– E.g. Google progressively anonymizes IP addresses in search logs
– Internal sharing across departments (e.g. billing → marketing)
Case Study: US Census
 Raw data: information about every US household
– Who, where; age, gender, racial, income and educational data
 Why released: determine representation, planning
 How anonymized: aggregated to geographic areas (Zip code)
– Broken down by various combinations of dimensions
– Released in full after 72 years
 Attacks: no reports of successful deanonymization
– Recent attempts by FBI to access raw data rebuffed
 Consequences: greater understanding of US population
– Affects representation, funding of civil projects
– Rich source of data for future historians and genealogists
Case Study: Netflix Prize
 Raw data: 100M dated ratings from 480K users on 18K movies
 Why released: improve predicting ratings of unlabeled examples
 How anonymized: exact details not described by Netflix
– All direct customer information removed
– Only a subset of the full data; dates modified; some ratings deleted
– Movie title and year published in full
 Attacks: dataset is claimed vulnerable
– Attack links data to IMDb, where the same users also rated movies
– Find matches based on similar ratings or dates in both
 Consequences: rich source of user data for researchers
– Unclear how serious the attacks are in practice
Case Study: AOL Search Data
 Raw data: 20M search queries for 650K users from 2006
 Why released: allow researchers to understand search patterns
 How anonymized: user identifiers removed
– All searches from the same user linked by an arbitrary identifier
 Attacks: many successful attacks identified individual users
– Ego-surfers: people typed in their own names
– Zip codes and town names identify an area
– NY Times identified user 4417749 as a 62yr old widow
 Consequences: CTO resigned, two researchers fired
– Well-intentioned effort failed due to inadequate anonymization
 Last time:
– generalities about security and privacy; privacy case studies
 Next:
– Anonymisation, de-identification, attacks
Models of Anonymization
 Interactive Model (akin to statistical databases)
– Data owner acts as “gatekeeper” to data
– Researchers pose queries in some agreed language
– Gatekeeper gives an (anonymized) answer, or refuses to answer
 “Send me your code” model
– Data owner executes code on their system and reports the result
– Cannot be sure that the code is not malicious
 Offline, aka “publish and be damned” model
– Data owner somehow anonymizes the data set
– Publishes the results to the world, and retires
– The model used in most real data releases
Objectives for Anonymization
 Prevent (high confidence) inference of associations
– Prevent inference of salary for an individual in “census”
– Prevent inference of individual’s viewing history in “video”
– Prevent inference of individual’s search history in “search”
– All aim to prevent linking sensitive information to an individual
 Prevent inference of presence of an individual in the data set
– Satisfying “presence” also satisfies “association” (not vice-versa)
– Presence in a data set can violate privacy (e.g., STD clinic patients)
 Have to consider what knowledge might be known to the attacker
– Background knowledge: facts about the data set (X has salary Y)
– Domain knowledge: broad properties of data (illness Z rare in men)
Utility
 Anonymization is meaningless if the utility of the data is not considered
– The empty data set has perfect privacy, but no utility
– The original data has full utility, but no privacy
 What is “utility”? It depends on the application…
– For a fixed query set, can look at maximum or average error
– Problem for publishing: want to support unknown applications!
– Need some way to quantify utility of alternate anonymizations
Definitions of Technical Terms
 Identifiers – uniquely identify, e.g. Social Security Number (SSN)
– Step 0: remove all identifiers
– Was not enough for AOL search data
 Quasi-Identifiers (QI) – such as DOB, Sex, ZIP Code
– Enough to partially identify an individual in a dataset
– DOB+Sex+ZIP unique for 87% of US residents [Sweeney 02]
 Sensitive attributes (SA) – the associations we want to hide
– Salary in the “census” example is considered sensitive
– Not always well-defined: only some “search” queries sensitive
– In “video”, the association between user and video is sensitive
– One SA can reveal others: bonus may identify salary…
Tabular Data Example
 Census data recording incomes and demographics
SSN       DOB      Sex  ZIP    Salary
11-1-111  1/21/76  M    53715  50,000
22-2-222  4/13/86  F    53715  55,000
33-3-333  2/28/76  M    53703  60,000
44-4-444  1/21/76  M    53703  65,000
55-5-555  4/13/86  F    53706  70,000
66-6-666  2/28/76  F    53706  75,000
 Releasing the SSN → Salary association violates individuals’ privacy
– SSN is an identifier, Salary is a sensitive attribute (SA)
Tabular Data Example: De-Identification
 Census data: remove SSN to create a de-identified table
– Remove an attribute from the data
DOB      Sex  ZIP    Salary
1/21/76  M    53715  50,000
4/13/86  F    53715  55,000
2/28/76  M    53703  60,000
1/21/76  M    53703  65,000
4/13/86  F    53706  70,000
2/28/76  F    53706  75,000
 Does the de-identified table preserve an individual’s privacy?
– Depends on what other information an attacker knows
Tabular Data Example: Linking Attack
 De-identified private data + publicly available data
De-identified private data:

DOB      Sex  ZIP    Salary
1/21/76  M    53715  50,000
4/13/86  F    53715  55,000
2/28/76  M    53703  60,000
1/21/76  M    53703  65,000
4/13/86  F    53706  70,000
2/28/76  F    53706  75,000

Publicly available data:

SSN       DOB
11-1-111  1/21/76
33-3-333  2/28/76

 Cannot uniquely identify either individual’s salary
– DOB is a quasi-identifier (QI)
Tabular Data Example: Linking Attack
 De-identified private data + publicly available data
De-identified private data:

DOB      Sex  ZIP    Salary
1/21/76  M    53715  50,000
4/13/86  F    53715  55,000
2/28/76  M    53703  60,000
1/21/76  M    53703  65,000
4/13/86  F    53706  70,000
2/28/76  F    53706  75,000

Publicly available data:

SSN       DOB      Sex
11-1-111  1/21/76  M
33-3-333  2/28/76  M

 Uniquely identified one individual’s salary, but not the other’s
– DOB, Sex are quasi-identifiers (QI)
Tabular Data Example: Linking Attack
 De-identified private data + publicly available data
De-identified private data:

DOB      Sex  ZIP    Salary
1/21/76  M    53715  50,000
4/13/86  F    53715  55,000
2/28/76  M    53703  60,000
1/21/76  M    53703  65,000
4/13/86  F    53706  70,000
2/28/76  F    53706  75,000

Publicly available data:

SSN       DOB      Sex  ZIP
11-1-111  1/21/76  M    53715
33-3-333  2/28/76  M    53703

 Uniquely identified both individuals’ salaries
– [DOB, Sex, ZIP] is unique for lots of US residents [Sweeney 02]
Tabular Data Example: Anonymization
 Anonymization through row suppression / deletion
Anonymized table:

DOB      Sex  ZIP    Salary
*        *    *      *
4/13/86  F    53715  55,000
2/28/76  M    53703  60,000
*        *    *      *
4/13/86  F    53706  70,000
2/28/76  F    53706  75,000

Attacker’s knowledge:

SSN       DOB      Sex  ZIP
11-1-111  1/21/76  M    53715
 Cannot link to the private table even with knowledge of QI values
– Missing values could take any permitted value
– Loses a lot of information from the data
Tabular Data Example: Anonymization
 Anonymization through QI attribute generalization
Anonymized table:

DOB      Sex  ZIP    Salary
1/21/76  M    537**  50,000
4/13/86  F    537**  55,000
2/28/76  *    537**  60,000
1/21/76  M    537**  65,000
4/13/86  F    537**  70,000
2/28/76  *    537**  75,000

Attacker’s knowledge:

SSN       DOB      Sex  ZIP
11-1-111  1/21/76  M    53715
33-3-333  2/28/76  M    53703
 Cannot uniquely identify a row even with knowledge of QI values
– Fewer possibilities than with row suppression
– E.g., ZIP = 537** → ZIP ∈ {53700, …, 53799}
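As a small illustration of this generalization step (a hypothetical helper in Python, not from the lecture):

def generalize_zip(zip_code, keep=3):
    # 53715 -> '537**': the true ZIP could be any of {53700, ..., 53799}
    z = str(zip_code)
    return z[:keep] + "*" * (len(z) - keep)

print(generalize_zip(53715))  # 537**
print(generalize_zip(53703))  # 537** -- the two rows become indistinguishable on ZIP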
k-Anonymization
 k-anonymity: Table T satisfies k-anonymity with respect to quasi-identifier QI if and only if each tuple in (the multiset) T[QI] appears at least k times
– Protects against the “linking attack”
 k-anonymization: Table T’ is a k-anonymization of T if T’ is a generalization/suppression of T, and T’ satisfies k-anonymity
T:

DOB      Sex  ZIP    Salary
1/21/76  M    53715  50,000
4/13/86  F    53715  55,000
2/28/76  M    53703  60,000
1/21/76  M    53703  65,000
4/13/86  F    53706  70,000
2/28/76  F    53706  75,000

→ T’ (a 2-anonymization):

DOB      Sex  ZIP    Salary
1/21/76  M    537**  50,000
4/13/86  F    537**  55,000
2/28/76  *    537**  60,000
1/21/76  M    537**  65,000
4/13/86  F    537**  70,000
2/28/76  *    537**  75,000
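A minimal sketch of checking the definition in Python (illustrative code; the rows are copied from T’ above): count how often each QI tuple occurs and require every count to be at least k.

from collections import Counter

def is_k_anonymous(rows, qi_columns, k):
    # k-anonymity: every tuple of QI values appears at least k times
    groups = Counter(tuple(row[c] for c in qi_columns) for row in rows)
    return all(count >= k for count in groups.values())

T_prime = [
    {"DOB": "1/21/76", "Sex": "M", "ZIP": "537**", "Salary": 50000},
    {"DOB": "4/13/86", "Sex": "F", "ZIP": "537**", "Salary": 55000},
    {"DOB": "2/28/76", "Sex": "*", "ZIP": "537**", "Salary": 60000},
    {"DOB": "1/21/76", "Sex": "M", "ZIP": "537**", "Salary": 65000},
    {"DOB": "4/13/86", "Sex": "F", "ZIP": "537**", "Salary": 70000},
    {"DOB": "2/28/76", "Sex": "*", "ZIP": "537**", "Salary": 75000},
]
print(is_k_anonymous(T_prime, ["DOB", "Sex", "ZIP"], k=2))  # True: each QI tuple appears twice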
k-Anonymization and queries
 Data Analysis
– Analysis should (implicitly) range over all possible tables
– Example question: what is the salary of individual (1/21/76, M, 53715)? Best guess is 57,500 (the average of 50,000 and 65,000)
– Example question: what is the maximum salary of males in 53706? Could be as small as 50,000, or as big as 75,000
DOB      Sex  ZIP    Salary
1/21/76  M    537**  50,000
4/13/86  F    537**  55,000
2/28/76  *    537**  60,000
1/21/76  M    537**  65,000
4/13/86  F    537**  70,000
2/28/76  *    537**  75,000
Homogeneity Attack
 Issue: k-anonymity requires each tuple in (the multiset) T[QI] to appear ≥ k times, but does not say anything about the SA values
– If (almost) all SA values in a QI group are equal, loss of privacy!
– The problem is with the choice of grouping, not the data
DOB      Sex  ZIP    Salary
1/21/76  M    53715  50,000
4/13/86  F    53715  55,000
2/28/76  M    53703  60,000
1/21/76  M    53703  50,000
4/13/86  F    53706  55,000
2/28/76  F    53706  60,000

→ Not Ok!

DOB      Sex  ZIP    Salary
1/21/76  *    537**  50,000
4/13/86  *    537**  55,000
2/28/76  *    537**  60,000
1/21/76  *    537**  50,000
4/13/86  *    537**  55,000
2/28/76  *    537**  60,000

(every QI group contains a single Salary value, so an attacker learns it exactly)
Homogeneity Attack
 Issue: k-anonymity requires each tuple in (the multiset) T[QI] to appear ≥ k times, but does not say anything about the SA values
– If (almost) all SA values in a QI group are equal, loss of privacy!
– The problem is with the choice of grouping, not the data
– For some groupings, no loss of privacy
DOB      Sex  ZIP    Salary
1/21/76  M    53715  50,000
4/13/86  F    53715  55,000
2/28/76  M    53703  60,000
1/21/76  M    53703  50,000
4/13/86  F    53706  55,000
2/28/76  F    53706  60,000

→ Ok!

DOB    Sex  ZIP    Salary
76-86  *    53715  50,000
76-86  *    53715  55,000
76-86  *    53703  60,000
76-86  *    53703  50,000
76-86  *    53706  55,000
76-86  *    53706  60,000

(each QI group now contains two distinct Salary values)
Homogeneity
 Intuition: a k-anonymized table T’ represents the set of all possible tables Ti s.t. T’ is a k-anonymization of Ti
 Lack of diversity of SA values implies that, for a large fraction of the possible tables, some fact is true, which can violate privacy
DOB      Sex  ZIP    Salary
1/21/76  *    537**  50,000
4/13/86  *    537**  55,000
2/28/76  *    537**  60,000
1/21/76  *    537**  50,000
4/13/86  *    537**  55,000
2/28/76  *    537**  60,000

Attacker’s knowledge:

SSN       DOB      Sex  ZIP
11-1-111  1/21/76  M    53715
l-Diversity
 l-Diversity Principle: a table is l-diverse if each of its QI groups contains at least l “well-represented” values for the SA
– Frequency l-diversity: for each QI group g, no SA value should occur more than a 1/l fraction of the time
DOB      Sex  ZIP    Salary
1/21/76  *    537**  50,000
4/13/86  *    537**  50,000
2/28/76  *    537**  60,000
1/21/76  *    537**  55,000
4/13/86  *    537**  55,000
2/28/76  *    537**  65,000
 Even l-diversity has its weaknesses: an adversary can use machine
learning techniques to make inferences about individuals
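Frequency l-diversity can be checked much like k-anonymity (a sketch in Python, using the same illustrative row format as the k-anonymity checker earlier):

from collections import Counter, defaultdict

def is_frequency_l_diverse(rows, qi_columns, sa_column, l):
    # Group SA values by QI tuple, then check that no single SA value
    # makes up more than a 1/l fraction of any group
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in qi_columns)].append(row[sa_column])
    return all(
        Counter(vals).most_common(1)[0][1] <= len(vals) / l
        for vals in groups.values()
    )

# For the "Not Ok!" grouping on the homogeneity slide, every QI group holds a
# single Salary value, so this check fails for any l > 1.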
Summary
 Concepts in database security: integrity, availability, confidentiality
 Statistical databases, and differential privacy to protect data
 Data anonymization: k-anonymity and l-diversity
 Identifiers, Quasi-identifiers, sensitive attributes
Recommended reading:
 Chapter: “Database Security” in Elmasri and Navathe
 “A Firm Foundation for Private Data Analysis”, Cynthia Dwork
 “k-anonymity”, V. Ciriani, S. De Capitani di Vimercati, S. Foresti, and
P. Samarati