Architectures and Algorithms for Data Privacy
Dilys Thomas
Stanford University, April 30th, 2007
Advisor: Rajeev Motwani
1
RoadMap
• Motivation for Data Privacy Research
• Sanitizing Data for Privacy
  - Privacy Preserving OLAP
  - K-Anonymity / Clustering for Anonymity
  - Probabilistic Anonymity
  - Masketeer
• Auditing for Privacy
• Distributed Architectures for Privacy
2
Motivation 1:
Data Privacy in Enterprises
• Health: personal medical details, disease history, clinical research data, hospital records
• Banking: bank statements, loan details, transaction history
• Govt. Agencies: census records, economic surveys
• Finance: portfolio information, credit history, transaction records, investment details
• Manufacturing: process details, blueprints, production data
• Insurance: claims records, accident history, policy details
• Retail Business: inventory records, individual credit card details, audits
• Outsourcing: customer data for testing, remote DB administration, BPO & KPO
3
Motivation 2:
Government Regulations

Country        | Privacy Legislation
Australia      | Privacy Amendment Act of 2000
European Union | Personal Data Protection Directive of 1998
Hong Kong      | Personal Data (Privacy) Ordinance of 1995
United Kingdom | Data Protection Act of 1998
United States  | Security Breach Information Act (S.B. 1386) of 2002;
               | Gramm-Leach-Bliley Act of 1999;
               | Health Insurance Portability and Accountability Act of 1996
4
Motivation 3: Personal Information
• Emails
• Searches on Google/Yahoo
• Profiles on Social Networking sites
• Passwords / Credit Card / Personal information at multiple E-commerce sites / Organizations
• Documents on the Computer / Network
5
Losses due to Lack of Privacy: ID-Theft
• 3% of households in the US affected by ID-Theft
• US $5-50B losses/year
• UK £1.7B losses/year
• AUS $1-4B losses/year
6
RoadMap
• Motivation for Data Privacy Research
• Sanitizing Data for Privacy
  - Privacy Preserving OLAP
  - K-Anonymity / Clustering for Anonymity
  - Probabilistic Anonymity
  - Masketeer
• Auditing for Privacy
• Distributed Architectures for Privacy
7
Privacy Preserving Data Analysis,
i.e., Online Analytical Processing (OLAP)
Computing statistics of data collected from multiple data sources while maintaining the privacy of each individual source.
Agrawal, Srikant, Thomas
SIGMOD 2005
8
Privacy Preserving OLAP
• Motivation
• Problem Definition
• Query Reconstruction
  - Inversion method (single attribute, multiple attributes)
  - Iterative method
• Privacy Guarantees
• Experiments
9
Horizontally Partitioned Personal Information
(Figure: each client Ci holds an original row ri, perturbs it to pi, and sends only pi to the server; the server assembles the perturbed rows p1, ..., pn into a table T for analysis.)
EXAMPLE: What number of children in this county go to college?
10
Vertically Partitioned Enterprise Information
(Figure: an original relation D1 (ID, C1) and an original relation D2 (ID, C2, C3) are held at different sites; each is perturbed independently to D'1 and D'2, and the perturbed relations are joined on ID to form the perturbed joined relation D' (ID, C1, C2, C3).)
EXAMPLE: What fraction of United customers to New York fly Virgin Atlantic to travel to London?
11
Privacy Preserving OLAP: Problem Definition
Compute
  select count(*) from T where P1 and P2 and P3 and ... and Pk
e.g., find the number of people with age in [30-50] and salary in [80-150],
i.e., Count_T(P1 ∧ P2 ∧ ... ∧ Pk).
Goals:
• provide error bounds to the analyst
• provide privacy guarantees to the data sources
• scale to a larger number of attributes
12
Perturbation Example: Uniform Retention Replacement
Throw a biased coin for each value (bias = 0.2):
• Heads: retain the original value
• Tails: replace it with a random number from a predefined pdf, here uniform at random from [1-5]
(Figure: a column of original values mapped to perturbed values according to the coin flips.)
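As a concrete illustration of the mechanism, here is a minimal Python sketch; the domain [1-5] and bias p = 0.2 are taken from the example above, while the sample values and names are my own assumptions.

```python
import random

def retention_replace(value, p=0.2, domain=(1, 5)):
    """Uniform retention replacement: with probability p keep the value
    (heads); otherwise replace it with a draw from the replacing
    distribution, here uniform over the integer domain (tails)."""
    if random.random() < p:            # heads: retain
        return value
    return random.randint(*domain)     # tails: replace u.a.r. from [1-5]

original = [1, 5, 4, 3, 1, 2]          # illustrative column values
perturbed = [retention_replace(v) for v in original]
print(perturbed)
```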
13
Retention Replacement Perturbation
• Done for each column
• The replacing pdf need not be uniform
• Best to use the original pdf if available / estimable
• Different columns can have different biases for retention
14
Single Attribute Example
What is the fraction of people in this building with age 30-50?
• Assume age is between 0-100.
• Whenever a person enters the building, they flip a coin with heads probability p = 0.2.
• Heads -- report the true age (RETAIN)
• Tails -- report a random number uniform in 0-100 (PERTURB)
• In total, 100 randomized numbers are collected.
• Of these, 22 are in 30-50.
• How many among the original are in 30-50?
15
Privacy Preserving OLAP
• Motivation
• Problem Definition
• Query Reconstruction
  - Inversion method (single attribute, multiple attributes)
  - Iterative method
• Privacy Guarantees
• Experiments
16
Analysis
Out of 100 rows: 80 perturbed (0.8 fraction), 20 retained (0.2 fraction).
17
Analysis Contd.
20% of the 80 randomized (perturbed) rows, i.e. 16 of them, satisfy Age[30-50]; the remaining 64 don't.
18
Analysis Contd.
Since there were 22 randomized rows in [30-50], 22 - 16 = 6 of them come from the 20 retained rows.
(Breakdown: 6 retained in Age[30-50], 14 retained not in Age[30-50]; 16 perturbed in Age[30-50], 64 perturbed not in Age[30-50].)
19
Scaling up

         | Total Rows | Age[30-50]
Retained | 20         | 6
All      | 100        | ?

Scaling the retained sample up by a factor of 100/20 = 5 gives 6 × 5 = 30. Thus 30 people had age 30-50, in expectation.
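The same arithmetic, written out as a small Python sketch (the numbers are those of the running example):

```python
n = 100            # perturbed rows collected
p = 0.2            # retention probability
b = 20 / 100       # chance a replacement value lands in [30-50]
observed = 22      # perturbed rows observed in [30-50]

replaced_in_range = (1 - p) * n * b               # 16 expected from replacement
retained_in_range = observed - replaced_in_range  # 6 retained rows in [30-50]
estimate = retained_in_range / p                  # scale the 20 retained rows up to 100
print(estimate)                                   # 30.0
```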
20
Multiple Attributes (k=2)
P1 = Age[30-50], P2 = Salary[80-150]

Query            | Estimated on T | Evaluated on T'
count(¬P1 ∧ ¬P2) | x0             | y0
count(¬P1 ∧ P2)  | x1             | y1
count(P1 ∧ ¬P2)  | x2             | y2
count(P1 ∧ P2)   | x3             | y3
21
Architecture
22
Formally: Select count(*) from R where Pred
p   = retention probability (0.2 in the example)
1-p = probability that an element is replaced by the replacing p.d.f.
b   = probability that an element from the replacing p.d.f. satisfies predicate Pred (0.2 in the example)
a   = 1 - b
23
Transition matrix

             Count_T'(¬P)   Count_T'(P)
Count_T(¬P)  (1-p)a + p     (1-p)b
Count_T(P)   (1-p)a         (1-p)b + p

i.e., solve x A = y, where x = [Count_T(¬P), Count_T(P)] and y = [Count_T'(¬P), Count_T'(P)].

A00 = probability that an original element satisfies ¬P and after perturbation satisfies ¬P
    = p        (probability it was retained)
    + (1-p)a   (probability it was perturbed and satisfies ¬P)
so A00 = (1-p)a + p.
24
Multiple Attributes
For k attributes,
• x, y are vectors of size 2^k
• x = y A^{-1}, where A = A1 ⊗ A2 ⊗ ... ⊗ Ak (tensor product)
• Ai is the transition matrix for column i
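A minimal NumPy sketch of this inversion step; the matrix structure follows the slides, while the biases and the counts in y are invented for illustration.

```python
import numpy as np

def transition_matrix(p, b):
    """2x2 transition matrix for one column: retention probability p,
    probability b that a replacement value satisfies the predicate.
    Row/column order is (not P, P), so that x A = y."""
    a = 1 - b
    return np.array([[(1 - p) * a + p, (1 - p) * b],
                     [(1 - p) * a,     (1 - p) * b + p]])

A1 = transition_matrix(p=0.2, b=0.2)    # e.g. Age[30-50] over [0-100]
A2 = transition_matrix(p=0.5, b=0.35)   # hypothetical salary predicate
A = np.kron(A1, A2)                     # tensor (Kronecker) product, 4x4

y = np.array([30.0, 10.0, 50.0, 10.0])  # counts evaluated on perturbed T'
x = y @ np.linalg.inv(A)                # reconstructed counts on T: x = y A^{-1}
print(x)
```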
25
Error Bounds
• In our example, we want to say that when the estimated answer is 30, the actual answer lies in [28-32] with probability greater than 0.9.
• Given T → T' with n rows, f(T) is (n, ε, δ)-reconstructible by g(T') if
  |f(T) - g(T')| < max(ε, ε·f(T)) with probability greater than (1 - δ).
• ε·f(T) = 2, δ = 0.1 in the example above.
26
Theoretical Basis and Results
Theorem: The fraction f of rows in [low, high] in the original table, estimated by matrix inversion on the table obtained after uniform perturbation, is an (n, ε, δ) estimator for f if n > 4 log(2/δ) (p ε)^{-2}, by Chernoff bounds.
Theorem: The vector x obtained by matrix inversion is the MLE (maximum likelihood estimator), shown using the Lagrange multiplier method and by showing that the Hessian is negative.
27
Iterative Algorithm [AS00]
Initialize: x^0 = y
Iterate:
  x_p^{T+1} = Σ_{q=0}^{t} y_q · ( a_{pq} x_p^T / Σ_{r=0}^{t} a_{rq} x_r^T ),  where t = 2^k - 1
  [by application of Bayes' rule]
Stop condition: two consecutive x iterates do not differ much.
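A sketch of that update in NumPy (function and variable names are mine; y is the vector of bucket counts observed on the perturbed table and A is the transition matrix from the previous slides):

```python
import numpy as np

def iterative_reconstruct(y, A, tol=1e-9, max_iter=1000):
    """Iterative Bayes-rule reconstruction:
    x_p <- x_p * sum_q A[p, q] * y_q / (sum_r A[r, q] * x_r).
    Unlike plain inversion, the iterates stay non-negative."""
    y = np.asarray(y, dtype=float)
    x = y.copy()                        # initialize x0 = y
    for _ in range(max_iter):
        denom = x @ A                   # denom[q] = sum_r A[r, q] * x_r
        x_new = x * (A @ (y / denom))   # Bayes-rule update for every bucket p
        if np.abs(x_new - x).sum() < tol:
            break
        x = x_new
    return x
```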
29
Iterative Algorithm
We had proved:
• Theorem: the Inversion Algorithm gives the MLE
• Theorem [AA01]: the Iterative Algorithm gives the MLE under the additional constraint that x_i ≥ 0, ∀ 0 ≤ i ≤ 2^k - 1
  - Models the fact that the probabilities are non-negative
  - Results are better, as shown in the experiments
30
Privacy Guarantees
Say we initially know, with probability < 0.3, that Alice's age > 25.
If, after seeing the perturbed value, we can say so with probability > 0.95, then we say there is a (0.3, 0.95) privacy breach.
A more subtle differential-privacy analysis appears in the thesis.
32
Privacy Preserving OLAP
• Motivation
• Problem Definition
• Query Reconstruction
• Privacy Guarantees
• Experiments
33
Experiments
• Real data: Census data from the UCI Machine Learning Repository, with 32,000 rows
• Synthetic data: generated multiple columns of Zipfian data; number of rows varied between 1,000 and 1,000,000
• Error metric: l1 norm of the difference between x and y, i.e. the L1 norm between two probability distributions; e.g. for 1-dim queries, |x1 - y1| + |x0 - y0|
34
Inversion vs Iterative Reconstruction
(Charts: 2 attributes and 3 attributes, Census Data.)
The iterative algorithm (MLE on the constrained space) outperforms inversion (global MLE).
35
Error as a function of Number of Columns:
Iterative Algorithm: Zipf Data
The error in the iterative algorithm flattens
out as its maximum value is bounded by 2
36
Error as a function of Number of Columns
(Charts: Census Data, Inversion Algorithm and Iterative Algorithm.)
The error increases exponentially with the increase in the number of columns.
37
Error as a function of number of Rows
The error decreases as the number of rows, n, increases.
38
Conclusion
It is possible to run OLAP on data across multiple servers so that probabilistically approximate answers are obtained and data privacy is maintained.
The techniques have been tested experimentally on real and synthetic data; more experiments are in the paper.
Privacy Preserving OLAP is Practical
39
RoadMap
• Motivation for Data Privacy Research
• Sanitizing Data for Privacy
  - Privacy Preserving OLAP
  - K-Anonymity / Clustering for Anonymity
  - Probabilistic Anonymity
  - Masketeer
• Auditing for Privacy
• Distributed Architectures for Privacy
40
Anonymizing Tables: ICDT05
Creating tables that do not identify individuals
for research or out-sourced software
development purposes
Aggarwal, Feder, Kenthapadi, Motwani,
Panigrahy, Thomas, Zhu
41
Achieving Anonymity via Clustering: PODS06
Aggarwal, Feder, Kenthapadi, Khuller,
Panigrahy, Thomas, Zhu
Probabilistic Anonymity: (submitted)
Lodha, Thomas
42
Data Privacy
• Value disclosure: what is the value of attribute salary of person X?
  - Perturbation
  - Privacy Preserving OLAP
• Identity disclosure: whether an individual is present in the database table
  - Randomization, K-Anonymity etc.
  - Data for Outsourcing / Research
43
Original Dataset
(SSN and Name are identifying; Disease is sensitive.)

SSN | Name   | DOB      | Gender | Zip code | Disease
614 | Sara   | 03/04/76 | F      | 94305    | Flu
615 | Joan   | 07/11/80 | F      | 94307    | Cold
629 | Karan  | 05/09/55 | M      | 94301    | Diabetes
710 | Harris | 11/23/62 | M      | 94305    | Flu
840 | Carl   | 11/23/62 | M      | 94059    | Arthritis
780 | Amanda | 01/07/50 | F      | 94042    | Heart problem
619 | Rob    | 04/08/43 | M      | 94042    | Arthritis
44
Randomized Dataset
(SSN and Name are identifying; Disease is sensitive.)

SSN | Name   | DOB      | Gender | Zip code | Disease
101 | Amy    | 03/04/76 | F      | 94305    | Flu
102 | Betty  | 07/11/80 | F      | 94307    | Cold
103 | Clarke | 05/09/55 | M      | 94301    | Diabetes
104 | David  | 11/23/62 | M      | 94305    | Flu
105 | Earl   | 11/23/62 | M      | 94059    | Arthritis
106 | Finy   | 01/07/50 | F      | 94042    | Heart problem
107 | George | 04/08/43 | M      | 94042    | Arthritis
45
Quasi-Identifiers
Quasi-identifiers: approximate foreign keys. Together they can uniquely identify you!
(Disease remains the sensitive attribute.)

DOB      | Gender | Zip code | Disease
03/04/76 | F      | 94305    | Flu
07/11/80 | F      | 94307    | Cold
05/09/55 | M      | 94301    | Diabetes
12/30/72 | M      | 94305    | Flu
11/23/62 | M      | 94059    | Arthritis
01/07/50 | F      | 94042    | Heart problem
04/08/43 | M      | 94042    | Arthritis
46
k-Anonymity Model [Swe00]
• Modify some entries of the quasi-identifiers so that each modified row becomes identical to at least k-1 other rows with respect to the quasi-identifiers
• Individual records are hidden in a crowd of size k
47
Original Table

Name   | Age | Salary
Amy    | 25  | 50
Brian  | 27  | 60
Carol  | 29  | 100
David  | 35  | 110
Evelyn | 39  | 120
48
Suppressing all entries: No Utility

Name   | Age | Salary
Amy    | *   | *
Brian  | *   | *
Carol  | *   | *
David  | *   | *
Evelyn | *   | *
49
2-Anonymity with Clustering

Name   | Age     | Salary
Amy    | [25-29] | [50-100]
Brian  | [25-29] | [50-100]
Carol  | [25-29] | [50-100]
David  | [35-39] | [110-120]
Evelyn | [35-39] | [110-120]

Cluster centers published:
27 = (25+27+29)/3, 70 = (50+60+100)/3
37 = (35+39)/2, 115 = (110+120)/2

Clustering formulation: NP-Hard
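A toy Python sketch of the published output for this table; the clustering itself is fixed by hand here, since finding the optimal one is the NP-hard part.

```python
rows = {"Amy": (25, 50), "Brian": (27, 60), "Carol": (29, 100),
        "David": (35, 110), "Evelyn": (39, 120)}
clusters = [["Amy", "Brian", "Carol"], ["David", "Evelyn"]]   # assumed clustering, size >= 2

for members in clusters:
    ages = [rows[m][0] for m in members]
    salaries = [rows[m][1] for m in members]
    age_range = f"[{min(ages)}-{max(ages)}]"
    salary_range = f"[{min(salaries)}-{max(salaries)}]"
    for m in members:                                  # each row hidden in its cluster
        print(m, age_range, salary_range)
    center = (sum(ages) / len(ages), sum(salaries) / len(salaries))
    print("cluster center:", center)                   # (27.0, 70.0) and (37.0, 115.0)
```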
50
Clustering Metrics
(Figure: clusters of 10 points with radius 5, 50 points with radius 15, and 20 points with radius 10.)
54
r-center Clustering: Minimize Maximum Cluster Size
55
Cellular Clustering: Linear Program
Minimize Σ_c ( Σ_i x_ic d_c + f_c y_c )    (sum of cellular cost and facility cost)
Subject to:
  Σ_c x_ic ≥ 1     (each point belongs to a cluster)
  x_ic ≤ y_c       (a cluster must be opened for a point to belong to it)
  0 ≤ x_ic ≤ 1     (points belong to clusters fractionally)
  0 ≤ y_c ≤ 1      (clusters are opened fractionally)
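For concreteness, this LP relaxation can be written down directly, e.g. with SciPy. This is a small sketch: the point/cluster counts and the costs d_c, f_c below are invented, and the per-point cluster cost is folded into d_c as on the slide.

```python
import numpy as np
from scipy.optimize import linprog

n, C = 4, 2                     # points and candidate clusters (illustrative)
d = np.array([1.0, 2.0])        # cellular (per-point) cost d_c
f = np.array([5.0, 3.0])        # facility (opening) cost f_c

num_x = n * C                   # x[i, c] flattened at index i*C + c, then y[c]
cost = np.concatenate([np.tile(d, n), f])

A_ub, b_ub = [], []
for i in range(n):              # each point joins at least one cluster
    row = np.zeros(num_x + C)
    row[i * C:(i + 1) * C] = -1.0         # -sum_c x_ic <= -1
    A_ub.append(row); b_ub.append(-1.0)
for i in range(n):              # a cluster must be open for a point to join it
    for c in range(C):
        row = np.zeros(num_x + C)
        row[i * C + c] = 1.0              # x_ic - y_c <= 0
        row[num_x + c] = -1.0
        A_ub.append(row); b_ub.append(0.0)

res = linprog(cost, A_ub=np.array(A_ub), b_ub=b_ub,
              bounds=[(0, 1)] * (num_x + C))
print(res.fun, res.x)
```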
56
Quasi-identifier
Example column Fruit with values: Apple, Guava, Orange, Apple, Banana.
A 0.6 fraction of the rows is uniquely identified by Fruit; hence Fruit is a 0.6 quasi-identifier.
A 0.87 fraction of the U.S. population is uniquely identified by (DOB, Gender, Zipcode); hence it is a 0.87 quasi-identifier.
58
Quasi-Identifier
Find the probability distribution over D distinct values that maximizes the expected fraction of uniquely identified records.
With D distinct values and n rows:
• if D <= n: D/(en) (skewed distribution)
• else: e^{-n/D} (uniform distribution)
59
Distinct values - Identifier
• DOB: 60 * 365 ≈ 2*10^4
• Gender: 2
• Zipcode: 10^5
• (DOB, Gender, Zipcode) together has 2*10^4 * 2 * 10^5 = 4*10^9 distinct values
• US population = 3*10^8
• Fraction of singletons = e^{-(3*10^8)/(4*10^9)} ≈ 0.92
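The same back-of-the-envelope calculation in Python (numbers rounded as on the slide):

```python
import math

dob = 2 * 10 ** 4        # ~60 years * 365 days of birth
gender = 2
zipcode = 10 ** 5
D = dob * gender * zipcode          # 4e9 distinct (DOB, Gender, Zipcode) values
n = 3 * 10 ** 8                     # U.S. population

print(math.exp(-n / D))             # ~0.93 expected fraction of singletons (slide: 0.92)
```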
60
Distinct values and K-anonymity
• E.g., apply HIPAA to (Age in years, Zipcode, Gender, Doctor details).
• Want k = 20,000 = 2*10^4 anonymity with n = 300 million = 3*10^8 people.
• The number of distinct values allowed is D = n/k = 1.5*10^4.
• D = z (zipcode values) * 100 (age in years) * 2 (gender) = 200z.
• 1.5*10^4 = 200z, so z ≈ 10^2.
• Retain only the first two digits of the zipcode (roughly, retain states).
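Again as a quick Python check of the arithmetic (a sketch; the granularity counts are those assumed on the slide):

```python
k = 2 * 10 ** 4          # desired anonymity (HIPAA-scale example)
n = 3 * 10 ** 8          # population size
D = n / k                # budget of distinct quasi-identifier values: 15000
z = D / (100 * 2)        # zipcode granularity after 100 ages * 2 genders
print(D, z)              # 15000.0 75.0 -> roughly 10^2, i.e. keep 2 zipcode digits
```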
61
Experiments
• Efficient algorithms, based on randomized algorithms to find quantiles in small space
  - 10 seconds to anonymize a quarter million rows, or approximately 3GB per hour, on a machine with a 2.66GHz processor, 504 MB RAM, and Windows XP with Service Pack 2
  - an order of magnitude better running time for a quasi-identifier of size 10 than the previous implementation
• Optimal algorithms to anonymize the dataset
• Scalable
  - almost independent of the anonymity parameter k
  - linear in quasi-identifier size (previously exponential)
  - linear in dataset size
67
Masketeer: A tool for data privacy
Das, Lodha, Patwardhan, Sundaram, Thomas.
72
RoadMap
• Motivation for Data Privacy Research
• Sanitizing Data for Privacy
  - Privacy Preserving OLAP
  - K-Anonymity / Clustering for Anonymity
  - Probabilistic Anonymity
  - Masketeer
• Auditing for Privacy
• Distributed Architectures for Privacy
73
Auditing Batches of SQL Queries
Given a set of SQL queries that have been posed over a database, determine whether some subset of these queries has revealed private information about an individual or a group of individuals.
Motwani, Nabar, Thomas
PDM Workshop with ICDE 2007
74
Example
SELECT zipcode
FROM Patients p
WHERE p.disease = 'diabetes'

AUDIT zipcode
FROM Patients p
WHERE p.disease = 'high blood pressure'
→ the query is not suspicious with respect to this audit expression

AUDIT disease
FROM Patients p
WHERE p.zipcode = 94305
→ the query is suspicious if someone in zipcode 94305 has diabetes
76
Query Suspicious wrt an Audit Expression
A query is suspicious with respect to an audit expression if:
• all columns of the audit expression are covered by the query, and
• the audit expression and the query have at least one tuple in common
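A minimal Python sketch of this test; the representation is assumed (each query and audit expression is reduced to a set of column names and a set of tuple identifiers), and the patient ids are hypothetical.

```python
def is_suspicious(query_columns, query_tuples, audit_columns, audit_tuples):
    """A query is suspicious w.r.t. an audit expression if it covers all
    audited columns and the two share at least one tuple."""
    covers_columns = set(audit_columns) <= set(query_columns)
    shares_tuple = bool(set(query_tuples) & set(audit_tuples))
    return covers_columns and shares_tuple

# The diabetes query against the second audit expression of the example,
# assuming some hypothetical patient in zipcode 94305 has diabetes:
print(is_suspicious(query_columns={"zipcode", "disease"},
                    query_tuples={"patient_42"},
                    audit_columns={"disease", "zipcode"},
                    audit_tuples={"patient_42", "patient_7"}))   # True
```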
77
SQL Batch Auditing
(Figure: queries 1-4 and an audit expression over table T; the audited tuple's columns are covered syntactically by the queries together.)
A query batch is semantically suspicious with respect to an audit expression iff, on some instance of table T, the queries together cover all audited columns of at least one audited tuple.
81
Syntactic and Semantic Auditing
• Checking for semantic suspiciousness has a polynomial-time algorithm
• Checking for syntactic suspiciousness is NP-complete
82
RoadMap
• Motivation for Data Privacy Research
• Sanitizing Data for Privacy
  - Privacy Preserving OLAP
  - K-Anonymity / Clustering for Anonymity
  - Probabilistic Anonymity
  - Masketeer
• Auditing for Privacy
• Distributed Architectures for Privacy
83
Two Can Keep a Secret:
A Distributed Architecture for Secure Database Services
How to distribute data across multiple sites for (1) redundancy and (2) privacy, so that a single site being compromised does not lead to data loss.
Aggarwal, Bawa, Ganesan, Garcia-Molina, Kenthapadi,
Motwani, Srivastava, Thomas, Xu
CIDR 2005
84
Distributing data and Partitioning and Integrating
Queries for Secure Distributed Databases
Feder, Ganapathy, Garcia-Molina,
Motwani, Thomas
Work in Progress
85
Motivation
• Data outsourcing growing in popularity
  - Cheap, reliable data storage and management
  - 1TB for $399 → less than $0.5 per GB
  - $5000 for Oracle 10g / SQL Server
  - $68k/year for a DB admin
• Privacy concerns looming ever larger
  - High-profile thefts (often insiders)
    - UCLA lost 900k records
    - Berkeley lost a laptop with sensitive information
    - Acxiom, JP Morgan, Choicepoint
    - www.privacyrights.org
86
Present solutions
• Application level: Salesforce.com
  - On-Demand Customer Relationship Management
  - $65/User/Month ---- $995 / 5 Users / 1 Year
• Amazon Elastic Compute Cloud
  - 1 instance = 1.7GHz x86 processor, 1.75GB RAM, 160GB local disk, 250 Mb/s network bandwidth
  - Elastic, completely controlled, reliable, secure
  - $0.10 per instance hour
  - $0.20 per GB of data in/out of Amazon
  - $0.15 per GB-Month of Amazon S3 storage used
• Google Apps for your domain
  - Small businesses, Enterprise, School, Family or Group
87
Encryption Based Solution
(Figure: the client sends query Q to a client-side processor, which keeps the data at the DSP encrypted; the rewritten query Q' goes to the DSP, the DSP returns the "relevant data", and the client-side processor computes the answer.)
Problem: Q' → "SELECT *"
88
The Power of Two
(Figure: the client's data is distributed across two providers, DSP1 and DSP2.)
89
The Power of Two
(Figure: the client-side processor splits query Q into Q1, sent to DSP1, and Q2, sent to DSP2, and combines the results.)
Key: Ensure Cost(Q1) + Cost(Q2) ≈ Cost(Q)
90
SB1386 Privacy
{Name, SSN},
{Name, LicenceNo},
{Name, CaliforniaID},
{Name, AccountNumber},
{Name, CreditCardNo, SecurityCode}
are all to be kept private.
• A set is private if at least one of its elements is "hidden".
  - An element in encrypted form is ok.
91
Techniques
• Vertical Fragmentation
  - Partition attributes across R1 and R2
  - E.g., to obey constraint {Name, SSN}: R1 ← Name, R2 ← SSN
  - Use tuple IDs for reassembly: R = R1 JOIN R2
• Encoding
  - One-time Pad: for each value v, construct a random bit sequence r; R1 ← v XOR r, R2 ← r
  - Deterministic Encryption: R1 ← E_K(v), R2 ← K; can detect equality and push selections with equality predicates
  - Random addition: R1 ← v + r, R2 ← r; can push the aggregate SUM
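A small Python sketch of two of these encodings; the byte and modulus conventions are my own assumptions, and the salaries reuse the toy table from earlier.

```python
import secrets

def one_time_pad_split(value_bytes):
    """One-time pad: R1 gets v XOR r, R2 gets r; neither share alone reveals v."""
    r = secrets.token_bytes(len(value_bytes))
    share1 = bytes(v ^ k for v, k in zip(value_bytes, r))
    return share1, r

def additive_split(value, modulus=2 ** 32):
    """Random addition: R1 gets v + r, R2 gets r (mod 2^32)."""
    r = secrets.randbelow(modulus)
    return (value + r) % modulus, r

# SUM can be pushed to each side and recombined: SUM(v + r) - SUM(r) = SUM(v).
salaries = [50, 60, 100, 110, 120]
shares = [additive_split(s) for s in salaries]
sum1 = sum(a for a, _ in shares) % 2 ** 32
sum2 = sum(b for _, b in shares) % 2 ** 32
print((sum1 - sum2) % 2 ** 32)      # 440, the true SUM of the salaries
```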
92
Example
• An Employee relation: {Name, DoB, Position, Salary, Gender, Email, Telephone, ZipCode}
• Privacy Constraints:
  - {Telephone}, {Email}
  - {Name, Salary}, {Name, Position}, {Name, DoB}
  - {DoB, Gender, ZipCode}
  - {Position, Salary}, {Salary, DoB}
• Will use just Vertical Fragmentation and Encoding.
93
Example (2)
Constraints: {Telephone}, {Email}, {Name, Salary}, {Name, Position}, {Name, DoB}, {DoB, Gender, ZipCode}, {Position, Salary}, {Salary, DoB}
(Figure: the Employee attributes Name, DoB, Position, Salary, Gender, Email, Telephone and ZipCode are split into two fragments R1 and R2, each keyed by the tuple ID, so that every constraint above is satisfied.)
94
Partitioning, Execution
• Partitioning Problem
  - Partition to minimize communication cost for a given workload
  - Even the simplified version is hard to approximate
  - Hill-climbing algorithm, starting from a weighted set cover
• Query Reformulation and Execution
  - Consider only centralized plans
  - Algorithm to partition select and where-clause predicates between the two partitions
95
Thank You!
99
Acknowledgements: Stanford Faculty
 Advisor: Rajeev Motwani
 Members of Orals Committee:
Rajeev Motwani, Hector Garcia-Molina,
Dan Boneh, John Mitchell, Ashish Goel
 Many other professors at Stanford, esp.
Jennifer Widom
100
Acknowledgements: Projects
 STREAM: Jennifer Widom, Rajeev
Motwani
 PORTIA: Hector Garcia-Molina, Rajeev
Motwani, Dan Boneh, John Mitchell
 TRUST: Dan Boneh, John Mitchell,
Rajeev Motwani, Hector Garcia-Molina
 RAIN: Rajeev Motwani, Ashish Goel,
Amin Saberi
101
Acknowledgements: Internship Mentors
Rakesh Agrawal, Ramakrishnan Srikant,
Surajit Chaudhuri, Nicolas Bruno, Phil
Gibbons, Sachin Lodha, Anand
Rajaraman
102
Acknowledgements: CoAuthors[A-K]
Gagan Aggarwal, Rakesh Agrawal,
Arvind Arasu, Brian Babcock,
Shivnath Babu, Mayank Bawa,
Nicolas Bruno, Renato Carmo,
Surajit Chaudhuri, Mayur Datar,
Prasenjit Das, A A Diwan, Tomás
Feder, Vignesh Ganapathy, Prasanna
Ganesan, Hector Garcia-Molina,
Keith Ito, Krishnaram Kenthapadi,
Samir Khuller, Yoshiharu
Kohayakawa,
103
Acknowledgements: CoAuthors[L-Z]
Eduardo Sany Laber, Sachin Lodha,
Nina Mishra, Rajeev Motwani,
Shubha Nabar, Itaru Nishizawa,
Liadan Boyen, Rina Panigrahy,
Nikhil Patwardhan, Ramakrishnan
Srikant, Utkarsh Srivastava, S.
Sudarshan, Sharada Sundaram,
Rohit Varma, Jennifer Widom, Ying
Xu, An Zhu
104
Acknowledgements: Others not in previous list
 Aristides, Gurmeet, Aleksandra, Sergei, Damon,
Anupam, Arnab, Aaron, Adam, Mukund, Vivek, Anish,
Parag, Vijay, Piotr, Moses, Sudipto, Bob, David, Paul,
Zoltan etc.
 Members of Rajeev’s group, Stanford Theory, Database,
Security groups, Also many PhD students of the
incoming year 2002 -- Paul etc. and many other students
at Stanford
 Lynda, Maggie, Wendy, Jam, Kathi, Claire, Meredith for
administrative help
 Andy, Miles, Lilian for keeping the machines running!
 Various outing clubs and groups at Stanford, Catholic
community here, SIA, RAINS groups, Ivgrad, DB Movie
and Social Committee
105
Acknowledgements: More!
 Jojy Michael, Joshua Easow and families
 Roommates: Omkar Deshpande, Alex Joseph,
Mayur Naik, Rajiv Agrawal, Utkarsh Srivastava,
Rajat Raina, Jim Cybluski, Blake Blailey
 Batchmates and Professors from IITs
 Friends and relatives, grandparents
 sister Dina, and Parents
106
Data Streams
• Traditional DBMS: data stored in finite, persistent data sets
• New applications: data input as continuous, ordered data streams
  - Network and traffic monitoring
  - Telecom call records
  - Network security
  - Financial applications
  - Sensor networks
  - Web logs and clickstreams
  - Massive data sets
107
Scheduling Algorithms for Data Streams
 Minimizing the overhead over the disk system.
Motwani, Thomas. SODA 2004
 Operator Scheduling in Data Stream Systems –
Minimizing memory consumption and latency.
Babu, Babcock, Datar, Motwani, Thomas. VLDB
Journal 2004
 Stanford STREAM Data Manager. Stanford
Stream Group. IEEE Bulletin 2003
108