Architectures and Algorithms for Data Privacy
Dilys Thomas
Stanford University, April 30th, 2007
Advisor: Rajeev Motwani

RoadMap
• Motivation for Data Privacy Research
• Sanitizing Data for Privacy
• Privacy Preserving OLAP
• K-Anonymity / Clustering for Anonymity
• Probabilistic Anonymity
• Masketeer
• Auditing for Privacy
• Distributed Architectures for Privacy

Motivation 1: Data Privacy in Enterprises
• Health: personal medical details, disease history, clinical research data
• Govt. Agencies: census records, economic surveys, hospital records
• Banking: bank statement, loan details, transaction history
• Finance: portfolio information, credit history, transaction records, investment details
• Manufacturing: process details, blueprints, production data
• Insurance: claims records, accident history, policy details
• Retail Business: inventory records, individual credit card details, audits
• Outsourcing: customer data for testing, remote DB administration, BPO & KPO

Motivation 2: Government Regulations
Country           Privacy Legislation
Australia         Privacy Amendment Act of 2000
European Union    Personal Data Protection Directive 1998
Hong Kong         Personal Data (Privacy) Ordinance of 1995
United Kingdom    Data Protection Act of 1998
United States     Security Breach Information Act (S.B. 1386) of 2002
                  Gramm-Leach-Bliley Act of 1999
                  Health Insurance Portability and Accountability Act of 1996

Motivation 3: Personal Information
• Emails
• Searches on Google/Yahoo
• Profiles on social networking sites
• Passwords / credit card / personal information at multiple e-commerce sites / organizations
• Documents on the computer / network

Losses due to Lack of Privacy: ID-Theft
• 3% of households in the US affected by ID-Theft
• US $5-50B losses/year
• UK £1.7B losses/year
• AUS $1-4B losses/year

Privacy Preserving Data Analysis, i.e. Online Analytical Processing (OLAP)
Computing statistics of data collected from multiple data sources while maintaining the privacy of each individual source.
Agrawal, Srikant, Thomas. SIGMOD 2005

Privacy Preserving OLAP
• Motivation
• Problem Definition
• Query Reconstruction
  • Inversion method: single attribute, multiple attributes
  • Iterative method
• Privacy Guarantees
• Experiments

Horizontally Partitioned Personal Information
• Client C1: original row r1, perturbed to p1
• Client C2: original row r2, perturbed to p2
• ...
• Client Cn: original row rn, perturbed to pn
• Table T for analysis at the server: p1, p2, ..., pn
EXAMPLE: What number of children in this county go to college?

Vertically Partitioned Enterprise Information
Original Relation D1        Perturbed Relation D'1
ID     C1                   ID     C1
John   1                    John   1
Alice  5                    Alice  7
Bob    18                   Bob    18

Original Relation D2        Perturbed Relation D'2
ID     C2  C3               ID     C2  C3
John   27  9                John   35  9
Alice  53  6                Alice  53  7

Perturbed Joined Relation D'
ID     C1  C2  C3
John   1   35  9
Alice  7   53  7

EXAMPLE: What fraction of United customers to New York fly Virgin Atlantic to travel to London?

Privacy Preserving OLAP: Problem Definition
Compute
    select count(*) from T where P1 and P2 and P3 and ... Pk
e.g. find the number of people with age in [30-50] and salary in [80-150],
i.e. COUNT_T(P1 and P2 and P3 and ... Pk)
Goal:
• provide error bounds to the analyst
• provide privacy guarantees to the data sources
• scale to a larger number of attributes

Perturbation Example: Uniform Retention Replacement
Throw a biased coin.
• Heads: retain the value
• Tails: replace it with a random number drawn from a predefined pdf
[Figure: five values, each with its coin flip (one Heads, four Tails) and resulting perturbed value]
BIAS = 0.2. HEADS: RETAIN. TAILS: REPLACE U.A.R. FROM [1-5]

Retention Replacement Perturbation
• Done for each column
• The replacing pdf need not be uniform
• Best to use the original pdf if available or estimable
• Different columns can have different biases for retention

Single Attribute Example
What is the fraction of people in this building with age 30-50?
• Assume age is between 0 and 100.
• Whenever a person enters the building, they flip a coin with heads probability p = 0.2.
  • Heads: report true age (RETAIN)
  • Tails: report a random number uniform in 0-100 (PERTURB)
• In total, 100 randomized numbers are collected. Of these, 22 are in 30-50.
How many among the original are in 30-50?

Analysis
Out of 100 rows: 80 perturbed (0.8 fraction), 20 retained (0.2 fraction).

Analysis Contd.
• 20 retained
• 16 perturbed, Age[30-50]
• 64 perturbed, NOT Age[30-50]
Replacement values are uniform in 0-100, so 20% of the 80 perturbed rows, i.e. 16 of them, are expected to satisfy Age[30-50]. The remaining 64 do not.

Analysis Contd.
• 6 retained, Age[30-50]
• 14 retained, NOT Age[30-50]
• 16 perturbed, Age[30-50]
• 64 perturbed, NOT Age[30-50]
There were 22 randomized rows in [30-50], so 22 - 16 = 6 of them come from the 20 retained rows.

Scaling up
Total Rows    Age[30-50]
20            6
100           30 (estimated)
Thus 30 people had age 30-50 in expectation.
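
The arithmetic on the preceding slides amounts to inverting the retention-replacement step for a single predicate. The following is a minimal sketch under the slide's assumptions (uniform replacement over 0-100, so a replaced value lands in [30-50] with probability b = 0.2). The function names `perturb` and `estimate_original_count` are illustrative and do not come from the talk.

```python
import random


def perturb(values, p, low, high, rng):
    """Retention replacement: keep each value with probability p,
    otherwise replace it with a uniform integer drawn from [low, high]."""
    return [v if rng.random() < p else rng.randint(low, high) for v in values]


def estimate_original_count(observed, n, p, b):
    """Estimate how many of the n original rows satisfy a predicate.

    observed: number of perturbed rows that satisfy the predicate
    n:        total number of rows
    p:        retention probability
    b:        probability that a replacement value satisfies the predicate
    """
    expected_from_replaced = n * (1 - p) * b           # 80 * 0.2 = 16
    from_retained = observed - expected_from_replaced  # 22 - 16 = 6
    return from_retained / p                           # 6 / 0.2 = 30


# Numbers from the Single Attribute Example slide.
estimate = estimate_original_count(observed=22, n=100, p=0.2, b=0.2)
print(round(estimate, 6))
```

The estimate is an expectation, not an exact count: with few rows or a small p, the subtraction can even go negative, which is one reason the talk later considers a constrained iterative method.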

Multiple Attributes (k=2)
P1 = Age[30-50], P2 = Salary[80-150]
Query                 Estimated on T    Evaluated on T'
count(¬P1 ∧ ¬P2)      x0                y0
count(¬P1 ∧ P2)       x1                y1
count(P1 ∧ ¬P2)       x2                y2
count(P1 ∧ P2)        x3                y3

Architecture
[Figure: architecture diagram]

Formally: select count(*) from R where Pred
• p = retention probability (0.2 in the example)
• 1-p = probability that an element is replaced by a draw from the replacing p.d.f.
• b = probability that an element from the replacing p.d.f. satisfies predicate Pred (0.2 in the example, since [30-50] covers 20% of 0-100)
• a = 1-b

Transition matrix
$$
\begin{pmatrix} \mathrm{Count}_T(\neg P) & \mathrm{Count}_T(P) \end{pmatrix}
\begin{pmatrix} (1-p)a + p & (1-p)b \\ (1-p)a & (1-p)b + p \end{pmatrix}
=
\begin{pmatrix} \mathrm{Count}_{T'}(\neg P) & \mathrm{Count}_{T'}(P) \end{pmatrix}
$$
i.e. solve $xA = y$.
• $A_{00}$ = probability that an original element satisfies $\neg P$ and after perturbation still satisfies $\neg P$
• $p$ = probability it was retained
• $(1-p)a$ = probability it was perturbed and satisfies $\neg P$
• $A_{00} = (1-p)a + p$

Multiple Attributes
For k attributes, x and y are vectors of size $2^k$, and
$$x = y A^{-1}, \qquad A = A_1 \otimes A_2 \otimes \cdots \otimes A_k \quad \text{[tensor product]}$$
where $A_i$ is the transition matrix for column i.

Error Bounds
In our example, we want to say that when the estimated answer is 30, the actual answer lies in [28-32] with probability greater than 0.9.
Given $T \rightarrow_{\alpha} T'$, with n rows, $f(T)$ is $(n, \epsilon, \delta)$ reconstructible by $g(T')$ if $|f(T) - g(T')| < \max(\epsilon, \epsilon f(T))$ with probability greater than $(1 - \delta)$.
$\epsilon f(T) = 2$, $\delta = 0.1$ in the above example.

Theoretical Basis and Results
• Theorem: The fraction, f, of rows in [low, high] in the original table, estimated by matrix inversion on the table obtained after uniform perturbation, is an $(n, \epsilon, \delta)$ estimator for f if $n > 4 \log(2/\delta)\,(p\epsilon)^{-2}$, by Chernoff bounds.
• Theorem: The vector x obtained by matrix inversion is the MLE (maximum likelihood estimator), shown by using the Lagrangian multiplier method and showing that the Hessian is negative.

Iterative Algorithm [AS00]
• Initialize: $x^{0} = y$
• Iterate (by application of Bayes' rule):
$$x_p^{T+1} = \sum_{q=0}^{t} y_q \, \frac{a_{pq}\, x_p^{T}}{\sum_{r=0}^{t} a_{rq}\, x_r^{T}}$$
• Stop condition: two consecutive x iterates do not differ much

Iterative Algorithm
We proved:
• Theorem: The Inversion Algorithm gives the MLE.
• Theorem [AA01]: The Iterative Algorithm gives the MLE with the additional constraint that $x_i \ge 0$ for all $0 \le i \le 2^k - 1$.
  • Models the fact that probabilities are non-negative
  • Gives better results, as shown in the experiments

Privacy Guarantees
• Say we initially know with probability < 0.3 that Alice's age > 25.
• After seeing the perturbed value, we can say so with probability > 0.95.
• Then we say there is a (0.3, 0.95) privacy breach.
• A more subtle notion, differential privacy, is covered in the thesis.

Experiments
• Real data: census data from the UCI Machine Learning Repository, with 32000 rows
• Synthetic data: multiple generated columns of Zipfian data, number of rows varied between 1000 and 1000000
• Error metric: l1 norm of the difference between x and y.
• This is the L1 norm between 2 probability distributions, e.g. for 1-dim queries: |x1 - y1| + |x0 - y0|

Inversion vs Iterative Reconstruction
[Figure panels: 2 attributes, Census Data; 3 attributes, Census Data]
The Iterative algorithm (MLE on the constrained space) outperforms Inversion (global MLE).

Error as a function of Number of Columns: Iterative Algorithm, Zipf Data
[Figure]
The error in the iterative algorithm flattens out, as its maximum value is bounded by 2.

Error as a function of Number of Columns: Census Data
[Figure panels: Inversion Algorithm; Iterative Algorithm]
Error increases exponentially with the number of columns.

Error as a function of Number of Rows
[Figure]
Error decreases as the number of rows, n, increases.

Conclusion
• It is possible to run OLAP on data across multiple servers so that probabilistically approximate answers are obtained and data privacy is maintained.
• The techniques have been tested experimentally on real and synthetic data. More experiments are in the paper.
• Privacy Preserving OLAP is practical.

Anonymizing Tables: ICDT05
Creating tables that do not identify individuals, for research or outsourced software development purposes.
Aggarwal, Feder, Kenthapadi, Motwani, Panigrahy, Thomas, Zhu

Achieving Anonymity via Clustering: PODS06
Aggarwal, Feder, Kenthapadi, Khuller, Panigrahy, Thomas, Zhu
Probabilistic Anonymity (submitted)
Lodha, Thomas

Data Privacy
• Value disclosure: what is the value of attribute salary of person X
  • Perturbation
  • Privacy Preserving OLAP
• Identity disclosure: whether an individual is present in the database table
  • Randomization, K-Anonymity etc.
  • Data for Outsourcing / Research

Original Dataset
Identifying: SSN, Name. Sensitive: Disease.
SSN   Name     DOB        Gender   Zip code   Disease
614   Sara     03/04/76   F        94305      Flu
615   Joan     07/11/80   F        94307      Cold
629   Karan    05/09/55   M        94301      Diabetes
710   Harris   11/23/62   M        94305      Flu
840   Carl     11/23/62   M        94059      Arthritis
780   Amanda   01/07/50   F        94042      Heart problem
619   Rob      04/08/43   M        94042      Arthritis

Randomized Dataset
Identifying: SSN, Name. Sensitive: Disease.
SSN   Name     DOB        Gender   Zip code   Disease
101   Amy      03/04/76   F        94305      Flu
102   Betty    07/11/80   F        94307      Cold
103   Clarke   05/09/55   M        94301      Diabetes
104   David    11/23/62   M        94305      Flu
105   Earl     11/23/62   M        94059      Arthritis
106   Finy     01/07/50   F        94042      Heart problem
107   George   04/08/43   M        94042      Arthritis

Quasi-Identifiers
The quasi-identifiers (DOB, Gender, Zip code) uniquely identify you! Sensitive: Disease.
Quasi-identifiers: approximate foreign keys.
DOB        Gender   Zip code   Disease
03/04/76   F        94305      Flu
07/11/80   F        94307      Cold
05/09/55   M        94301      Diabetes
12/30/72   M        94305      Flu
11/23/62   M        94059      Arthritis
01/07/50   F        94042      Heart problem
04/08/43   M        94042      Arthritis

k-Anonymity Model [Swe00]
• Modify some entries of the quasi-identifiers so that each modified row becomes identical to at least k-1 other rows with respect to the quasi-identifiers.
• Individual records are hidden in a crowd of size k.

Original Table
         Age   Salary
Amy      25    50
Brian    27    60
Carol    29    100
David    35    110
Evelyn   39    120

Suppressing all entries: No Utility
         Age   Salary
Amy      *     *
Brian    *     *
Carol    *     *
David    *     *
Evelyn   *     *

2-Anonymity with Clustering
         Age       Salary
Amy      [25-29]   [50-100]
Brian    [25-29]   [50-100]
Carol    [25-29]   [50-100]
David    [35-39]   [110-120]
Evelyn   [35-39]   [110-120]
Cluster centers published:
• 27 = (25+27+29)/3, 70 = (50+60+100)/3
• 37 = (35+39)/2, 115 = (110+120)/2
Clustering formulation: NP-hard

Clustering Metrics
[Figure: three clusters labelled "10 points, radius 5", "50 points, radius 15", "20 points, radius 10"]

r-center Clustering: Minimize Maximum Cluster Size
[Figure: clusters, each annotated 2d]

Cellular Clustering: Linear Program
Minimize $\sum_c \left( \sum_i x_{ic}\, d_c + f_c\, y_c \right)$ (sum of cellular cost and facility cost)
Subject to:
• $\sum_c x_{ic} \ge 1$: each point belongs to a cluster
• $x_{ic} \le y_c$: a cluster must be opened for a point to belong to it
• $0 \le x_{ic} \le 1$: points belong to clusters positively
• $0 \le y_c \le 1$: clusters are opened positively

Quasi-identifier
Fruit column: Apple, Guava, Orange, Apple, Banana
• A 0.6 fraction of the rows is uniquely identified by Fruit. Hence Fruit is a 0.6 quasi-identifier.
• A 0.87 fraction of the U.S. population is uniquely identified by (DOB, Gender, Zipcode), hence it is a 0.87 quasi-identifier.

Quasi-Identifier
Find the probability distribution over D distinct values that maximizes the expected fraction of uniquely identified records.
D distinct values, n rows:
• If $D \le n$: $D/(en)$ (skewed distribution)
• Else: $e^{-n/D}$ (uniform distribution)

Distinct values: Identifier
• DOB: $60 \times 365 \approx 2 \times 10^4$
• Gender: 2
• Zipcode: $10^5$
• (DOB, Gender, Zipcode) together have $2 \times 10^4 \times 2 \times 10^5 = 4 \times 10^9$ distinct values
• US population = $3 \times 10^8$
• Fraction of singletons = $e^{-3 \times 10^8 / (4 \times 10^9)} = e^{-0.075} \approx 0.93$

Distinct values and K-anonymity
E.g. apply HIPAA to (Age in Years, Zipcode, Gender, Doctor details).
• Want $k = 20{,}000 = 2 \times 10^4$ anonymity with $n$ = 300 million = $3 \times 10^8$ people.
• The number of distinct values is $D = n/k = 1.5 \times 10^4$.
• D = z (zipcode) × 100 (age in years) × 2 (gender) = 200z
• $1.5 \times 10^4 = 200z$, so $z = 75$, i.e. about $10^2$.
• Retain the first two digits of the zipcode (retain states).

Experiments
• Efficient algorithms, based on randomized algorithms to find quantiles in small space
• 10 seconds to anonymize a quarter million rows, or approximately 3 GB per hour, on a machine with a 2.66 GHz processor, 504 MB RAM, Windows XP with Service Pack 2
• An order of magnitude better in running time for a quasi-identifier of size 10 than the previous implementation
• Optimal algorithms to anonymize the dataset
• Scalable:
  • almost independent of the anonymity parameter k
  • linear in quasi-identifier size (previously exponential)
  • linear in dataset size

Masketeer: A tool for data privacy
Das, Lodha, Patwardhan, Sundaram, Thomas.
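
The distinct-value arithmetic on the quasi-identifier slides earlier in this section can be checked numerically. This is a small sketch that simply evaluates the two expressions from the "Quasi-Identifier" slide and the zipcode budget from the "Distinct values and K-anonymity" slide, using the slides' rounded figures. The function names `max_unique_fraction` and `zip_values_for_k` are illustrative and not from the talk.

```python
import math


def max_unique_fraction(distinct, n):
    """Maximum expected fraction of uniquely identified records, using the
    expressions on the slide: D/(e*n) if D <= n, otherwise e^(-n/D)."""
    if distinct <= n:
        return distinct / (math.e * n)
    return math.exp(-n / distinct)


def zip_values_for_k(n, k, other_distinct):
    """Number of distinct zipcode values z such that z * other_distinct = n / k."""
    return (n / k) / other_distinct


# (DOB, Gender, Zipcode) with the slide's rounded counts.
d = 2e4 * 2 * 1e5                      # 4e9 distinct combinations
frac = max_unique_fraction(d, 3e8)     # e^(-0.075)

# k = 20,000 over 300 million people; age (100 values) and gender (2 values).
z = zip_values_for_k(n=3e8, k=2e4, other_distinct=100 * 2)

print(f"fraction of singletons: {frac:.4f}")
print(f"zipcode values allowed: {z:.0f}")
```

The exact zipcode budget comes out at 75, which the slide rounds to roughly 10^2, i.e. two retained zipcode digits.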

Auditing Batches of SQL Queries
Given a set of SQL queries that have been posed over a database, determine whether some subset of these queries has revealed private information about an individual or a group of individuals.
Motwani, Nabar, Thomas. PDM Workshop with ICDE 2007

Example
Query:
    SELECT zipcode FROM Patients p WHERE p.disease = 'diabetes'
Audit expressions:
    AUDIT zipcode FROM Patients p WHERE p.disease = 'high blood pressure'
        Not suspicious wrt this
    AUDIT disease FROM Patients p WHERE p.zipcode = 94305
        Suspicious if someone in 94305 has diabetes

Query Suspicious wrt an Audit Expression
• If all columns of the audit expression are covered by the query, and
• If the audit expression and the query have at least one tuple in common

SQL Batch Auditing
[Figure: Query 1, Query 2, Query 3 and Query 4 overlaid on an audit expression; the columns of an audited tuple are covered]
• A query batch is semantically suspicious wrt an audit expression iff the queries together cover all audited columns of at least one audited tuple on table T.
• It is syntactically suspicious iff this holds on some table T.

Syntactic and Semantic Auditing
• Checking for semantic suspiciousness has a polynomial time algorithm.
• Checking for syntactic suspiciousness is NP-complete.

Two Can Keep a Secret: A Distributed Architecture for Secure Database Services
How to distribute data across multiple sites for (1) redundancy and (2) privacy, so that a single site being compromised does not lead to data loss.
Aggarwal, Bawa, Ganesan, Garcia-Molina, Kenthapadi, Motwani, Srivastava, Thomas, Xu. CIDR 2005

Distributing Data and Partitioning and Integrating Queries for Secure Distributed Databases
Feder, Ganapathy, Garcia-Molina, Motwani, Thomas. Work in Progress

Motivation
• Data outsourcing is growing in popularity
  • Cheap, reliable data storage and management
  • 1 TB for $399: less than $0.5 per GB
  • $5000: Oracle 10g / SQL Server
  • $68k/year: DB admin
• Privacy concerns are looming ever larger
  • High-profile thefts (often insiders)
  • UCLA lost 900k records
  • Berkeley lost a laptop with sensitive information
  • Acxiom, JP Morgan, Choicepoint
  • www.privacyrights.org

Present solutions
• Application level: Salesforce.com
  • On-Demand Customer Relationship Management
  • $65/User/Month, or $995 / 5 Users / 1 Year
• Amazon Elastic Compute Cloud
  • 1 instance = 1.7 GHz x86 processor, 1.75 GB RAM, 160 GB local disk, 250 Mb/s network bandwidth
  • Elastic, completely controlled, reliable, secure
  • $0.10 per instance hour
  • $0.20 per GB of data in/out of Amazon
  • $0.15 per GB-month of Amazon S3 storage used
• Google Apps for your domain
  • Small businesses, enterprise, school, family or group

Encryption Based Solution
[Figure labels: Client; Query Q; Answer; Encrypt; Client-side Processor; Q'; "Relevant Data"; DSP]
Problem: Q' ≈ "SELECT *"

The Power of Two
[Figure: Client, DSP1, DSP2]

The Power of Two
[Figure: Query Q enters the client-side processor, which sends Q1 to DSP1 and Q2 to DSP2]
Key: ensure Cost(Q1) + Cost(Q2) ≈ Cost(Q)

SB1386 Privacy
• {Name, SSN}, {Name, LicenceNo}, {Name, CaliforniaID}, {Name, AccountNumber}, {Name, CreditCardNo, SecurityCode} are all to be kept private.
• A set is private if at least one of its elements is "hidden".
• An element in encrypted form is ok.

Techniques
• Vertical Fragmentation
  • Partition attributes across R1 and R2
  • E.g., to obey constraint {Name, SSN}: R1 ← Name, R2 ← SSN
  • Use tuple IDs for reassembly: R = R1 JOIN R2
• Encoding
  • One-time pad: for each value v, construct a random bit sequence r; R1 ← v XOR r, R2 ← r
  • Deterministic encryption: R1 ← E_K(v), R2 ← K. Can detect equality and push selections with an equality predicate.
  • Random addition: R1 ← v + r, R2 ← r. Can push aggregate SUM.

Example
An Employee relation: {Name, DoB, Position, Salary, Gender, Email, Telephone, ZipCode}
Privacy constraints:
• {Telephone}, {Email}
• {Name, Salary}, {Name, Position}, {Name, DoB}
• {DoB, Gender, ZipCode}
• {Position, Salary}, {Salary, DoB}
Will use just Vertical Fragmentation and Encoding.

Example (2)
Constraints: {Telephone}, {Email}, {Name, Salary}, {Name, Position}, {Name, DoB}, {DoB, Gender, ZipCode}, {Position, Salary}, {Salary, DoB}
[Figure: the attributes ID, Name, DoB, Position, Salary, Gender, Email, Telephone, ZipCode divided between R1 and R2; the assignment is not recoverable from the extracted text]

Partitioning, Execution
• Partitioning Problem
  • Partition to minimize communication cost for a given workload
  • Even a simplified version is hard to approximate
  • Hill-climbing algorithm after starting with weighted set cover
• Query Reformulation and Execution
  • Consider only centralized plans
  • Algorithm to partition select and where clause predicates between the two partitions

Thank You!

Acknowledgements: Stanford Faculty
• Advisor: Rajeev Motwani
• Members of Orals Committee: Rajeev Motwani, Hector Garcia-Molina, Dan Boneh, John Mitchell, Ashish Goel
• Many other professors at Stanford, esp. Jennifer Widom

Acknowledgements: Projects
• STREAM: Jennifer Widom, Rajeev Motwani
• PORTIA: Hector Garcia-Molina, Rajeev Motwani, Dan Boneh, John Mitchell
• TRUST: Dan Boneh, John Mitchell, Rajeev Motwani, Hector Garcia-Molina
• RAIN: Rajeev Motwani, Ashish Goel, Amin Saberi

Acknowledgements: Internship Mentors
Rakesh Agrawal, Ramakrishnan Srikant, Surajit Chaudhuri, Nicolas Bruno, Phil Gibbons, Sachin Lodha, Anand Rajaraman

Acknowledgements: CoAuthors [A-K]
Gagan Aggarwal, Rakesh Agrawal, Arvind Arasu, Brian Babcock, Shivnath Babu, Mayank Bawa, Nicolas Bruno, Renato Carmo, Surajit Chaudhuri, Mayur Datar, Prasenjit Das, A A Diwan, Tomás Feder, Vignesh Ganapathy, Prasanna Ganesan, Hector Garcia-Molina, Keith Ito, Krishnaram Kenthapadi, Samir Khuller, Yoshiharu Kohayakawa

Acknowledgements: CoAuthors [L-Z]
Eduardo Sany Laber, Sachin Lodha, Nina Mishra, Rajeev Motwani, Shubha Nabar, Itaru Nishizawa, Liadan Boyen, Rina Panigrahy, Nikhil Patwardhan, Ramakrishnan Srikant, Utkarsh Srivastava, S. Sudarshan, Sharada Sundaram, Rohit Varma, Jennifer Widom, Ying Xu, An Zhu

Acknowledgements: Others not in previous list
• Aristides, Gurmeet, Aleksandra, Sergei, Damon, Anupam, Arnab, Aaron, Adam, Mukund, Vivek, Anish, Parag, Vijay, Piotr, Moses, Sudipto, Bob, David, Paul, Zoltan etc.
• Members of Rajeev's group; the Stanford Theory, Database and Security groups
• Also many PhD students of the incoming year 2002 (Paul etc.) and many other students at Stanford
• Lynda, Maggie, Wendy, Jam, Kathi, Claire, Meredith for administrative help
• Andy, Miles, Lilian for keeping the machines running!
• Various outing clubs and groups at Stanford, the Catholic community here, SIA, RAINS groups, Ivgrad, DB Movie and Social Committee

Acknowledgements: More!
• Jojy Michael, Joshua Easow and families
• Roommates: Omkar Deshpande, Alex Joseph, Mayur Naik, Rajiv Agrawal, Utkarsh Srivastava, Rajat Raina, Jim Cybluski, Blake Blailey
• Batchmates and professors from the IITs
• Friends and relatives, grandparents, sister Dina, and parents

Data Streams
• Traditional DBMS: data stored in finite, persistent data sets
• New applications: data input as continuous, ordered data streams
  • Network and traffic monitoring
  • Telecom call records
  • Network security
  • Financial applications
  • Sensor networks
  • Web logs and clickstreams
  • Massive data sets

Scheduling Algorithms for Data Streams
• Minimizing the overhead over the disk system. Motwani, Thomas. SODA 2004
• Operator Scheduling in Data Stream Systems: Minimizing memory consumption and latency. Babu, Babcock, Datar, Motwani, Thomas. VLDB Journal 2004
• Stanford STREAM Data Manager. Stanford Stream Group. IEEE Bulletin 2003
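
Returning to the encoding techniques listed on the "Techniques" slide of the distributed architecture section, two of them (one-time pad and random addition) can be illustrated in a few lines. This is a hedged sketch and not the system's implementation: the function names are hypothetical, and the fixed modulus used for random addition is an assumption of this example that the slide does not specify.

```python
import secrets

MODULUS = 2**64  # assumption of this sketch: values and pads live in Z_{2^64}


def split_one_time_pad(value: bytes):
    """Return (share for R1, share for R2) = (v XOR r, r) for a fresh random r."""
    pad = secrets.token_bytes(len(value))
    masked = bytes(a ^ b for a, b in zip(value, pad))
    return masked, pad


def join_one_time_pad(masked: bytes, pad: bytes) -> bytes:
    """Recombine the two shares: (v XOR r) XOR r = v."""
    return bytes(a ^ b for a, b in zip(masked, pad))


def split_random_addition(value: int):
    """Return (v + r mod m, r); neither share alone reveals v."""
    r = secrets.randbelow(MODULUS)
    return (value + r) % MODULUS, r


# Pushing SUM: each site sums its own shares, the client subtracts.
salaries = [50, 60, 100]
shares = [split_random_addition(s) for s in salaries]
sum_r1 = sum(share for share, _ in shares)   # computed at the site holding R1
sum_r2 = sum(pad for _, pad in shares)       # computed at the site holding R2
total = (sum_r1 - sum_r2) % MODULUS          # combined at the client

print(total)
```

Random addition keeps SUM computable at the two sites because addition commutes with the masking, whereas the XOR pad hides values more simply but requires fetching both shares before anything can be computed on them.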