Two Can Keep a Secret: A Distributed Architecture for Secure Database Services Gagan Aggarwal, Mayank Bawa, Prasanna Ganesan, Hector Garcia-Molina, Krishnaram Kenthapadi, Rajeev Motwani, Utkarsh Srivastava, Dilys Thomas, Ying Xu Stanford University 1 Motivation Data outsourcing growing in popularity – Cheap, reliable data storage and management Privacy concerns looming ever larger – High-profile thefts (often insiders) – Govt. legislation, e.g., California SB 1386 The Cure: Secure Database Service [KC04] – Outsource to Database Service Provider (DSP) – …but DSP cannot “see” data 2 The Crypto Approach [HILM02, HIM04,AKSX04] Client Query Q Answer Encrypt Client-side Processor DSP Q’ “Relevant Data” Problem: Q’ “SELECT *” 3 The Power of Two Client DSP1 DSP2 4 The Power of Two Query Q Q1 DSP1 Q2 DSP2 Client-side Processor Key: Ensure Cost (Q1)+Cost (Q2) Cost (Q) 5 Agenda Defining Privacy Requirements Tools for Database Decomposition Example Finding a “good” decomposition Query reformulation and execution Open Questions 6 Privacy according to SB 1386 ”…first name or first initial and last name in combination with any one or more of the following data elements, when either the name or the data elements are not encrypted: (1) Social Security Number. (2) Driver’s license number or California Identification Card number. (3) Account number, credit or debit card number, in combination with any required security code…” 7 In the language of set theory… { Name, SSN}, { Name, LicenceNo} { Name, CaliforniaID} { Name, AccountNumber} { Name, CreditCardNo, SecurityCode} are all to be kept private. A set is private if at least one of its elements is “hidden”. – Element in encrypted form ok 8 Defining A Private Decomposition R1 R R2 Given a set of privacy constraints – Each constraint is a set of attributes Adversary knows either R1 or R2 – Insider: views all data, queries at site Ensure for each constraint: – At least one attribute is “opaque” to adversary – i.e., neither R1 nor R2 exposes all attributes – We won’t define “opaque” 9 Relation Decomposition Charter “Bury them nature will – Break upand “universal” relation R into R1 and R2 care of the rest” –take Lossless, privacy-preserving decomposition – Note: Restriction to “relational” algebra Tools of the Trade – – – – Fragmentation Encoding Semantic Attribute Decomposition Noise 10 Tools of the Trade: Fragmentation Horizontal Fragmentation – R = R1 U R2 – Not too exciting (yet?) Vertical Fragmentation – Partition attributes across R1 and R2 – E.g., to obey constraint {Name, SSN}, R1 Name, R2 SSN – Use tuple IDs for reassembly. R = R1 JOIN R2 11 Tools of the Trade: Encoding Encode attribute across both R1 and R2 – Need both parts to reconstruct – Why? Sensitive attributes, e.g., Email – Different options with privacy vs. query cost (computation, communication) trade-offs E.g., One-time Pad – – – – For each value v, construct random bit seq. r R1 v XOR r, R2 r Reconstruction: (v XOR r) XOR r = v Perfect privacy, Expensive? 12 Tools of the Trade: Encoding (2) Deterministic Encryption – R1 EK (v) R2 K – Leaks information. E.g., can detect equality – Can push selections with equality predicate Random addition – R1 v+r , R2 r – Can push aggregate SUM – Problem: Information leak (what is “opaque”?) 13 More Tools of the Trade Semantic Attribute Decomposition – Extract “public” data from private attrs. – E.g., Area code of PhoneNo, Domain name of Email – Useful for filtering selections, aggregates Adding Noise – Add “dangling tuples” to R1 and R2 14 Example An Employee relation: {Name, DoB, Position, Salary, Gender, Email, Telephone, ZipCode} Privacy Constraints – – – – {Telephone}, {Email} {Name, Salary}, {Name, Position}, {Name, DoB} {DoB, Gender, ZipCode} {Position, Salary}, {Salary, DoB} Will use just Vertical Fragmentation and Encoding. 15 Example (2) Constraints {Telephone} {Email} {Name, Salary} {Name, Position} {Name, DoB} {DoB, Gender,ZipCode} {Position, Salary} {Salary, DoB} R1 Salary ID Name DoB Position Salary Gender Email Telephone ZipCode ID R2 16 Finding a “Good” Decomposition Find a decomposition that – Obeys all privacy constraints – Minimizing execution cost for given workload Complicated optimization problem – After 3 layers of simplification, NP-hard to approximate! Multiple heuristics based on min-cuts and set cover – E.g., (1) Find lots of “efficient” decompositions (2) Modify to obey constraints and pick the best 17 Query Reformulation and Execution Overview – Take original query Q on R – Rewrite to get Q1(R1) and Q2(R2) – Combine results Key idea – – – – Take query plan for Q Replace R by R1 JOIN R2 Push down selects, projects, aggregates Partition plan into two Space of plans – Different ways to do joins – symmetric, semi-joins – Note: Q2 can depend on result of Q1 18 Open Questions Does the idea work? Compare efficiency to an encryption-based scheme – Need evaluation methodology Generalize decomposition – Deal with multiple relations, functional dependencies, normal forms – Allow attribute replication Expand space of decompositions – We only considered simple encoding and vert. fragmentation 19 Open Questions (2) Details of the 2-database architecture – How much functionality ends up on client side? – How to handle other DB functions? Access Control? Constraint checking? How about two logical DBs with disjoint administration? 20