Privacy in Statistical Databases

University of Texas at El Paso

Dr. Luc Longpré

Computer Science Department

Spring 2006



Database with

Confidential Information

• Examples:

– census data

– medical information

• Privacy: protect the confidentiality of individuals

• Usefulness: want to derive meaningful statistics



The Need for

Privacy Safeguards

• Per person available disk space:

– 1983: 0.02 MB

– 1996: 28 MB

– 2000: 472 MB

• Equivalent of one page per 3 minutes of life



The Need for

Privacy Safeguards

• Misuse of personal health information:

– banker cross-referencing cancer patients with outstanding loans

– using medical records to make decisions about employees

– snooping in hospital computer network

– 40% of insurers disclose personal health information to lenders, employers, marketers, without customer permission




• Access control, encryption:

– Only control who has access to what

– Do not protect against disclosure by inference

• Problem

– Sometimes it may be possible to derive confidential information from released information




• Salary database

• Query: what’s the average salary of white male professors with 2 children who have lived in El Paso, Texas since 1994 and lived in Boston from 1987 to 1994?




• 87% of the US population is uniquely identified by the combination of:

– 5 digit ZIP,

– gender,

– date of birth
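
A minimal sketch of how such a uniqueness figure could be measured on a table: count how many records have a (ZIP, gender, date of birth) combination that no other record shares. The records and the function name `fraction_unique` are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Hypothetical toy records: (ZIP, gender, date of birth). Not real data.
records = [
    ("79912", "M", "1917-03-02"),
    ("79912", "F", "1917-03-02"),
    ("79912", "M", "1980-07-15"),
    ("79912", "M", "1980-07-15"),
    ("79902", "F", "1975-11-30"),
]

def fraction_unique(rows):
    """Fraction of rows whose full tuple appears exactly once in `rows`."""
    counts = Counter(rows)
    unique = sum(1 for row in rows if counts[row] == 1)
    return unique / len(rows)

print(fraction_unique(records))
```

In this toy table, three of the five records are unique under the combined identifier.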



Linking to Re-Identify Data

• Medical database:

– Ethnicity, visit date, diagnosis, procedure, medication, ZIP, Birth date, Sex

• Voter list:

– Name, address, date registered, ZIP, Birth date,
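
The linking step can be sketched as a join on the fields the two lists have in common. The sketch below assumes the join keys are ZIP and birth date, and uses made-up rows and a hypothetical helper `link`; a medical record counts as re-identified when exactly one voter matches it.

```python
# Hypothetical toy tables; every name and value here is made up.
medical = [
    {"zip": "79912", "birth_date": "1917-03-02", "diagnosis": "diabetes"},
    {"zip": "79902", "birth_date": "1975-11-30", "diagnosis": "asthma"},
]
voters = [
    {"name": "A. Example", "zip": "79912", "birth_date": "1917-03-02"},
    {"name": "B. Sample", "zip": "79912", "birth_date": "1980-07-15"},
]

def link(medical_rows, voter_rows, keys=("zip", "birth_date")):
    """Return (name, diagnosis) pairs for medical rows matched by exactly one voter."""
    index = {}
    for voter in voter_rows:
        index.setdefault(tuple(voter[k] for k in keys), []).append(voter)
    matches = []
    for row in medical_rows:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches.append((candidates[0]["name"], row["diagnosis"]))
    return matches

print(link(medical, voters))
```

Neither table alone ties a name to a diagnosis; the join does.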




Statistical Database

• Data collected with the purpose of releasing statistical information.

• Important for research, policy

• Facing tremendous demand for person-specific data

– data mining, fraud detection, homeland security



Sample Size

• Possible solution: do not release any statistics on any set of fewer than, say, 10 records
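
A minimal sketch of such a query-set-size restriction, assuming records are dictionaries with a `salary` field. The names `average_salary` and `MIN_SET_SIZE` are hypothetical.

```python
MIN_SET_SIZE = 10  # the "say, 10 records" threshold

def average_salary(rows, predicate, min_size=MIN_SET_SIZE):
    """Mean salary of the rows matching `predicate`, or None if too few match."""
    selected = [row["salary"] for row in rows if predicate(row)]
    if len(selected) < min_size:
        return None  # refuse to answer on a small set
    return sum(selected) / len(selected)

# Hypothetical data: twelve people, all age 40.
rows = [{"age": 40, "salary": 50_000 + 1_000 * i} for i in range(12)]

print(average_salary(rows, lambda r: r["age"] == 40))        # answered: 12 records
print(average_salary(rows, lambda r: r["salary"] > 58_000))  # refused: 3 records
```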



Problem Remains

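• Query 1: What’s the average salary of all males age 89 in ZIP code 79912?

• Query 2: What’s the average salary of all people age 89 in ZIP code 79912?

• If exactly one person in the second set is not male, the two answers and the set sizes together reveal that person’s salary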
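
The two queries can be combined as in the sketch below. It assumes hypothetical data with ten men and one woman in the group, so both queries pass a 10-record threshold, and it assumes the attacker also knows the two set sizes (for example from count queries).

```python
def average(values):
    return sum(values) / len(values)

# Hypothetical data: ten men and one woman, all age 89 in ZIP code 79912.
male_salaries = [30_000 + 1_000 * i for i in range(10)]
female_salary = 42_000
everyone = male_salaries + [female_salary]

avg_males = average(male_salaries)  # Query 1, computed over 10 records
avg_all = average(everyone)         # Query 2, computed over 11 records

# Total of everyone minus total of the males leaves the one remaining salary.
recovered = avg_all * len(everyone) - avg_males * len(male_salaries)
print(round(recovered))
```

Each query on its own covers enough records, yet their difference isolates a single individual.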




• Release only information where at least k records are identical (k-anonymity; work by Sweeney)

• Attacks are still possible:

– Unsorted matching: use the order of records

• solution: randomize order
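
A minimal sketch of a k-anonymity check, assuming "identical" is judged on a chosen set of identifying columns (called `quasi_identifiers` here, a name not used on the slide). The release is shuffled before publication so that row order carries no information. All data is hypothetical.

```python
import random
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

# Hypothetical generalized release: ZIP truncated, age put into ranges.
release = [
    {"zip": "799**", "age": "80-89", "diagnosis": "diabetes"},
    {"zip": "799**", "age": "80-89", "diagnosis": "asthma"},
    {"zip": "799**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "799**", "age": "30-39", "diagnosis": "flu"},
]

print(is_k_anonymous(release, ("zip", "age"), 2))
print(is_k_anonymous(release, ("zip", "age"), 3))

# Randomize the row order before publishing (guards against unsorted matching).
published = random.sample(release, k=len(release))
```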




– Complementary release: combining k-anonymous releases may not be k-anonymous

• solution: consider all releases together

– Temporal attack: data is dynamic, adding and removing data affects k-anonymous properties

• solution: analyze k-anonymous properties of dynamic data
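
The complementary release attack can be illustrated with two hypothetical releases of the same four records. One generalizes the ZIP code, the other suppresses the age; each is 2-anonymous on (ZIP, age) alone. In this made-up data the diagnosis happens to be unique per record, so it serves as a join key. The helper `is_k_anonymous` and all values are assumptions for illustration.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

# Release 1: ZIP generalized, age exact.
release_1 = [
    {"zip": "799**", "age": 89, "diagnosis": "diabetes"},
    {"zip": "799**", "age": 89, "diagnosis": "asthma"},
    {"zip": "799**", "age": 34, "diagnosis": "flu"},
    {"zip": "799**", "age": 34, "diagnosis": "migraine"},
]
# Release 2: ZIP exact, age suppressed.
release_2 = [
    {"zip": "79912", "age": "*", "diagnosis": "diabetes"},
    {"zip": "79902", "age": "*", "diagnosis": "asthma"},
    {"zip": "79912", "age": "*", "diagnosis": "flu"},
    {"zip": "79902", "age": "*", "diagnosis": "migraine"},
]

# Join the releases on diagnosis to recover exact ZIP and exact age together.
zip_by_diagnosis = {row["diagnosis"]: row["zip"] for row in release_2}
combined = [
    {"zip": zip_by_diagnosis[row["diagnosis"]], "age": row["age"],
     "diagnosis": row["diagnosis"]}
    for row in release_1
]

print(is_k_anonymous(release_1, ("zip", "age"), 2))
print(is_k_anonymous(release_2, ("zip", "age"), 2))
print(is_k_anonymous(combined, ("zip", "age"), 2))
```

Each release passes the check separately, but the joined table does not, which is why all releases have to be considered together.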



Other Solutions

• Add noise in the answers

• Add noise in the data

• Limit the kinds of queries allowed to the statistical database
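
As one possible reading of "add noise in the answers", the sketch below perturbs the true average with zero-mean Gaussian noise. The choice of a Gaussian distribution, the `scale` parameter, and the name `noisy_average` are assumptions; the slide does not specify a noise model.

```python
import random

def noisy_average(values, scale, rng=random):
    """True mean plus zero-mean Gaussian noise with standard deviation `scale`."""
    true_mean = sum(values) / len(values)
    return true_mean + rng.gauss(0.0, scale)

# Hypothetical salaries; a seeded generator keeps the example repeatable.
salaries = [30_000 + 1_000 * i for i in range(10)]
print(noisy_average(salaries, scale=500.0, rng=random.Random(0)))
```

Larger values of `scale` hide individual contributions better but make the released statistic less useful, which is the privacy versus usefulness trade-off.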



Quantifying Information

• Need a formal model, possibly based on information theory

• Measure entropy in database records before and after a statistical release
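
A minimal sketch of the before/after measurement, assuming the attacker's uncertainty about one record is modeled as a uniform distribution over candidate values and measured with Shannon entropy. The candidate lists and the name `entropy_bits` are hypothetical.

```python
import math
from collections import Counter

def entropy_bits(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical candidate salaries (in thousands) for one individual.
before = [30, 40, 50, 60, 70, 80, 90, 100]  # 8 equally likely values
after = [60, 70]                            # the release rules out all but 2

leaked = entropy_bits(before) - entropy_bits(after)
print(entropy_bits(before), entropy_bits(after), leaked)
```

Under this model, the drop in entropy is the amount of information about the record that the release disclosed.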



Further Complications

• Some data is more sensitive than others

– Example: bits in salary (the high-order bits reveal more than the low-order bits)

• Common knowledge, information from other databases

– Could define entropy conditional on the available information

– Very impractical in applications

• Some people know some of the records


Non-Additivity


• Data sensitivity is non-additive

– Ex: releasing any single digit of an SSN may be acceptable, but releasing all the digits is not

• Privacy loss is non-additive

– Ex: there could be 2 sets of information, each of which reveals nothing if released alone, but which reveal all the information if released together
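
One standard construction with this property is XOR splitting, sketched below as an illustration (it is not taken from the slides). The secret is masked with a random pad: the pad alone is random bytes, the masked value alone is also uniformly random, but XOR-ing the two recovers the secret. The function names and the sample secret are hypothetical.

```python
import secrets

def split_secret(secret: bytes):
    """Split `secret` into two shares, each of which is uniformly random on its own."""
    pad = secrets.token_bytes(len(secret))
    masked = bytes(s ^ p for s, p in zip(secret, pad))
    return pad, masked

def combine(share_a: bytes, share_b: bytes) -> bytes:
    """XOR the two shares back together to recover the secret."""
    return bytes(a ^ b for a, b in zip(share_a, share_b))

share_a, share_b = split_secret(b"salary=52000")
print(combine(share_a, share_b))
```

Measured separately, each release leaks zero bits; measured together, they leak everything, so per-release privacy losses cannot simply be added up.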



Past Research

• Denning: “Cryptography and data security”,


• Sweeney: Ph.D. thesis, Applications to medical data, 1996

• A few more scattered results; the topic is becoming popular again under the name “privacy preserving data mining”.

