UTEP
Dr. Luc Longpré
Computer Science Department
Spring 2006
• Examples:
– census data
– medical information
• Privacy: protect the confidentiality of individuals
• Usefulness: want to derive meaningful statistics
• Per person available disk space:
– 1983: 0.02 MB
– 1996: 28 MB
– 2000: 472 MB
• Equivalent of one page per 3 minutes of life
• Misuse of personal health information:
– banker cross-referencing cancer patients with outstanding loans
– using medical records to make decisions about employees
– snooping in hospital computer network
– 40% of insurers disclose personal health information to lenders, employers, marketers, without customer permission
• Access control, encryption:
– Only fixes who has access to what
– Does not protect against disclosures based on inference
• Problem
– Sometimes it may be possible to derive confidential information from released information
• Salary database
• Query: what’s the average salary of white male professors with 2 children who have lived in El Paso, Texas since 1994 and lived in Boston from 1987 to 1994?
• 87% of the US population is uniquely identified by an ID made of:
– 5 digit ZIP,
– gender,
– date of birth
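The uniqueness figure above can be checked on any table by counting how many records carry a combination of ZIP, gender, and date of birth that no other record shares. The following is a minimal sketch on a tiny fictional table; the function name `fraction_unique` and the field names are assumptions made for illustration.

```python
from collections import Counter

def fraction_unique(records, keys):
    """Fraction of records whose combination of `keys` values occurs exactly once."""
    combos = [tuple(r[k] for k in keys) for r in records]
    counts = Counter(combos)
    unique = sum(1 for c in combos if counts[c] == 1)
    return unique / len(records) if records else 0.0

# Fictional records, for illustration only.
people = [
    {"zip": "79912", "gender": "M", "dob": "1917-03-02"},
    {"zip": "79912", "gender": "F", "dob": "1980-07-15"},
    {"zip": "79912", "gender": "F", "dob": "1980-07-15"},
    {"zip": "79902", "gender": "M", "dob": "1975-11-30"},
]

share = fraction_unique(people, ["zip", "gender", "dob"])
```

On this toy table two of the four records are unique under the three fields together, while only one is unique under ZIP alone.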
• Medical database:
– Ethnicity, visit date, diagnosis, procedure, medication, ZIP, birth date, sex
• Voter list:
– Name, address, date registered, ZIP, birth date, sex
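The two lists share ZIP, birth date, and sex, so a medical record with no name can be tied to a voter whenever exactly one voter matches on those three fields. A minimal sketch of such a linkage follows; the function `link`, the field names, and all records are fictional assumptions used only to illustrate the idea.

```python
def link(medical, voters, keys=("zip", "birth_date", "sex")):
    """Return (name, diagnosis) pairs for medical records matching exactly one voter."""
    index = {}
    for v in voters:
        index.setdefault(tuple(v[k] for k in keys), []).append(v)
    matches = []
    for m in medical:
        candidates = index.get(tuple(m[k] for k in keys), [])
        if len(candidates) == 1:  # unambiguous match: re-identification
            matches.append((candidates[0]["name"], m["diagnosis"]))
    return matches

# Fictional data, for illustration only.
medical = [
    {"zip": "79912", "birth_date": "1917-03-02", "sex": "M", "diagnosis": "hypertension"},
    {"zip": "79902", "birth_date": "1980-07-15", "sex": "F", "diagnosis": "asthma"},
]
voters = [
    {"name": "A. Example", "zip": "79912", "birth_date": "1917-03-02", "sex": "M"},
    {"name": "B. Sample", "zip": "79902", "birth_date": "1980-07-15", "sex": "F"},
    {"name": "C. Sample", "zip": "79902", "birth_date": "1980-07-15", "sex": "F"},
]

reidentified = link(medical, voters)
```

Only the first medical record is re-identified here; the second matches two voters and stays ambiguous.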
• Data collected with the purpose of releasing statistical information.
• Important for research, policy
• Facing tremendous demand for person-specific data
– data mining, fraud detection, homeland security
• Possible solution: do not release any statistics on any set of fewer than, say, 10 records
• Query 1: What’s the average salary of males aged 89 in ZIP code 79912?
• Query 2: What’s the average salary of people aged 89 in ZIP code 79912?
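The two queries above can each cover at least 10 records and still expose one person when their result sets differ by a single record. The sketch below assumes the database also reports the size of each query set (an assumption not stated above) and uses fictional salaries; `average_salary` and `MIN_SET_SIZE` are hypothetical names.

```python
MIN_SET_SIZE = 10  # refuse statistics over fewer records than this

def average_salary(db, predicate):
    """Return (count, average) for matching rows, or None if the set is too small."""
    rows = [r for r in db if predicate(r)]
    if len(rows) < MIN_SET_SIZE:
        return None
    return len(rows), sum(r["salary"] for r in rows) / len(rows)

# Fictional data: ten men and one woman, all aged 89 in ZIP code 79912.
db = [{"sex": "M", "age": 89, "zip": "79912", "salary": 30000 + 1000 * i}
      for i in range(10)]
db.append({"sex": "F", "age": 89, "zip": "79912", "salary": 52000})

# A direct query about the one woman is refused.
direct = average_salary(db, lambda r: r["sex"] == "F" and r["age"] == 89)

# Query 1 and Query 2 are both allowed.
n1, avg1 = average_salary(db, lambda r: r["sex"] == "M" and r["age"] == 89 and r["zip"] == "79912")
n2, avg2 = average_salary(db, lambda r: r["age"] == 89 and r["zip"] == "79912")

# The difference of the two totals is the one remaining person's salary.
inferred = n2 * avg2 - n1 * avg1
```

The size restriction blocks the direct query but not the inference, since both allowed answers together determine the hidden salary up to floating-point rounding.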
• Release only information where at least k records are identical (work by Sweeney)
• Attacks are still possible:
– Unsorted matching: use the order of records
• solution: randomize order
– Complementary release: combining k-anonymous releases may not be k-anonymous
• solution: consider all releases together
– Temporal attack: data is dynamic, adding and removing data affects k-anonymous properties
• solution: analyze k-anonymous properties of dynamic data
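The k-anonymity condition can be tested mechanically. The sketch below reads "identical" as identical on a chosen set of identifying fields (an interpretation, not something the bullets spell out), and shows how coarsening ZIP codes can turn a table that fails the test into one that passes. The function name, field names, and records are illustrative assumptions.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())

# Fictional records, for illustration only.
raw = [
    {"zip": "79912", "birth_year": 1917, "diagnosis": "flu"},
    {"zip": "79915", "birth_year": 1917, "diagnosis": "asthma"},
    {"zip": "79902", "birth_year": 1980, "diagnosis": "flu"},
    {"zip": "79901", "birth_year": 1980, "diagnosis": "diabetes"},
]

# Generalize ZIP codes to their first three digits before release.
generalized = [dict(r, zip=r["zip"][:3] + "**") for r in raw]

qi = ["zip", "birth_year"]
raw_ok = is_k_anonymous(raw, qi, 2)
generalized_ok = is_k_anonymous(generalized, qi, 2)
```

The check covers a single static release only; the complementary-release and temporal attacks listed above concern several releases taken together, which this function does not examine.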
• Add noise in the answers
• Add noise in the data
• Limit the kinds of queries allowed to the statistical database
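As an illustration of the first option, a query interface can perturb each answer before returning it. This is a minimal sketch assuming bounded uniform noise, chosen only for simplicity; the function `noisy_average`, the `spread` parameter, and the salaries are assumptions.

```python
import random

def noisy_average(values, spread, rng):
    """Return the true average plus uniform noise drawn from [-spread, spread]."""
    true_avg = sum(values) / len(values)
    return true_avg + rng.uniform(-spread, spread)

# Fictional salaries, for illustration only.
salaries = [41000, 52000, 47000, 60000]
rng = random.Random(0)  # seeded so the sketch is reproducible
answer = noisy_average(salaries, spread=500.0, rng=rng)
```

If fresh noise is drawn on every call, repeating the same query and averaging the answers cancels most of the noise, so a scheme like this would also need to limit or cache repeated queries.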
• Need a formal model, possibly based on information theory
• Measure entropy in database records before and after a statistical release
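One way to make the before-and-after measurement concrete is to count the database states an observer still considers possible. The sketch below assumes two salaries, each equally likely to be 1, 2, 3, or 4 (in arbitrary units), and a release stating that their average is 2.5; every number here is an illustrative assumption.

```python
import math
from itertools import product

# All possible (salary_1, salary_2) pairs, assumed equally likely.
worlds = list(product([1, 2, 3, 4], repeat=2))

# Entropy of a uniform distribution over n outcomes is log2(n) bits.
entropy_before = math.log2(len(worlds))

# Statistical release: the average salary is 2.5.
consistent = [w for w in worlds if sum(w) / 2 == 2.5]
entropy_after = math.log2(len(consistent))

privacy_loss_bits = entropy_before - entropy_after
```

Under these assumptions the release cuts 16 possible states to 4, so 2 of the original 4 bits of uncertainty are lost.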
• Some data is more sensitive than others
– Example: bits in salary
• Common knowledge, information from other databases
– Could define entropy conditional on the available information
– Very impractical in applications
• Some people know some of the records
• Data sensitivity is non-additive
– Ex: don’t mind any single digit of an SSN being released, but do mind all digits being released
• Privacy loss is non-additive
– Ex: There could be 2 sets of information, each of which, if released alone, gives no information, but which, if released together, reveal all the information
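A small construction with this property uses a one-bit secret and a random one-bit pad: one release is the pad, the other is the secret XOR the pad. The sketch below enumerates which secrets remain possible after each observation; the names are hypothetical and the pad is assumed to be uniformly random and independent of the secret.

```python
from itertools import product

def possible_secrets(release, observed):
    """Secrets consistent with seeing `observed`, over all (secret, pad) pairs."""
    return {s for s, pad in product((0, 1), repeat=2) if release(s, pad) == observed}

def release_a(secret, pad):
    return pad            # first release: the pad alone

def release_b(secret, pad):
    return secret ^ pad   # second release: the secret masked by the pad

def release_both(secret, pad):
    return (pad, secret ^ pad)

after_a = possible_secrets(release_a, 1)
after_b = possible_secrets(release_b, 1)
after_both = possible_secrets(release_both, (1, 1))
```

Either release alone leaves both secret values possible, and with a uniform pad equally likely, while the pair pins the secret down to a single value.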
• Denning: “Cryptography and data security”, 1982
• Sweeney: Ph.D. thesis, Applications to medical data, 1996
• A few more scattered results; the topic is becoming popular again as “privacy-preserving data mining”.