Privacy in Data Management
Sharad Mehrotra
1
Privacy - definitions
Generic
- Privacy is the interest that individuals have in sustaining a 'personal space', free from interference by other people and organizations.
Information Privacy
- The degree to which an individual can determine which personal information is to be shared with whom and for what purpose.
- The evolving relationship between technology and the legal right to, or public expectation of, privacy in the collection and sharing of data.
Identity privacy (anonymity)
- Anonymity of an element (belonging to a set) refers to the property of that element of not being identifiable within the set, i.e., being indistinguishable from the other elements of the set.
2
Means of achieving privacy
Information Security is the process of protecting data from unauthorized
access, use, disclosure, destruction, modification, or disruption.
Enforcing security in information processing applications:
1. Law
2. Access control
3. Data encryption
4. Data transformation – statistical disclosure control
Techniques used depend on
- Application semantics/functionality requirements
- Nature of data
- Privacy requirement/metrics
Privacy is contextual
3
Overview
Study the nature of privacy in the context of
data-centric applications
1. Privacy-preserving data publishing for data mining
applications
2. Secure outsourcing of data: “Database as A Service
(DAS)”
3. Privacy-preserving implementation of pervasive
spaces
4. Secure data exchange and sharing between multiple
parties
4
Privacy-Preserving / Anonymized
Data Publishing
5
Why Anonymize?
 For Data Sharing
 Give real(istic) data to others to study without
compromising privacy of individuals in the data
 Allows third-parties to try new analysis and mining
techniques not thought of by the data owner
 For Data Retention and Usage
 Various requirements prevent companies from retaining
customer information indefinitely
 E.g. Google progressively anonymizes IP addresses in
search logs
 Internal sharing across departments (e.g. billing → marketing)
6
Why Privacy?
 Data subjects have inherent right and expectation of privacy
 “Privacy” is a complex concept (beyond the scope of this tutorial)
 What exactly does “privacy” mean? When does it apply?
 Could there exist societies without a concept of privacy?
 Concretely: at collection “small print” outlines privacy rules
 Most companies have adopted a privacy policy
 E.g. AT&T privacy policy att.com/gen/privacy-policy?pid=2506
 Significant legal framework relating to privacy
 UN Declaration of Human Rights, US Constitution
 HIPAA, Video Privacy Protection, Data Protection Acts
7
Case Study: US Census
 Raw data: information about every US household
 Who, where; age, gender, racial, income and educational data
 Why released: determine representation, planning
 How anonymized: aggregated to geographic areas (Zip code)
 Broken down by various combinations of dimensions
 Released in full after 72 years
 Attacks: no reports of successful deanonymization
 Recent attempts by FBI to access raw data rebuffed
 Consequences: greater understanding of US population
 Affects representation, funding of civil projects
 Rich source of data for future historians and genealogists
8
Case Study: Netflix Prize
 Raw data: 100M dated ratings from 480K users to 18K movies
 Why released: improve predicting ratings of unlabeled examples
 How anonymized: exact details not described by Netflix
 All direct customer information removed
 Only subset of full data; dates modified; some ratings deleted
 Movie title and year published in full
 Attacks: dataset is claimed vulnerable [Narayanan Shmatikov 08]
 Attack links data to IMDB where same users also rated movies
 Find matches based on similar ratings or dates in both
 Consequences: rich source of user data for researchers
 unclear if attacks are a threat—no lawsuits or apologies yet
9
Case Study: AOL Search Data
 Raw data: 20M search queries for 650K users from 2006
 Why released: allow researchers to understand search patterns
 How anonymized: user identifiers removed
 All searches from same user linked by an arbitrary identifier
 Attacks: many successful attacks identified individual users
 Ego-surfers: people typed in their own names
 Zip codes and town names identify an area
 NY Times identified 4417749 as 62yr old GA widow [Barbaro Zeller
06]
 Consequences: CTO resigned, two researchers fired
 Well-intentioned effort failed due to inadequate anonymization
10
Three Abstract Examples
 “Census” data recording incomes and demographics
 Schema: (SSN, DOB, Sex, ZIP, Salary)
 Tabular data—best represented as a table
 “Video” data recording movies viewed
 Schema: (Uid, DOB, Sex, ZIP), (Vid, title, genre), (Uid, Vid)
 Graph data—graph properties should be retained
 “Search” data recording web searches
 Schema: (Uid, Kw1, Kw2, …)
 Set data—each user has different set of keywords
 Each example has different anonymization needs
11
Models of Anonymization
 Interactive Model (akin to statistical databases)
 Data owner acts as “gatekeeper” to data
 Researchers pose queries in some agreed language
 Gatekeeper gives an (anonymized) answer, or refuses to
answer
 “Send me your code” model
 Data owner executes code on their system and reports result
 Cannot be sure that the code is not malicious
 Offline, aka “publish and be damned” model
 Data owner somehow anonymizes data set
 Publishes the results to the world, and retires
 Our focus in this tutorial – seems to model most real releases
12
Objectives for Anonymization
 Prevent (high confidence) inference of associations
 Prevent inference of salary for an individual in “census”
 Prevent inference of individual’s viewing history in “video”
 Prevent inference of individual’s search history in “search”
 All aim to prevent linking sensitive information to an individual
 Prevent inference of presence of an individual in the data set
 Satisfying “presence” also satisfies “association” (not vice-versa)
 Presence in a data set can violate privacy (eg STD clinic patients)
 Have to model what knowledge might be known to attacker
 Background knowledge: facts about the data set (X has salary Y)
 Domain knowledge: broad properties of data (illness Z rare in
men)
13
Utility
 Anonymization is meaningless if utility of data not
considered
 The empty data set has perfect privacy, but no utility
 The original data has full utility, but no privacy
 What is “utility”? Depends what the application is…
 For fixed query set, can look at max, average
distortion
 Problem for publishing: want to support unknown
applications!
 Need some way to quantify utility of alternate
anonymizations
14
Measures of Utility
 Define a surrogate measure and try to optimize
 Often based on the “information loss” of the anonymization
 Simple example: number of rows suppressed in a table
 Give a guarantee for all queries in some fixed class
 Hope the class is representative, so other uses have low
distortion
 Costly: some methods enumerate all queries, or all
anonymizations
 Empirical Evaluation
 Perform experiments with a reasonable workload on the result
 Compare to results on original data (e.g. Netflix prize problems)
 Combinations of multiple methods
 Optimize for some surrogate, but also evaluate on real queries
15
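To make the surrogate-measure idea above concrete, here is a minimal Python sketch (not from the slides; the table layout, attribute domains, and the specific loss formula are illustrative assumptions). It charges each generalized numeric cell the fraction of its attribute's domain that the generalization spans, charges suppressed rows the maximum loss of 1 per cell, and averages.

def cell_loss(interval, domain):
    # loss of one generalized numeric cell = width of its interval / width of the attribute domain
    (lo, hi), (d_lo, d_hi) = interval, domain
    return (hi - lo) / (d_hi - d_lo)

def information_loss(generalized_rows, domains, n_suppressed):
    # generalized_rows: list of rows, each row a list of (lo, hi) intervals, one per quasi-identifier
    n_attrs = len(domains)
    total = sum(cell_loss(row[a], domains[a])
                for row in generalized_rows for a in range(n_attrs))
    total += n_suppressed * n_attrs              # a suppressed row loses everything
    n_rows = len(generalized_rows) + n_suppressed
    return total / (n_rows * n_attrs)

# Example: two rows with Age generalized to [30, 40) and [20, 50) over domain [0, 120), one row suppressed
print(information_loss([[(30, 40)], [(20, 50)]], [(0, 120)], n_suppressed=1))   # ~0.44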
Definitions of Technical Terms
 Identifiers–uniquely identify, e.g. Social Security Number (SSN)
 Step 0: remove all identifiers
 Was not enough for AOL search data
 Quasi-Identifiers (QI)—such as DOB, Sex, ZIP Code
 Enough to partially identify an individual in a dataset
 DOB+Sex+ZIP unique for 87% of US Residents [Sweeney 02]
 Sensitive attributes (SA)—the associations we want to hide
 Salary in the “census” example is considered sensitive
 Not always well-defined: only some “search” queries sensitive
 In “video”, association between user and video is sensitive
 SA can be identifying: bonus may identify salary…
16
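The 87% uniqueness observation above can be checked mechanically: group the table by its quasi-identifier columns and look at the smallest group. A minimal Python sketch with purely illustrative rows (a table is k-anonymous with respect to these QIs only if every group has size at least k):

from collections import Counter

def min_group_size(rows, quasi_identifiers):
    # smallest number of rows sharing the same quasi-identifier combination
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(counts.values())

# Illustrative "census"-style rows; identifiers such as SSN have already been removed (Step 0)
rows = [
    {"DOB": "1970-05-02", "Sex": "F", "ZIP": "92697", "Salary": 55000},
    {"DOB": "1970-05-02", "Sex": "F", "ZIP": "92697", "Salary": 62000},
    {"DOB": "1982-11-20", "Sex": "M", "ZIP": "92617", "Salary": 48000},
]
print(min_group_size(rows, ["DOB", "Sex", "ZIP"]))   # 1 -> the third person is unique on DOB+Sex+ZIP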
Summary of Anonymization Motivation
 Anonymization needed for safe data sharing and retention
 Many legal requirements apply
 Various privacy definitions possible
 Primarily, prevent inference of sensitive information
 Under some assumptions of background knowledge
 Utility of the anonymized data needs to be carefully studied
 Different data types imply different classes of query
17
Privacy issues in data outsourcing (DAS) and
cloud computing applications
18
Motivation
19
Example: DAS - Secure outsourcing of data management
[Figure: the Data owner/Client connects over the Internet to the Server at the Service Provider, which hosts the DB]
Issues:
 Confidential information in the data needs to be protected
 Features – support queries on the data: SQL, keyword-based search queries, XPath queries, etc.
 Performance – bulk of the work should be done on the server; reduce communication overhead, client-side storage, and post-processing of results
23
Security model for DAS applications
Adversaries (A):
 Inside attackers: authorized users with malicious intent
 Outside attackers: hackers, snoopers
Attack models:
 Passive attacks: A wants to learn confidential information
 Active attacks: A wants to learn confidential information + actively
modifies data and/or queries
Trust on server:
 Untrusted: normal hardware, data & computation visible
 Semi-trusted: trusted co-processors + limited storage
 Trusted: All hardware is trusted & tamper-proof
24
Secure data storage & querying in DAS
[Figure: the Data owner/Client connects over the Internet to the Service Provider's Server]
R: Original Table at the client (plain text)
ssn   name   credit rating   salary   age
780   John   bad             34K      32
876   Mary   good            29K      40
:     :      :               :        :
Security concern: "ssn", "salary" & "credit rating" are confidential → encrypt the sensitive column values
How to execute queries on encrypted data? e.g. Select * from R where salary ∈ [25K, 35K]
Trivial solution: retrieve all rows to the client, decrypt them and check the predicate
We can do better: use secure indices for query evaluation on the server
25
Data storage
• Encrypt the rows
• Partition salary values into buckets (client-side meta-data, salary in K):
  B0 = [0, 20), B1 = [20, 30), B2 = [30, 40), B3 = [40, 50]
• Index the etuples by their bucket labels
R: Original Table (plain text, client side)
ssn   name    sex      credit rating   sal   age
345   Tom     Male     Bad             34k   32
876   Mary    Female   Good            29k   40
234   Jerry   Male     Good            45k   34
780   John    Male     Bad             39k   33
Encrypt →
RS: Server-side Table (encrypted + indexed)
etuple                        bucket
(^#&*%T%&4&7ERGTty^Q!%^&*     B2
&^$^G@UG^g&@^&&#G@@#(GW       B1
&*#($T%#$@$R@@$#@^FG$%&       B3
&*#($T%#$@$R@@$#@^FG$%&       B2
(A small Python sketch of this bucketized storage scheme follows this slide.)
26
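A minimal Python sketch of the storage step on this slide. The bucket boundaries follow the figure (salary in K); the Fernet recipe from the `cryptography` package stands in for whatever cipher the data owner really uses, and the row format is an assumption for illustration.

import json
from cryptography.fernet import Fernet   # stand-in cipher; any client-side encryption would do

BUCKETS = {"B0": (0, 20), "B1": (20, 30), "B2": (30, 40), "B3": (40, 50)}   # client-side meta-data

def bucket_of(sal_k):
    # map a plaintext salary (in K) to its bucket label
    for label, (lo, hi) in BUCKETS.items():
        if lo <= sal_k < hi:
            return label
    raise ValueError("salary outside the bucketized domain")

def store(rows, cipher):
    # client side: encrypt each whole row (the etuple) and tag it only with its bucket label
    return [{"etuple": cipher.encrypt(json.dumps(r).encode()), "bucket": bucket_of(r["sal"])}
            for r in rows]

cipher = Fernet(Fernet.generate_key())           # the key never leaves the data owner
R = [{"ssn": 345, "name": "Tom", "sal": 34}, {"ssn": 876, "name": "Mary", "sal": 29},
     {"ssn": 234, "name": "Jerry", "sal": 45}, {"ssn": 780, "name": "John", "sal": 39}]
RS = store(R, cipher)                            # the server sees only ciphertext plus bucket labels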
Querying encrypted data
Client-side query: Select * from R where sal ∈ [25K, 35K]
Using the client-side bucket meta-data (B0 = [0, 20), B1 = [20, 30), B2 = [30, 40), B3 = [40, 50]), the client rewrites this as the
Server-side query: Select etuple from RS where bucket = B1 ∨ bucket = B2
The server returns the etuples indexed under B1 and B2 (tables R and RS as on the previous slide); the client decrypts them and discards false positives – e.g. John's tuple (sal = 39k) falls in bucket B2 but does not satisfy the predicate.
(A small Python sketch of this query rewriting and post-filtering follows this slide.)
27
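Continuing the previous sketch (BUCKETS, RS and cipher as defined there), this is a minimal Python sketch of the query rewriting and client-side post-filtering for the range query on this slide; the overlap test and helper names are assumptions for illustration.

import json

def overlapping_buckets(lo, hi, buckets):
    # client-side rewrite: which bucket labels can contain a salary in [lo, hi]?
    return {label for label, (b_lo, b_hi) in buckets.items() if b_lo <= hi and lo < b_hi}

def server_select(RS, labels):
    # server side: a plain lookup on the bucket index, no decryption needed
    return [row["etuple"] for row in RS if row["bucket"] in labels]

def client_postfilter(etuples, cipher, lo, hi):
    # client side: decrypt the returned etuples and discard false positives
    rows = [json.loads(cipher.decrypt(e)) for e in etuples]
    return [r for r in rows if lo <= r["sal"] <= hi]

# BUCKETS, RS and cipher come from the previous sketch's store() step
labels = overlapping_buckets(25, 35, BUCKETS)                         # {"B1", "B2"}
answer = client_postfilter(server_select(RS, labels), cipher, 25, 35)
print([r["name"] for r in answer])    # ['Tom', 'Mary'] -- John's 39k tuple was a false positive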
Problems to address
 Security analysis
 Goal: To hide away the confidential information in data
from server-side adversaries (DB admins etc.)
 Quantitative measures of disclosure-risk
 Quality of partitioning (bucketization)
 Data partitioning schemes
 Cost measures
 Tradeoff
 Balancing the two competing goals of security &
performance
Continued later …
28
Privacy in Cloud Computing
 What is cloud computing?
 Many definitions exist
 "Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction." [NIST]
 "Clouds are a large pool of easily usable and accessible virtualized resources (such as hardware, development platforms and/or services). These resources can be dynamically re-configured to adjust to a variable load (scale), allowing also for an optimum resource utilization. This pool of resources is typically exploited by a pay-per-use model in which guarantees are offered by the Infrastructure Provider by means of customized service-level agreements." [Luis M. Vaquero et al., Madrid Spain]
29
Privacy in Cloud Computing
 Actors
 Service Providers – provide software services (Ex: Google, Yahoo, Microsoft, IBM, etc.)
 Service Users – personal, business, government
 Infrastructure Providers – provide the computing infrastructure required to host services
 Three cloud services
 Cloud Software as a Service (SaaS) – use the provider's applications over a network
 Cloud Platform as a Service (PaaS) – deploy customer-created applications to a cloud
 Cloud Infrastructure as a Service (IaaS) – rent processing, storage, network capacity, and other fundamental computing resources
30
Privacy in Cloud Computing
 Examples of cloud computing services
 Web-based email
 Photo storing
 Spreadsheet applications
 File transfer
 Online medical record storage
 Social network applications
31
Privacy in Cloud Computing
 Privacy issues in cloud computing
 The cloud increases security and privacy risks
 Data
 Creation, storage, communication – growing at an exponential rate
 Data replicated across large geographic distances
 Data contain personally identifiable information
 Data stored at untrusted hosts
 This creates enormous risks for data privacy
 Loss of control of sensitive data
 Risk of sharing sensitive data with marketing
 Other problem: technology is ahead of the law
 Does the user or the hosting company own the data?
 Can the host deny a user access to their own data?
 If the host company goes out of business, what happens to the users' data it holds?
 How does the host protect the user's data?
32
Privacy in Cloud Computing
 Solutions
 The cloud does not offer any privacy
 Awareness
 Some effort
 Effort
 ACM Cloud Computing Security Workshop, November 2009
 ACM Symposium on Cloud Computing, June 2010
 Privacy in cloud computing at UCI
 Recently launched a project on privacy preservation in cloud computing
 General approach: personal privacy middleware
33
Privacy preservation in Pervasive Spaces
34
Privacy in data sharing and exchange
40
Extra material
41
Example: Detecting a pre-specified set of events
 "Just like a coffee room!!" – but no ordinary coffee room, one that is monitored!
 There are rules that apply
 If a rule is violated, penalties may be imposed
 But all is not unfair: individuals have a right to privacy!
 "Until an individual has had more than his quota of coffee, his identity will not be revealed"
42
Issues to be addressed
 Modeling pervasive spaces: How to capture events of interest
 E.g., “Tom had his 4th cup of coffee for the day”
 Privacy goal: Guarantee anonymity to individuals
 What are the necessary and sufficient conditions?
 Solution
 Design should satisfy the necessary and sufficient
conditions
 Practical/scalable
43
Basic events, Composite events & Rules
 Model of pervasive space: a pervasive space with sensors that generate a stream of basic events, e.g.:
   e1: <Tom, coffee-room, *, enter>
   e2: <Tom, coffee-room, coffee-cup, dispense>
   :
   ek: <Bill, coffee-room, coffee-maker, exit>
 Composite event: a sequence of one or more basic events
 Rule: (Composite event, Action)
 Rules apply to groups of individuals, e.g.:
   Coffee room rules apply to everyone
   Server room rule applies to everyone except administrators, etc.
44
Composite-events & automaton templates
Composite-event templates
 "A student drinks more than 3 cups of coffee"
   e1 ≡ <u ∈ STUDENT, coffee_room, coffee_cup, dispense>
   Automaton: S0 -e1-> 1 -e1-> 2 -e1-> 3 -e1-> SF, with a ¬e1 self-loop on each non-final state
 "A student tries to access the IBM machine in the server room"
   e1 ≡ <u ∈ STUDENT, server_room, *, entry>
   e2 ≡ <ū, server_room, *, exit>
   e3 ≡ <ū, server_room, IBM-mc, login-attempt>
   Automaton: S0 -e1-> 1, 1 -e3-> SF, 1 -e2-> S0, with a ¬(e3 ∨ e2) self-loop on state 1
(A small Python sketch of the first template follows this slide.)
45
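The first template above is just a small deterministic automaton driven by matching events. A minimal Python sketch (the event-tuple format follows the earlier basic-event examples; the STUDENT set and the idea of keeping one instance per monitored individual are illustrative assumptions):

STUDENTS = {"Tom", "Bill"}

def e1(event):
    # e1: a student gets a coffee cup dispensed in the coffee room
    user, location, obj, action = event
    return user in STUDENTS and (location, obj, action) == ("coffee-room", "coffee-cup", "dispense")

class CoffeeQuotaAutomaton:
    # S0 -e1-> 1 -e1-> 2 -e1-> 3 -e1-> SF; non-matching events (not e1) leave the state unchanged
    FINAL = 4                                   # SF: more than 3 cups observed

    def __init__(self):
        self.state = 0                          # S0

    def advance(self, event):
        if self.state < self.FINAL and e1(event):
            self.state += 1
        return self.state == self.FINAL         # True once the composite event has fired

automaton = CoffeeQuotaAutomaton()              # in practice, one instance per monitored individual
fired = False
for ev in [("Tom", "coffee-room", "coffee-cup", "dispense")] * 4:
    fired = automaton.advance(ev)
print(fired)                                    # True: the 4th cup pushes the automaton into SF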
System architecture & adversary
[Figure: Secure Sensor Nodes (SSNs) capture basic events in the space and communicate, through a thin trusted middleware that obfuscates the origin of events, with the Server, which stores the Rules DB and the (encrypted) automaton state information]
Basic assumptions about SSNs
 Trusted hardware (sensors are tamper-proof)
 Secure data capture & generation of basic events by the SSN
 Limited computation + storage capacity: can carry out encryption/decryption with a secret key common to all SSNs, and automaton transitions
46
Privacy goal & Adversary’s knowledge
Ensure k-anonymity for each individual
(k-anonymity is achieved when each individual is indistinguishable from at
least k-1 other individuals associated with the space )
Passive adversary (A): Server-side snooper who wants to deduce the
identity of the individual associated with a basic-event
 A knows all rules of the space & automaton structures
 A can observe all server-side activities
 A has unlimited computation power
Minimum requirement to ensure anonymity:
State information (automatons) is always kept encrypted on the server
47
Basic protocol
1. SECURE SENSOR NODE: generates a basic event e and sends an encrypted query for the automatons that make a transition on e
2. SERVER: returns the automatons that (possibly) match e (encrypted match)
3. SECURE SENSOR NODE: decrypts the automatons, advances their state if necessary, associates an encrypted label with the new state, and writes back the encrypted automatons
4. SERVER: stores the updated automatons
(A small Python sketch of one round of this protocol follows this slide.)
Question: Does encryption ensure anonymity?
NO! – the pattern of automaton access may reveal identity
48
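A minimal Python sketch of one round of the basic protocol above. The message format, the use of the `cryptography` package's Fernet cipher as the secret key shared by the SSNs, the stand-in transition test, and the in-memory dict that plays the server's state table are all illustrative assumptions.

import json
from cryptography.fernet import Fernet

cipher = Fernet(Fernet.generate_key())   # secret key common to all secure sensor nodes (SSNs)

# Server side: opaque blobs only, keyed by labels the server cannot interpret
server_store = {"row-17": cipher.encrypt(json.dumps({"rule": "coffee-quota", "state": 2}).encode())}

def server_fetch(labels):
    # server: return the automatons that (possibly) match, without decrypting anything
    return {lbl: server_store[lbl] for lbl in labels}

def ssn_round(event, candidate_labels):
    # SSN: decrypt, advance the state if the event matches, re-encrypt, write back
    for lbl, blob in server_fetch(candidate_labels).items():
        automaton = json.loads(cipher.decrypt(blob))
        if event["action"] == "dispense":        # stand-in for the real transition predicate
            automaton["state"] += 1
        server_store[lbl] = cipher.encrypt(json.dumps(automaton).encode())   # server stores the update

ssn_round({"user": "Tom", "action": "dispense"}, ["row-17"])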
Example
Rules applicable to Tom:
 R1: U enters kitchen, U takes coffee
 R2: U enters kitchen, U opens fridge
 R3: U enters kitchen, U opens microwave
Tom enters the kitchen → 3 firings
Rules applicable to Bill:
 R1: U enters kitchen, U takes coffee
 R2: U enters kitchen, U opens fridge
Bill enters the kitchen → 2 firings
On an event, the number of rows retrieved from the state table can disclose the identity of the individual (see the sketch after this slide)
49
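A tiny sketch of the leak illustrated above (rule sets taken from the example): counting how many encrypted rows the server hands back on a "kitchen entry" event is enough to tell Tom and Bill apart, even though every row is encrypted.

RULES = {
    "Tom":  ["R1", "R2", "R3"],    # all three kitchen rules apply to Tom
    "Bill": ["R1", "R2"],          # only two apply to Bill
}

def rows_retrieved_on_kitchen_entry(user):
    # every rule above starts with "U enters kitchen", so all of the user's automatons are fetched
    return len(RULES[user])

print({u: rows_retrieved_on_kitchen_entry(u) for u in RULES})   # {'Tom': 3, 'Bill': 2}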
Characteristic access patterns of automatons
The set of rules applicable to an individual may be unique → it can potentially identify the individual
Rules applicable to Tom (from the figure):
 x: Tom enters kitchen, Tom takes coffee
 y: Tom enters kitchen, Tom leaves coffee pot empty / Tom leaves fridge open
 z: Tom enters kitchen, Tom opens fridge
Characteristic patterns of x:  P1: {x,y,z} {x,y}
Characteristic patterns of y:  P2: {x,y,z} {x,y} {y}   P3: {x,y,z} {y,z} {y}
Characteristic patterns of z:  P4: {x,y,z} {y,z}
The characteristic access patterns of rows can potentially reveal the identity of the automaton in spite of encryption
50
Solution scheme
 Formalized the notion of indistinguishability of automatons in terms of their access patterns
 Identified "event clustering" as a mechanism for inducing indistinguishability, i.e., for achieving k-anonymity
 Proved the difficulty of checking for k-anonymity
 Characterized the class of event-clustering schemes that achieve k-anonymity
 Proposed an efficient clustering algorithm to minimize the average execution overhead of the protocol
 Implemented a prototype system
 Challenges:
   Designing a truly secure sensing infrastructure is challenging
   Key management issues
   Are there other interesting notions of privacy in pervasive spaces?
(An illustrative sketch of the event-clustering intuition follows at the end.)
51
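The slides only name "event clustering" without spelling it out; the sketch below is therefore just an illustration of the intuition (the greedy grouping policy, the data layout, and the names are all assumptions, not the actual algorithm): automatons are packed into clusters that each cover at least k distinct individuals, and every retrieval fetches a whole cluster, so any member's access pattern is shared with at least k-1 other individuals.

def cluster_automatons(automatons, k):
    # automatons: list of (label, owner) pairs.
    # Greedily pack them into clusters whose members cover at least k distinct owners.
    clusters, current, owners = [], [], set()
    for label, owner in automatons:
        current.append(label)
        owners.add(owner)
        if len(owners) >= k:
            clusters.append(current)
            current, owners = [], set()
    if current:                                   # leftovers join the last full cluster
        if clusters:
            clusters[-1].extend(current)
        else:
            clusters.append(current)
    return clusters

clusters = cluster_automatons([("a1", "Tom"), ("a2", "Bill"), ("a3", "Tom"), ("a4", "Ann")], k=2)
print(clusters)   # [['a1', 'a2'], ['a3', 'a4']] -- every fetch now mixes two individuals' automatons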