Differential Privacy on Linked Data: Theory and Implementation
Yotam Aron
Table of Contents
• Introduction
• Differential Privacy for Linked Data
• SPIM implementation
• Evaluation
Contributions
• Theory: how to apply differential privacy to
linked data.
• Implementation: privacy module for SPARQL
queries.
• Experimental evaluation: differential privacy on
linked data.
Introduction
Overview: Privacy Risk
• Statistical data can leak privacy.
• Mosaic Theory: Different data sources harmful
when combined.
• Examples:
• Netflix Prize Data set
• GIC Medical Data set
• AOL Data logs
• Linked data has added ontologies and meta-data,
making it even more vulnerable.
Current Solutions
• Accountability:
• Privacy Ontologies
• Privacy Policies and Laws
• Problems:
• Requires agreement among parties.
• Does not actually prevent breaches, just a deterrent.
Current Solutions (Cont’d)
• Anonymization
• Delete “private” data
• k-anonymity (Strong Privacy Guarantee)
• Problems
• Deletion provides no strong guarantees
• Must be carried out for every data set
• What data should be anonymized?
• High computational cost (k-anonymity is NP-hard)
Differential Privacy
• Definition for relational databases (from
PINQ paper):
A randomized function K gives ε-differential privacy if, for all data sets D1 and D2 differing on at most one record, and for all S ⊆ Range(K),
Pr[K(D1) ∈ S] ≤ exp(ε) × Pr[K(D2) ∈ S]
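For example (illustrative numbers, not from the slides): with ε = 0.1, exp(ε) ≈ 1.105, so adding or removing any single record can change the probability that the mechanism's output lands in any set S by at most a factor of about 1.105.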
Differential Privacy
• What does this mean?
• Adversaries get roughly the same results from D1 and D2, meaning a single individual's data will not greatly affect the knowledge they acquire from each data set.
How Is This Achieved?
• Add noise to result.
• Simplest: Add Laplace noise
Laplace Noise Parameters
• Mean = 0 (so no bias is added)
• Scale = ΔQ / ε, where the sensitivity ΔQ is defined, over records j, as
  ΔQ = max_j |F(D) − F(D − j)|
  (the largest change in the answer F when any single record j is removed from D)
• Theorem: For a query Q with result R, the output R + Laplace(0, ΔQ/ε) is ε-differentially private.
Other Benefit of Laplace Noise
• A set of queries, each using a privacy parameter ε_i, has an overall privacy cost of Σ_i ε_i (composition).
• Implementation-wise, a client can be allocated a “budget” ε; for each query the client specifies an ε_i to spend, and queries are refused once Σ_i ε_i would exceed ε.
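For example (illustrative numbers): a client given an overall budget ε = 1.0 could spend ε_1 = 0.4 on one query and ε_2 = 0.6 on a second; a third statistical query would then be refused because the budget is exhausted.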
Benefits of Differential Privacy
• Strong privacy guarantee
• Mechanism-based, so you don't have to modify the data.
• Independent of the data set's structure.
• Works well with statistical analysis algorithms.
Problems with Differential
Privacy
• Potentially poor performance
• Complexity
• Noise
• Only works with statistical data (though this has fixes)
• How to calculate the sensitivity of an arbitrary query without brute force?
Theory: Differential Privacy for
Linked Data
Differential Privacy and Linked Data
• Want the same privacy guarantees for linked data, but there are no “records.”
• What should be the “unit of difference”?
• One triple
• All URIs related to a person's URI
• All links going out from a person's URI
“Records” for Linked Data
• Reduce links in graph to attributes
• Idea:
• Identify individual contributions from a single individual to total
answer.
• Find contribution that affects answer most.
“Records” for Linked Data
• Reducing the links in the graph to attributes turns them into a record.
[Graph: P1 --Knows--> P2]

Person   Knows
P1       P2
“Records” for Linked Data
• Repeated attributes and null values allowed
[Graph: P1 --Knows--> P2, P1 --Loves--> P4, P3 --Knows--> P2, P3 --Knows--> P4]
“Records” for Linked Data
• Repeated attributes and null values allowed (not good RDBMS form
but makes definitions easier)
Person   Knows   Knows   Loves
P1       P2      Null    P4
P3       P2      P4      Null
Query Sensitivity in Practice
• Need to find triples that “belong” to a person.
• Idea:
• Identify individual contributions from a single individual to total
answer.
• Find contribution that affects answer most.
• Done using sorting and limiting functions in SPARQL
Example
• COUNT of places visited
[Graph: persons P1 and P2 with State of Residence MA, and Visited edges to sites S1, S2, S3]
Answer: Sensitivity of 2
Using SPARQL
• Query:
SELECT (COUNT(?s) as ?num_places_visited) WHERE {
  ?p :visited ?s }
Using SPARQL
• Sensitivity Calculation Query (Ideally):
SELECT ?p (COUNT(?s) as ?num_places_visited)
WHERE {
  ?p :visited ?s .
  ?p foaf:name ?n }
GROUP BY ?p ORDER BY DESC(?num_places_visited) LIMIT 1
In reality…
• LIMIT, ORDER BY, and GROUP BY don't work together in 4store…
• For now: don't use LIMIT and get the top answers manually.
  • i.e., simulate these keywords in Python (see the sketch below)
• This will affect results, so better testing should be carried out in the future.
• Ideally this would stay on the SPARQL side so that less data is transmitted (e.g. on large data sets).
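A minimal sketch of that client-side workaround, assuming the un-grouped query has already been run and returns (person, value) pairs (the row format is an assumption, not the actual SPIM code):

from collections import defaultdict

def most_sensitive_contribution(rows):
    """Simulate GROUP BY ?p / ORDER BY DESC(...) / LIMIT 1 on the client side.

    rows -- (person, value) pairs from the un-grouped SPARQL query,
            e.g. one pair per :visited triple.
    """
    per_person = defaultdict(float)
    for person, value in rows:
        per_person[person] += value          # GROUP BY ?p (SUM/COUNT-style aggregate)
    # ORDER BY DESC(...) LIMIT 1: the single largest per-person contribution
    return max(per_person.items(), key=lambda kv: kv[1])

# Example matching the earlier slide: P1 visited 2 places, P2 visited 1
print(most_sensitive_contribution([("P1", 1), ("P1", 1), ("P2", 1)]))  # ('P1', 2.0)

The trade-off is exactly the one noted above: all rows are transferred to the client instead of letting the triplestore aggregate them.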
(Side rant) 4store limitations
• Many operations not supported in unison
  • E.g. cannot always FILTER and use ORDER BY, for some reason
• Severely limits the types of queries I could use to test.
• May be desirable to work with a different, more up-to-date triplestore (ARQ).
  • Didn't, because I wanted to keep the code in Python.
  • Also had already written all the code for 4store.
Problems with this Approach
• Need to identify “people” in graph.
• Assume, for example, that URI with a foaf:name is a person and
use its triples in privacy calculations.
• Imposes some constraints on linked data format for this to work.
• For future work, see if there's a way to automatically identify private data, maybe by using ontologies.
• Complexity is tied to speed of performing query over large
data set.
• Still not generalizable to all functions.
…and on the Plus Side
• Model for sensitivity calculation can be expanded to arbitrary
statistical functions.
• e.g. dot products, distance functions, variance, etc.
• Relatively simple to implement using SPARQL 1.1
Implementation: Design of
Privacy System
SPARQL Privacy Insurance
Module
• i.e. SPIM
• Use authentication, AIR, and differential privacy in one system.
• Authentication to manage Ɛ-budgets.
• AIR to control flow of information and non-statistical data.
• Differential privacy for statistics.
• Goal: Provide a module that can integrate into SPARQL 1.1
endpoints and provide privacy.
Design
[Architecture diagram: an HTTP Server with OpenID Authentication sits in front of the SPIM Main Process, which uses the Differential Privacy Module, the AIR Reasoner (with its Privacy Policies), and the Triplestore (holding the User Data)]
HTTP Server and
Authentication
• HTTP Server: Django server that handles HTTP requests.
• OpenID Authentication: Django module.
SPIM Main Process
• Controls flow of information.
• First checks user’s budget, then
uses AIR, then performs final
differentially-private query.
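A sketch of that control flow (illustrative only; the helper stubs below stand in for the real budget store, AIR reasoner, and 4store endpoint, and are not the actual SPIM API):

import numpy as np

budgets = {"alice": 1.0, "bob": 0.5}            # toy per-user epsilon budgets

def air_policy_allows(user, query):
    # Stand-in for the AIR reasoner: e.g. reject queries touching SSNs.
    return "ssn" not in query.lower()

def run_sparql(query):
    # Stand-in for the real SPARQL endpoint; returns a fake numeric result.
    return 42.0

def estimate_sensitivity(query):
    # Stand-in for the re-written sensitivity-calculation query.
    return 1.0

def handle_query(user, query, epsilon_i):
    """Budget check, then AIR policy check, then differentially private query."""
    if epsilon_i <= 0 or epsilon_i > budgets.get(user, 0.0):
        raise PermissionError("epsilon budget exceeded")
    if not air_policy_allows(user, query):
        raise PermissionError("query rejected by AIR policy")
    sensitivity = estimate_sensitivity(query)
    budgets[user] -= epsilon_i
    noise = np.random.laplace(0.0, sensitivity / epsilon_i)
    return run_sparql(query) + noise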
AIR Reasoner
• Performs access control by translating SPARQL queries to N3 and checking them against the privacy policies.
• Can potentially perform more complicated operations (e.g. checking user credentials).
Differential Privacy Protocol
Scenario: The client wishes to make a standard SPARQL 1.1 statistical query. The client has an ε “budget” of overall accuracy for all queries.
[Diagram: Client <-> Differential Privacy Module <-> SPARQL Endpoint]
Differential Privacy Protocol
Step 1: The query and an epsilon value (ε > 0) are sent to the endpoint and intercepted by the enforcement module.
[Diagram: Client --(Query, ε > 0)--> Differential Privacy Module]
Differential Privacy Protocol
Step 2: The sensitivity of the query is calculated using a re-written, related query.
[Diagram: Differential Privacy Module --(Sensitivity Query)--> SPARQL Endpoint]
Differential Privacy Protocol
Step 3: The actual query is sent.
[Diagram: Differential Privacy Module --(Query)--> SPARQL Endpoint]
Differential Privacy Protocol
Step 4: The result with Laplace noise is sent back to the client.
[Diagram: Differential Privacy Module --(Result + Noise)--> Client]
Experimental Evaluation
Evaluation
• Three things to evaluate:
• Correctness of operation
• Correctness of differential privacy
• Runtime
• Used an anonymized clinical database as the test data and
added fake names, social security numbers, and addresses.
Correctness of Operation
• Can the system do what we want?
• Authentication provides access control
• AIR restricts information and types of queries
• Differential privacy gives strong privacy guarantees.
• Can we do better?
Use Case Used in Thesis
• Clinical database data protection
• HIPAA: Federal protection of private information fields, such
as name and social security number, for patients.
• 3 users
• Alice: Works in CDC, needs unhindered access
• Bob: Researcher that needs access to private fields (e.g.
addresses)
• Charlie: Amateur researcher to whom HIPAA should apply
• Assumptions:
• Django is secure enough to handle “clever attacks”
• Users do not collude, so can allocate individual epsilon values.
Use Case Solution Overview
• What should happen:
• Dynamically apply different AIR policies at runtime.
• Give different epsilon-budgets.
• How allocated:
• Alice: No AIR Policy, no noise.
• Bob: Give access to addresses but hide all other private
information fields.
• Epsilon budget: E1
• Charlie: Hide all private information fields in accordance with
HIPAA
• Epsilon budget: E2
Example: A Clinical Database
• Client accesses the triplestore via the HTTP server.
• OpenID Authentication verifies the user has access to the data and finds the user's epsilon value.
Example: A Clinical Database
• AIR reasoner checks incoming
queries for HIPAA violations.
• Privacy policies contain HIPAA
rules.
Example: A Clinical Database
• Differential Privacy applied to
statistical queries.
• Statistical result + noise
returned to client.
Correctness of Differential
Privacy
• Need to test how much noise is added.
• Too much noise = poor results.
• Too little noise = no guarantee.
• Test: Run queries and look at sensitivity calculated vs. actual
sensitivity.
How to test sensitivity?
• Ideally:
  • Test that the noise calculation is correct
  • Test that the noise keeps the data useful (e.g. by applying machine learning algorithms).
• For this project, just tested the former
  • Machine learning APIs are not as prevalent for linked data.
• What results to compare to?
Test suite
• 10 queries for each operation (COUNT, SUM, AVG, MIN, MAX)
• 10 different WHERE clauses
• Test:
  • Sensitivity calculated from the original query
  • Remove each personal URI using the “MINUS” keyword and see which removal is most sensitive
Example for Sens Test
• Query:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1#>
PREFIX mimic:
<http://air.csail.mit.edu/spim_ontologies/mimicOntology#>
SELECT (SUM(?o) as ?aggr) WHERE{
?s foaf:name ?n.
?s mimic:event ?e.
?e mimic:m1 "Insulin".
?e mimic:v1 ?o.
FILTER(isNumeric(?o))
}
Example for Sens Test
• Sensitivity query:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1#>
PREFIX mimic:
<http://air.csail.mit.edu/spim_ontologies/mimicOntology#>
SELECT (SUM(?o) as ?aggr) WHERE{
?s foaf:name ?n.
?s mimic:event ?e.
?e mimic:m1 "Insulin".
?e mimic:v1 ?o.
FILTER(isNumeric(?o))
MINUS {?s foaf:name "%s"}
} % (name)
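A sketch of how the test harness can drive this template from Python (illustrative only; the endpoint URL is a placeholder and SPARQLWrapper stands in for the actual test code):

from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://localhost:8000/sparql/"   # placeholder 4store endpoint URL

SENS_QUERY_TEMPLATE = """
PREFIX foaf: <http://xmlns.com/foaf/0.1#>
PREFIX mimic: <http://air.csail.mit.edu/spim_ontologies/mimicOntology#>
SELECT (SUM(?o) as ?aggr) WHERE{
  ?s foaf:name ?n.
  ?s mimic:event ?e.
  ?e mimic:m1 "Insulin".
  ?e mimic:v1 ?o.
  FILTER(isNumeric(?o))
  MINUS {?s foaf:name "%s"}
}"""

def run_aggregate(query):
    """Run a single-value aggregate query and return the result as a float."""
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    return float(bindings[0]["aggr"]["value"]) if bindings else 0.0

def brute_force_sensitivity(full_result, names):
    """Actual sensitivity: the largest change from removing any single person."""
    return max(abs(full_result - run_aggregate(SENS_QUERY_TEMPLATE % name))
               for name in names)

This gives the “actual sensitivity” side of the comparison; the calculated sensitivity comes from the module's own re-written query.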
Results Query 6 - Error
Runtime
• Queries were also tested for runtime.
• Bigger WHERE clauses
• More keywords
• Extra overhead of doing the calculations.
Results Query 6 - Runtime
Interpretation
• Sensitivity calculation time is on par with query time
  • Might not be good for big data
  • Find ways to reduce sensitivity calculation time?
• AVG does not do so well…
  • Approximation yields too much noise vs. trying all possibilities
  • Runs ~4x slower than simple querying
  • Solution 1: Look at all data manually (large data transfer)
  • Solution 2: Can we use NOISY_SUM / NOISY_COUNT instead? (see the sketch below)
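A sketch of the NOISY_SUM / NOISY_COUNT idea (illustrative only; whether this actually gives better accuracy here would need the further testing mentioned above):

import numpy as np

def noisy_avg(values, sum_sensitivity, count_sensitivity, epsilon):
    """Estimate AVG as NOISY_SUM / NOISY_COUNT, splitting epsilon between them.

    sum_sensitivity   -- max change in SUM when one person's data is removed
    count_sensitivity -- max change in COUNT when one person's data is removed
    """
    eps_each = epsilon / 2.0                       # simple even split of the budget
    noisy_sum = sum(values) + np.random.laplace(0.0, sum_sensitivity / eps_each)
    noisy_count = len(values) + np.random.laplace(0.0, count_sensitivity / eps_each)
    return noisy_sum / max(noisy_count, 1.0)       # guard against a tiny noisy count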
Conclusion
Contributions
• Theory on how to apply differential privacy to linked data.
• Overall privacy module for SPARQL queries.
• Limited but a good start
• Experimental implementation of differential privacy.
• Verification that it is applied correctly.
• Other:
• Updated the SPARQL-to-N3 translation to SPARQL version 1.1
• Expanded upon IARPA project to create policies against statistical
queries.
Shortcomings and Future Work
• Triplestores need some structure for this to work
• Personal information must be explicitly defined in triples.
• Is there a way to automatically detect what triples would
constitute private information?
• Complexity
• Lots of noise for sparse data.
• Can divide data into disjoint sets to reduce noise like PINQ does
• Use localized sensitivity measures?
• Third party software problems
• Would this work better using a different Triplestore
implementation?
Diff. Privacy and an Open Web
• How applicable is this to an open web?
• High sample numbers, but potentially high data variance.
• Sensitivity calculation might take too long, need to approximate.
• Can use disjoint subsets of the web to increase the number of queries allowed under the ε budgets.
Demo
• air.csail.mit.edu:8800/spim_module/
References
• Differential Privacy Implementations:
• “Privacy Integrated Queries (PINQ)” by Frank McSherry:
http://research.microsoft.com/pubs/80218/sigmod115mcsherry.pdf
• “Airavat: Security and Privacy for MapReduce” by Roy, Indrajit; Setty, Srinath T. V.; Kilzer, Ann; Shmatikov, Vitaly; and Witchel, Emmett: http://www.cs.utexas.edu/~shmat/shmat_nsdi10.pdf
• “Towards Statistical Queries over Distributed Private User Data” by Chen, Ruichuan; Reznichenko, Alexey; Francis, Paul; and Gehrke, Johannes: https://www.usenix.org/conference/nsdi12/towards-statistical-queries-over-distributed-private-user-data
References
• Theoretical Work
• “Differential Privacy” by Cynthia Dwork:
http://research.microsoft.com/pubs/64346/dwork.pdf
• “Mechanism Design via Differential Privacy” by McSherry, Frank;
and Talwar, Kunal:
http://research.microsoft.com/pubs/65075/mdviadp.pdf
• “Calibrating Noise to Sensitivity in Private Data Analysis” by
Dwork, Cynthia; McSherry, Frank; Nissim, Kobbi; and Smith,
Adam: http://people.csail.mit.edu/asmith/PS/sensitivity-tccfinal.pdf
• “Differential Privacy for Clinical Trial Data: Preliminary Evaluations” by Vu, Duy; and Slavković, Aleksandra: http://sites.stat.psu.edu/~sesa/Research/Papers/padm09sesaSep24.pdf
References
• Other
• “Privacy Concerns of FOAF-Based Linked Data” by Nasirifard,
Peyman; Hausenblas, Michael; and Decker, Stefan:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.153.5772
• “The Mosaic Theory, National Security, and the Freedom of
Information Act”, by David E. Pozen
http://www.yalelawjournal.org/pdf/115-3/Pozen.pdf
• “A Privacy Preference Ontology (PPO) for Linked Data” by Sacco, Owen; and Passant, Alexandre: http://ceur-ws.org/Vol-813/ldow2011-paper01.pdf
• “k-Anonymity: A Model for Protecting Privacy” by Latanya Sweeney: http://arbor.ee.ntu.edu.tw/archive/ppdm/Anonymity/SweeneyKA02.pdf
References
• Other
• “Approximation Algorithms for k-Anonymity” by Aggarwal, Gagan;
Feder, Tomas; Kenthapadi, Krishnaram; Motwani, Rajeev;
Panigraphy, Rina; Thomas, Dilys; and Zhu, An:
http://research.microsoft.com/pubs/77537/k-anonymity-jopt.pdf
Appendix: Results Q1, Q2
Q1       Error      Query_Time     Sens_Calc_Time
COUNT    0          0.020976       0.05231

Q2       Error      Query_Time     Sens_Calc_Time
COUNT    0          0.015823126    0.011798859
SUM      0          0.010298967    0.01198101
AVG      868.8379   0.010334969    0.04432416
MAX      0          0.010645866    0.012124062
MIN      0          0.010524988    0.012120962
Appendix: Results Q3, Q4
Q3       Error      Query_Time     Sens_Calc_Time
COUNT    0          0.007927895    0.00800705
SUM      0          0.007529974    0.007997036
AVG      375.8253   0.00763011     0.030416012
MAX      0          0.007451057    0.008117914
MIN      0          0.007512093    0.008100986

Q4       Error      Query_Time     Sens_Calc_Time
COUNT    0          0.01048708     0.012546062
SUM      0          0.01123786     0.012809038
AVG      860.91     0.011286974    0.048202038
MAX      0          0.01145792     0.01297307
MIN      0          0.011392117    0.012881041
Appendix: Results Q5, Q6
Q5       Error      Query_Time     Sens_Calc_Time
COUNT    0          0.08081007     0.098078012
SUM      0          0.085678816    0.097680092
AVG      115099.5   0.087270975    0.373119116
MAX      0          0.084903955    0.097922087
MIN      0          0.083213806    0.098366022

Q6       Error      Query_Time     Sens_Calc_Time
COUNT    0          0.136605978    0.153807878
SUM      0          0.139995098    0.155878067
AVG      115118.4   0.139881134    0.616436958
MAX      0          0.148360014    0.160467148
MIN      0          0.144635916    0.158998966
Appendix: Results Q7, Q8
Q7       Error      Query_Time     Sens_Calc_Time
COUNT    0          0.006100178    0.004678965
SUM      0          0.004260063    0.004747868
AVG      0          0.004283905    0.017117977
MAX      0          0.004103184    0.004703999
MIN      0          0.004188061    0.004717112

Q8       Error      Query_Time     Sens_Calc_Time
COUNT    0          0.002182961    0.002643108
SUM      0          0.002092123    0.002592087
AVG      0          0.002075911    0.002662182
MAX      0          0.00207901     0.002576113
MIN      0          0.002048969    0.002597094
Appendix: Results Q9, Q10
Q9       Error      Query_Time     Sens_Calc_Time
COUNT    0          0.004920959    0.010298014
SUM      0          0.004822016    0.010312796
AVG      0.00037    0.004909992    0.024574041
MAX      0          0.004843235    0.01032114
MIN      0          0.004893064    0.010319948

Q10      Error      Query_Time     Sens_Calc_Time
COUNT    0          0.012365818    0.014447212
SUM      0          0.013066053    0.014631987
AVG      860.91     0.013166904    0.056000948
MAX      0          0.013354063    0.014893055
MIN      0          0.013329029    0.014914989