Minimum Exposure - Prof. Benjamin NGUYEN

The Minimum Exposure Framework
Demonstrating how to Limit Data Collection in Application Forms
Nicolas Anciaux, INRIA Rocquencourt
Benjamin Nguyen, INRIA Rocquencourt & U. of Versailles St Quentin
Michalis Vazirgiannis, LIX & Athens U. of Economics and Business
With
Wallid Bezza, Conseil General des Yvelines
Danae Boutara, Ecole Polytechnique
Bertrand Le Cun, U. of Paris-X Nanterre & PRiSM
Marouane Fazouane, ENSTA ParisTech
RENNES, May 23, 2013
Personal data kept by individuals

Observation:
• Individuals receive electronic official documents (from the hospital, the doctor’s office, the bank, the employer, the telco, …)
• Those documents are treasured by users
- On their PC, in the cloud, etc.
• Subsequently used as evidence

Application: services with complex decision processes
- E-administration: complex public decision processes
  e.g., pay taxes, GEVA / CG78, …
- Bank or insurance
  e.g., calibrate the rate and duration of a consumer loan, insurance pricing, …
- Access control decisions in open environments
  e.g., evaluation of ABAC policies
Current practices for User / Service Interaction

• Services (online or offline) provide forms (1) that users must fill in (2), using their personal information, in order to apply for services.
• This data is processed (3) automatically, semi-automatically or manually in order to provide a customized decision for each user application (4).
• Data is stored for accountability reasons.

[Diagram: Application Evaluation Process. The SP’s Form Generator sends an Empty Form (1); the user fills it from the User’s Store (Telco, Bank, Work, etc.) and returns the Complete Form (2); the Decision Making System evaluates it (3) and returns a Service Proposal (4); the SP’s Store keeps the data for audit and accountability.]

/!\ LIMITED DATA COLLECTION /!\
Limited Data Collection Principle: protect the user’s privacy

• A well-known privacy principle:
- Its goal is to limit the dissemination of personal information with regard to a purpose
- This principle is adopted worldwide.
• Limited Data Collection in privacy laws and directives:
- Australia: Privacy Act (1988)
- Canada: Personal Information Protection and Electronic Documents Act (PIPEDA) 2000
- EU: European 95/46/EC directive (and 2012-0011 (COD) General Data Protection
Regulation)
- Worldwide: OECD Guidelines (Prot. of Privacy and Transborder Flows of Personal Data)
1980
• A general feeling of over-disclosure:
Over 50% of Europeans feel they are asked for more data than necessary,
and 70% are concerned
Is limited collection also beneficial for service providers?
• Is there a financial cost if a company requests more information than required? (For companies, measured in $)
• According to recent studies (Ponemon Institute, Forrester Research,
2011 annual report)
- The frequency of breaches is high: 90% of companies in 2011
- The cost per breached tuple is huge: about $200 per tuple on average
- The cost depends on volume & type of content (social, health, finance, credit card, etc.), mainly on two dimensions:
  Ex-post response (20%): actions taken to help the victims minimize the harm
  Lost business (50%): direct consequence of the negative publicity
- The law makes the service provider responsible for the data breach
• CG78: manual checking of application forms (a time cost per data item)
The more data collected, the greater the cost of a data breach.
The more data collected, the greater the cost of processing.
The more data collected, the less privacy for the applicant.
The Minimum Exposure Principle
This work introduces the following principle (a strict interpretation of LDC):
“Only a minimum subset of the data required for any given purpose should be collected”

GOALS:
Minimality: with a strict understanding of LDC
Accountability: comprehensible information is needed to verify it
(e.g., cross-check with internal databases or copies of other official documents)
Broad spectrum: accommodate any kind of decision making technique
(i.e., binary, multi-class and multi-label classifiers)
Scalability: users must be able to handle an unlimited number of documents

Main problem (Minimality): it is difficult (NP-hard) to determine a priori what data is necessary.
Outline
1. Introduction
2. Minimum Exposure Architecture
3. Minimum Exposure Problem
4. Minimum Exposure Resolution
5. Experiments
6. Conclusion & Future Work
The Minimum Exposure Architecture

• To protect the privacy of an individual, we introduce a Service Application Process
• New modules:
- Collection Rules Generator
- Minimum Exposure
- Form Scoring
• NB: taking the decision at the client side and transmitting the results to the SP would not comply with the accountability requirement

[Diagram: unlike the “unlimited collection” approach, the SP’s Form Generator and Collection Rules Generator send an Empty Form plus the collection rules (1); the user fills the form from the User’s Store (documents signed by data producers (DP): Telco, Bank, Work, etc.), runs Minimum Exposure and Form Scoring locally, and returns a Blanked Form instead of the Complete Form (2); the SP’s Decision Making System evaluates it (3) and returns a Service Proposal (4); the SP’s Store keeps the data for audit and accountability.]
Private vs Public Collection Rules [EDBT 2013 (demo)]

• Decision processes must in general be
- Comprehensible by humans
- Justifiable
- Public
- E.g., tax services, social services, public health care system
• Decision processes are sometimes part of the business model (and thus secret), or intrinsically private.
- Introduction of a Trusted Third Party (TTP, trusted by both the SP and A)
- Can in general be executed on the SP side

[Diagram: the TTP receives the Complete Form, runs Minimum Exposure and Form Scoring, and forwards only the Blanked Form to the SP’s Application Evaluation Process, which returns the Service Proposal to A.]
Outline
1. Introduction
2. Minimum Exposure Architecture
3. Minimum Exposure Problem
4. Minimum Exposure Resolution
5. Experiments
6. Conclusion & Future Work
Ingredients: collection rules

• To be expressive enough, collection rules must cover classical classification scenarios
(complex decision making processes are based on multi-label decision trees)
• We consider several collection rules ri, each leading to a certain label li
• A collection rule ri is a disjunction of atomic rules aij
• Each atomic rule aij is a conjunction of predicates pijk
• The predicates are of the form (attribute θ value) with θ ∈ {<, =, >, ≤, ≥, ≠}
• E.g., higher loans are offered to wealthy customers:

Collection rule r1:
(year_income > $30K ∧ assets > $100K) ∨ (collateral > $50K ∧ life_insurance = ’yes’) → higher_loan
where p112 is a predicate (e.g., assets > $100K), a12 = (collateral > $50K ∧ life_insurance = ’yes’) is an atomic rule, and l1 = higher_loan is the label
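The rule structure above can be sketched in code. A minimal illustration (an assumed representation, not the paper's implementation): a predicate is an (attribute, operator, value) triple, an atomic rule is a conjunction (list) of predicates, and a collection rule is a disjunction (list) of atomic rules leading to a label.

```python
from operator import lt, gt, eq

# Illustrative sketch: other comparators (<=, >=, !=) extend OPS analogously.
OPS = {"<": lt, ">": gt, "=": eq}

def holds(pred, assertions):
    """A predicate holds iff an assertion attribute=value entails it."""
    attr, op, val = pred
    return attr in assertions and OPS[op](assertions[attr], val)

def rule_label(atomic_rules, label, assertions):
    """The label is granted iff at least one atomic rule is fully proven."""
    if any(all(holds(p, assertions) for p in atom) for atom in atomic_rules):
        return label
    return None

# Collection rule r1 from the slides, as a disjunction of two atomic rules:
r1 = [[("year_income", ">", 30_000), ("assets", ">", 100_000)],
      [("collateral", ">", 50_000), ("life_insurance", "=", "yes")]]
user = {"year_income": 35_000, "assets": 150_000}
print(rule_label(r1, "higher_loan", user))  # prints higher_loan
```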
Ingredients: user assertions

• The user produces assertions (i.e., signed predicates, corresponding to values in a document, or produced and endorsed by the user) which are exposed to prove the rules (via proofs of predicates)
- Assertions validate predicates in collection rules
- When one atomic rule is proven, the collection rule is also proven
  (and the associated benefit can be obtained)
• We focus on assertions of the form attribute = value
- Makes sense for data producers (no technical problem to sign values individually)
- We can easily extend to assertions of the form attribute θ value
Ingredients: exposure metrics

• From the user privacy perspective:
- Can be measured by information loss metrics
- E.g., minimal distortion [Samarati & Sweeney], ILoss [Xiao et al.]
• From the service provider perspective (cost of a data breach):
- Information quantity and type are meaningful (Forrester, Ponemon)
- Again, an information loss metric is enough
• Supported metrics:
- Any metric computing the cost using each data item independently
- Not captured: e.g., metrics based on exposure history, associations, …
• For simplicity, we take a basic data exposure measure:
  EX = |{ exposed assertions }|
• Other metrics are possible if they can be modeled by an objective function using only assertions as inputs.
Running example: consumer loan scenario

• Unconditionally: $5,000, 10% rate, 1-year duration, $50/month job loss protection
• Wealthy customers: higher loan of $10,000
  (income > $30,000 and assets > $100,000) or (collateral > $50,000 and life_insurance = ’yes’)
• Families and honest youngsters: part of the loan granted at 0% rate
• High-revenue families and low-risk people: longer duration of 2 years
• Rich families and promising young workers: 30% discount on job loss protection
Rules:
r1: (p1 ∧ p2) ∨ (p3 ∧ p4) → c1
r2: (p5 ∧ p6 ∧ p7) ∨ (p4 ∧ p8 ∧ p9) → c2
r3: (p1 ∧ p6 ∧ p7) ∨ (p2 ∧ p4 ∧ p10) → c3
r4: (p2 ∧ p5 ∧ p6 ∧ p7) ∨ (p1 ∧ p4 ∧ p8 ∧ p9) → c4
Predicates:
p1: year_income > $30,000      p2: assets > $100,000
p3: collateral > $50,000       p4: life_insurance = yes
p5: tax_rate > 10%             p6: married = true
p7: children > 0               p8: education_level = univ
p9: age < 30                   p10: insurance_claims < $5,000
Decisions:
c1 = higher_loan       c2 = lower_rate
c3 = longer_duration   c4 = lower_insurance
Assertions:
d1: year_income = $35,000      d2: assets = $150,000
d3: collateral = $75,000       d4: life_insurance = yes
d5: tax_rate = 11.5%           d6: married = true
d7: children = 1               d8: education_level = univ
d9: age = 25                   d10: insurance_claims = $250
The Minimum Exposure problem
• Form is a set of assertions (attribute=value) owned by the user
• R is the set of atomic rules, belonging to collection rules, that are validated by Form
- A predicate p is validated by Form iff ∃ d ∈ Form : d ⊨ p
• We build a Boolean formula as follows:
- we simplify the collection rules by removing the atomic rules ∉ R,
- we take the conjunction of those simplified collection rules,
- and we replace each predicate pijk with a Boolean variable B(pijk):

  ER = ⋀i ( ⋁j ( ⋀k B(pijk) ) )      [a Min (Weighted) SAT instance]

  with B(pijk) = true if a data item d ∈ Form with d ⊨ pijk is exposed, and false otherwise
• Computing the minimum exposure of Form means finding a truth assignment T of the variables B(pijk) such that ER = true and EX(Form | T) is minimal
• This optimisation problem is NP-hard
• … and we have bad complexity results
  (not in APX; differential approximation ratio of 0-DAPX)
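The Min SAT formulation can be checked by brute force on the running example. A sketch under an assumed data layout (not the paper's solver): since each predicate p_i is proven by exposing assertion d_i, minimizing EX means finding the smallest set of assertion indices that covers one atomic rule per collection rule. The exhaustive search is exponential, which is exactly why heuristics are needed at scale.

```python
from itertools import combinations

# Rules r1-r4 of the running example: each rule is a disjunction of atoms,
# each atom a set of predicate (= assertion) indices.
RULES = [
    [{1, 2}, {3, 4}],              # r1 -> c1 (higher_loan)
    [{5, 6, 7}, {4, 8, 9}],        # r2 -> c2 (lower_rate)
    [{1, 6, 7}, {2, 4, 10}],       # r3 -> c3 (longer_duration)
    [{2, 5, 6, 7}, {1, 4, 8, 9}],  # r4 -> c4 (lower_insurance)
]

def satisfies_all(exposed):
    """ER = true: every rule has at least one atom fully exposed."""
    return all(any(atom <= exposed for atom in rule) for rule in RULES)

def minimum_exposure(n_assertions=10):
    # Exhaustive search by increasing size: fine for 10 assertions,
    # hopeless at scale, hence COUENNE and the heuristics (HME, SA*).
    for size in range(n_assertions + 1):
        for combo in combinations(range(1, n_assertions + 1), size):
            if satisfies_all(set(combo)):
                return set(combo)
    return None

print(minimum_exposure())  # {1, 2, 5, 6, 7}: expose d1, d2, d5, d6, d7
```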
Outline
1. Introduction
2. Minimum Exposure Architecture
3. Minimum Exposure Problem
4. Minimum Exposure Resolution
5. Experiments
6. Conclusion & Future Work
Algorithms (see paper)

• Computing exact solutions
- We need to solve a Boolean integer non-linearly constrained problem
- We use a MINLP solver for that (COUENNE)
- But: computation becomes too long when the problem instance grows

• Computing approximate solutions
- Purely random solution (RAND*):
  Takes at random 1 atomic rule per collection rule
  (Repeats the process up to a time limit)
  Produces the best result
- Random solution improved with meta-heuristics (SA*, using simulated annealing as a representative):
  Takes at random 1 atomic rule per simplified collection rule
  Tries to improve that solution (using simulated annealing)
  (Restarts the process up to a time limit)
  Produces the best result
- Using specific heuristics: HME, PDS-ME, …
  For each atom in the list:
  1. Keep the atom
  2. Compute Score 1: number of additional assertions to keep to prove the atom
  3. In case of equality: for each atom where Score 1 is minimum, compute Score 2: number of additional predicates proven in remaining atoms
  4. Keep the atom with minimum Score 1, breaking ties with |atom| − Score 2
  NB: finding heuristics requires knowledge of the problem topology…

Rules (running example):
r1: (p1 ∧ p2) ∨ (p3 ∧ p4) → c1
r2: (p5 ∧ p6 ∧ p7) ∨ (p4 ∧ p8 ∧ p9) → c2
r3: (p1 ∧ p6 ∧ p7) ∨ (p2 ∧ p4 ∧ p10) → c3
r4: (p2 ∧ p5 ∧ p6 ∧ p7) ∨ (p1 ∧ p4 ∧ p8 ∧ p9) → c4

MINLP formulation of the running example (COUENNE input):
var b1 binary; ... var b10 binary;
minimize EX:
b1+b2+b3+b4+b5+b6+b7+b8+b9+b10;
subject to
r1: b1*b2 + b3*b4 >= 1;
r2: b5*b6*b7 + b4*b8*b9 >= 1;
r3: b1*b6*b7 + b2*b4*b10 >= 1;
r4: b2*b5*b6*b7 + b1*b4*b8*b9 >= 1;

Solution: B = {true, true, false, false, true, true, true, false, false, false}

[Figure: step-by-step trace of the heuristic on rules r1–r4, showing score1[i] and score2[i] for atom[1] to atom[8], the kept nodes, and the covered collection rules at each step.]
Outline
1. Introduction
2. Minimum Exposure Architecture
3. Minimum Exposure Problem
4. Minimum Exposure Resolution
5. Experiments
6. Conclusion & Future Work
Results on synthetic graphs [PST 2012]

[Figure: three plots comparing COUENNE, HME, SA* and RAND*: exposure reduction (%) vs |R|, the number of collection rules (0 to 700); exposure reduction (%) vs |D|, the number of documents (10 to 10,000, log scale); and execution time (sec.) vs |D|, the number of documents (0 to 7,000).]

Conclusion:
- The privacy gain is (almost always) important
- The scope of the exact solution is limited
- HME is a good approximation algorithm
Framework to obtain rule sets from real data [Fund. Info. 2013]

Experimental Framework:
1) Problem transformation: multi-label dataset & PT3/PT4 classification algorithms
   => single-class datasets
2) Single-label classification: single-class datasets & JRIP
   => association rules (dumped into CSV files)
3) Collection rules generator: rules & graph generator
   => multi-label rule set (full graph)
4) Application instantiation: total graph & data instances
   => local graphs
Results on real data [Fund. Info. 2013]
ENRON: 1702 e-mails and 1001 nominal attributes, categorized into 53 different labels.
MEDICAL: a sampling of patients' chest x-ray and renal. It contains 978 instances and 1449
nominal attributes, which fall into 45 different labels (diseases).
- 21
Results on social scenario [EDBT 2013 (demo)]

Dataset built with the CG78: 63 labels, 258 atoms, 440 predicates.
The rules must be hidden, due to the discretionary nature of the decision and to avoid applicants gaming the system.
Ongoing work
• Study appropriate metrics to compute the “exposure” value
- Collection rules, the minimization algorithm, and the privacy/utility metrics used to perform minimization (and also potential background knowledge) can be exploited to infer more data than what is exposed
- Our current goals:
  1) capture the impact of such knowledge on the exposure
  2) adapt / find new algorithms to take this into consideration
- Recently started with Marouane
- (initial considerations will be presented next)
• Linearization of the problem (with Bertrand Le Cun)
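The linearization direction mentioned above typically replaces each product of binary variables by an auxiliary variable, turning the MINLP into an integer linear program. A standard textbook sketch (not necessarily the formulation being developed in this work):

```latex
% Linearizing a product of binaries y = b_1 b_2, with b_1, b_2 \in \{0,1\}:
y \le b_1, \qquad
y \le b_2, \qquad
y \ge b_1 + b_2 - 1, \qquad
y \in \{0,1\}
% e.g. constraint r1: b_1 b_2 + b_3 b_4 \ge 1 becomes
% y_{12} + y_{34} \ge 1, with the three inequalities above for each y.
```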
Thanks
Questions ?
Rennes 2013
Current Implementations of Limited Data Collection

• Transposition to web sites: pioneering work P3P
- Transposes the need-to-know and consent principles to web sites
- Highlights conflicting policies, but no way to calibrate the data exposed by a user
• Representative work: Hippocratic Databases (IBM)
- (Personal) attribute values are collected to achieve purposes the user consents to
- Assumption: the data required for a purpose can be distinguished at collection time
  Holds for simple cases (ordering online => collecting the delivery address)
- But: does not hold when usefulness depends on data content (a very common case)
  E.g., what data is useful to decide whether to grant a loan to an applicant?
  income = $30,000 and age < 25 may be enough, or income = $50,000 regardless of age
• Trust negotiation and credential-based access control (interactions between strangers)
- Credentials are exchanged while preserving privacy guarantees (including limited collection)
- Simple queries, few credentials
- Techniques do not scale

[Diagram: Bob (credentials, disclosure policies) and Alice (resource R, access control policy, credentials, disclosure policies) run an SMC protocol that computes the minimum set of credentials to exchange.]
The Heuristic Minimum Exposure algorithm

• For each assertion in the list:
- Remove the assertion
- Compute the number of assertions that must remain exposed so as not to lose any class
- Restore the assertion
• Remove the assertion with the lowest score and repeat the process while an assertion can be deleted
• Example: each predicate pi is proven by assertion di
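The loop above can be sketched as follows. A simplified version (assumed data layout; removals here are attempted in plain index order rather than by the lowest-score choice of the actual algorithm): an assertion is dropped whenever every collection rule remains provable without it.

```python
# Each predicate p_i is proven by assertion d_i, so the rules r1-r4 of the
# running example are written directly over assertion indices.
RULES = [
    [{1, 2}, {3, 4}],
    [{5, 6, 7}, {4, 8, 9}],
    [{1, 6, 7}, {2, 4, 10}],
    [{2, 5, 6, 7}, {1, 4, 8, 9}],
]

def all_labels_kept(exposed):
    """No class is lost: every rule still has one fully exposed atom."""
    return all(any(atom <= exposed for atom in rule) for rule in RULES)

def hme_greedy(n=10):
    exposed = set(range(1, n + 1))   # start from the complete form
    changed = True
    while changed:                   # repeat while an assertion can go
        changed = False
        for d in sorted(exposed):
            if all_labels_kept(exposed - {d}):
                exposed.discard(d)   # removing d loses no class
                changed = True
    return exposed

print(hme_greedy())  # {2, 3, 4, 5, 6, 7, 10}: 7 exposed assertions
```

On the running example this local search keeps 7 assertions ({d2, d3, d4, d5, d6, d7, d10}, matching the example on the next slide), versus the 5-assertion optimum found by the exact solver on the same instance, which illustrates the approximation trade-off.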
Example (continued): to keep every class, the exposed set is | { d3, d4, d2, d10, d5, d6, d7 } | = 7 assertions.
Adaptation of HME to attribute θ value documents

• Consider the same problem, but with assertions of the form attribute θ value instead of attribute = value, and a rule set extended accordingly
• The problem is more difficult (several assertions can prove the same predicate)
• The exposure metric must be adapted to tackle attribute θ value assertions
- The exposure of an assertion a θ v is 1 − ( |{x ∈ Da : x θ v = true}| − 1 ) / |Da|
  E.g., if Dsalary = [0; 100,000], Exp(p1) = 1 and Exp(p11) = 0.1
- The exposure of a set of assertions is the SUM, over each attribute, of the MAX of the exposures of the predicates for that attribute
• With this new metric, HME exposes salary > $20,000 instead of salary = $35,000

MINLP formulation with the new metric (exposing a stronger salary predicate also proves the weaker ones: b1 ⇒ b12 ⇒ b11):
minimize EX2:
0.1*AND(b11, AND(NOT(b12), NOT(b1))) + 0.2*AND(b12, NOT(b1)) + 0.3*b1
+ b2+b3+b4+b5+b6+b7+b8+b9+b10;
subject to
r1: b11*b2 + b3*b4 >= 1;
r2: b5*b6*b7 + b4*b8*b9 >= 1;
r3: b12*b6*b7 + b2*b4*b10 >= 1;
r4: b2*b5*b6*b7 + b1*b4*b8*b9 >= 1;
imp1: b12 - b11 <= 0;
imp2: b1 - b12 <= 0;
PDS-ME algorithm: lower complexity

• For each atom in the list:
- Keep the atom
- Compute Score 1: number of additional assertions to keep to prove the atom
- In case of equality: for each atom where Score 1 is minimum, compute Score 2: number of additional predicates proven in remaining atoms
- Keep the atom with minimum Score 1, breaking ties with Score 2

• Example: on the running example’s rules r1–r4, the algorithm yields
  B = {true, true, false, false, true, true, true, false, false, false}

[Figure: step-by-step trace showing score1[i] and score2[i] for atom[1] to atom[8], the kept nodes, and the covered collection rules at each step.]