The Minimum Exposure Framework
Demonstrating how to Limit Data Collection in Application Forms

Nicolas Anciaux, INRIA Rocquencourt
Benjamin Nguyen, INRIA Rocquencourt & U. of Versailles St Quentin
Michalis Vazirgiannis, LIX & Athens U. of Economics and Business
With
Wallid Bezza, Conseil General des Yvelines
Danae Boutara, Ecole Polytechnique
Bertrand Le Cun, U. of Paris-X Nanterre & PRiSM
Marouane Fazouane, ENSTA ParisTech

Rennes, 23 May 2013

Personal data kept by individuals
[Figure: Bob and his personal data, with its sources: hospital, doctor's office, bank, employer, telco]
Observation:
• Individuals receive official documents in electronic form
• Users keep these documents carefully
  - On their PC, in the cloud, etc.
• The documents are subsequently used as evidence
Application: services with complex decision processes
  - E-administration: complex public decision processes, e.g., paying taxes, GEVA / CG78, …
  - Bank or insurance: e.g., calibrating the rate and duration of a consumer loan, insurance pricing, …
  - Access control decisions in open environments: e.g., evaluation of ABAC policies

Current practices for User / Service Interaction
• Services (online or offline) provide forms (1) that users must fill in (2), using their personal information, in order to apply for services.
• This data is processed (3) automatically, semi-automatically or manually in order to provide a customized decision for each user application (4).
• Data is stored for accountability reasons.
[Figure: Application Evaluation Process. On the SP side: form generation, evaluation (③) by the decision making system, and the SP's store used for audit and accountability. On the side of applicant A: form filling from the user's store (telco, bank, work, etc.). Exchanged items: empty form (①), complete form (②), service proposal (④). The diagram is flagged /!\ LIMITED DATA COLLECTION /!\]

Limited Data Collection Principle: protect the user's privacy
• A well-known privacy principle:
  - Its goal is to limit the dissemination of personal information with regard to a purpose
  - This principle is adopted worldwide.
• Limited Data Collection in privacy laws and directives:
  - Australia: Privacy Act (1988)
  - Canada: Personal Information Protection and Electronic Documents Act (PIPEDA), 2000
  - EU: European 95/46/EC directive (and 2012-0011 (COD) General Data Protection Regulation)
  - Worldwide: OECD Guidelines (Prot. of Privacy and Transborder Flows of Personal Data), 1980
• A general feeling: too much data is disclosed
  Over 50% of Europeans feel they are asked for more data than necessary; 70% are concerned

Is limited collection also beneficial for service providers?
• Is there a financial cost if a company requests more information than required? (With companies, measure in $)
• According to recent studies (Ponemon Institute, Forrester Research, 2011 annual report)
  - Breach frequency is high: 90% of companies in 2011
  - The cost per breached tuple is huge: about $200 per tuple on average
  - The cost depends on the volume and type of content (social, health, finance, credit card, etc.), and falls mainly along two dimensions:
    Ex-post response (20%): actions taken to help the victims minimize the harm
    Lost business (50%): direct consequence of the negative publicity
  - The law makes the service provider responsible for the data breach
• CG78: manual checking of application forms (a time cost per data item)

The more data collected, the greater the cost of a data breach.
The more data collected, the greater the cost of processing.
The more data collected, the less privacy for the applicant.

The Minimum Exposure Principle
This work introduces the following principle (a strict interpretation of LDC):
"Only a minimum subset of the data required for any given purpose should be collected"
GOALS:
  Minimality: with a strict understanding of LDC
  Accountability: the collected information must remain comprehensible so that it can be verified (e.g., cross-checked with internal databases or copies of other official documents)
  Broad spectrum: accommodate any kind of decision making technique (i.e., binary, multi-class and multi-label classifiers)
  Scalability: users must be able to hold an unlimited number of documents
Main problem (Minimality): determining a priori what data is necessary is difficult (NP-hard).

Outline
1. Introduction
2. Minimum Exposure Architecture
3. Minimum Exposure Problem
4. Minimum Exposure Resolution
5. Experiments
6. Conclusion & Future Work

The Minimum Exposure Architecture
• To protect the privacy of an individual, we introduce a Service Application Process
• New modules:
  - Collection Rule Generator
  - Minimum Exposure
  - Form Scoring
• NB: taking the decision at the client side and transmitting the results to the SP would not comply with the accountability requirement
[Figure: Minimum Exposure architecture. Application Evaluation Process (SP): form generator, collection rules generator, evaluation (③) by the decision making system, SP's store for audit and accountability. Service Application Process (applicant A): user's store fed by DPs (telco, bank, work, etc.), form filling, form scoring, Minimum Exposure. Exchanged items: collection rules, complete form, blanked form, service proposal.]

Private vs Public Collection Rules [EDBT 2013 (demo)]
• Decision processes must in general be
  - Comprehensible by humans
  - Justifiable
  - Public
  - E.g., tax services, social services, public health care system
• Decision processes are sometimes part of the business model (and thus secret), or intrinsically private.
  - Introduction of a Trusted Third Party (trusted by SP and A)
  - Can in general be executed on the SP
[Figure: the same architecture with a TTP hosting Minimum Exposure and Form Scoring between the Application Evaluation Process and the Service Application Process (empty form, form filling, complete form, blanked form, service proposal).]

Ingredients: collection rules
• To be expressive enough, the rules must cover classical classification scenarios (complex decision making processes are based on multi-label decision trees)
• We consider several collection rules ri, each leading to a certain label li
• A collection rule ri is a disjunction of atomic rules aij
• Each atomic rule aij is a conjunction of predicates pijk
• The predicates are of the form (attribute θ value) with θ ∈ {<, =, >, ≤, ≥, ≠}
• E.g., higher loans are offered to wealthy customers
  Collection rule r1: (year_income > $30K ∧ assets > $100K) ∨ (collateral > $50K ∧ life_insurance = 'yes') ⇒ higher_loan
  Callouts on the slide: predicate p112 (assets > $100K), atomic rule a12 (collateral > $50K ∧ life_insurance = 'yes'), label l1 (higher_loan)

Ingredients: user assertions
• The user produces assertions (i.e., signed predicates, corresponding to values in a document or produced and endorsed by the user), which are exposed to prove the rules (via proof of predicates)
  - Assertions validate predicates in collection rules
  - When one atomic rule is proven, the collection rule is also proven (and the associated benefit can be obtained)
• We focus on assertions of the form attribute = value
  - Makes sense for data producers (no technical problem to sign values individually)
  - We can easily extend to assertions of the form attribute θ value

Ingredients: exposure metrics
• From the user privacy perspective:
  - Can be measured by information loss metrics
  - E.g., minimal distortion [Samarati & Sweeney], ILoss [Xiao et al.].
• From the service provider perspective (cost of data breach):
  - Information quantity and type are meaningful (Forrester, Ponemon)
  - Again, an information loss metric is enough
• Supported metrics:
  - Any metric computing the cost from each data item independently
  - Not captured: e.g., metrics based on exposure history, association, …
• For simplicity, we take a basic data exposure measure: EX = |{ exposed assertions }|
• Other metrics are possible if they can be modeled by an objective function using only assertions as inputs.

Running example: consumer loan scenario
• Unconditionally: $5.000, 10% rate, 1 year duration, $50/month job loss protection.
• Wealthy customers: higher loan of $10.000
  (income>$30.000 and assets>$100.000) or (collateral>$50.000 and life_insurance='yes')
• Families and honest youngsters: part of the loan granted at 0% rate
• High revenue families and low risk people: longer duration of 2 years
• Rich families and promising young workers: 30% discount on job loss protection
Rules:
  r1: (p1 ∧ p2) ∨ (p3 ∧ p4) ⇒ c1
  r2: (p5 ∧ p6 ∧ p7) ∨ (p4 ∧ p8 ∧ p9) ⇒ c2
  r3: (p1 ∧ p6 ∧ p7) ∨ (p2 ∧ p4 ∧ p10) ⇒ c3
  r4: (p2 ∧ p5 ∧ p6 ∧ p7) ∨ (p1 ∧ p4 ∧ p8 ∧ p9) ⇒ c4
Predicates:
  p1: year_income>$30.000
  p2: assets>$100.000
  p3: collateral>$50.000
  p4: life_insurance=yes
  p5: tax_rate>10%
  p6: married=true
  p7: children>0
  p8: education_level=univ
  p9: age<30
  p10: insurance_claims<$5.000
Decisions:
  c1 = higher_loan
  c2 = lower_rate
  c3 = longer_duration
  c4 = lower_insurance
Assertions:
  d1: year_income=$35.000
  d2: assets=$150.000
  d3: collateral=$75.000
  d4: life_insurance=yes
  d5: tax_rate=11.5%
  d6: married=true
  d7: children=1
  d8: education_level=univ
  d9: age=25
  d10: insurance_claims=$250

The Minimum Exposure problem
• Form is a set of assertions (attribute=value) owned by the user
• R is the set of atomic rules belonging to collection rules that are validated by Form
  - A predicate p is validated by Form iff ∃ d ∈ Form : d ⇒ p
• We build a Boolean formula as follows: we simplify the collection rules by removing the atomic rules ∉ R, we take the conjunction of those simplified collection rules, and we replace each predicate pijk with a Boolean value B(pijk):
    ER = ∧i ( ∨j ( ∧k B(pijk) ) )        (Min (Weighted) SAT)
  with B(pijk) = true if a data item d ∈ Form such that d ⇒ pijk is exposed, and false otherwise
• Computing the minimum exposure of Form means finding a truth assignment T of the variables B(pijk) such that ER = true and EX(Form | T) is minimum
• This optimisation problem is NP-hard
• … and the approximability results are also negative (not in APX; 0-DAPX for the differential approximation ratio)

Algorithms (see paper)
• Computing exact solutions
  - We need to solve a Boolean integer non-linearly constrained problem
  - We use a MINLP solver for that (COUENNE)
  - But: computation becomes too long when the problem instance grows
  Model for the running example:
    var b1 binary; ... var b10 binary;
    minimize EX: b1+b2+b3+b4+b5+b6+b7+b8+b9+b10;
    subject to
    r1: b1*b2 + b3*b4 >= 1;
    r2: b5*b6*b7 + b4*b8*b9 >= 1;
    r3: b1*b6*b7 + b2*b4*b10 >= 1;
    r4: b2*b5*b6*b7 + b1*b4*b8*b9 >= 1;
• Computing approximate solutions
  - Purely random solution (RAND*)
    Takes at random 1 atomic rule per collection rule
    (Repeats the process up to a time limit)
    Produces the best result
  - Random solution improved with metaheuristics (SA*), using simulated annealing as a representative
    Takes at random 1 atomic rule per simplified collection rule
    Tries to improve that solution (using simulated annealing)
    (Restarts the process up to a time limit)
    Produces the best result
  - Using specific heuristics: HME, PDS-ME, …
    For each atom in the list:
    1. Keep the atom
    2. Compute Score 1: number of additional assertions to keep to prove the atom
    3. In case of equality: for each atom where Score 1 is minimum, compute Score 2: number of additional predicates proven in the remaining atoms
    4. Keep the atom with min(score 1 || (|atom| - score 2))
  NB: finding heuristics requires knowledge of the problem topology…
[Table: step-by-step trace on the running example (score1[i] and score2[i] for atom[1] to atom[8] at steps 1 to 4, kept nodes, covered collection rules, final B[i]). Result: B={true,true,false,false,true,true,true,false,false,false}]

Results on synthetic graphs [PST 2012]
[Figure: exposure reduction (%) as a function of |R|, the number of collection rules, and of |D|, the number of documents (log. scale), for COUENNE, HME, SA* and RAND*; execution time (sec.) as a function of |D| for COUENNE versus HME, RAND*, SA*]
Conclusion
  - The privacy gain is (almost always) significant
  - The scope of the exact solution is limited
  - HME is a good approximation algorithm

Framework to obtain rule sets from real data [Fund. Info. 2013]
Experimental Framework:
1) Problem transformation: multi-label dataset & PT3/PT4 classification algorithms => single-class datasets
2) Single-label classification: single-class datasets & JRIP => association rules (dumped into csv files)
3) Collection rules generator: rules & graph generator => multi-label rule set (full graph)
4) Application instantiation: total graph & data instances => local graphs

Results on real data [Fund. Info. 2013]
ENRON: 1702 e-mails and 1001 nominal attributes, categorized into 53 different labels.

Current Implementations of Limited Data Collection
• Transposition to web sites: pioneer work P3P
  - Transposes the need-to-know and consent principles to web sites
  - Highlights conflicting policies, but gives no way to calibrate the data exposed by a user
• Representative work: Hippocratic Databases (IBM)
  - Attribute values (personal) are collected to achieve purposes the user consents to
  - Assumption: the data required for a purpose can be distinguished at collection time
    Holds for simple cases (ordering online => collecting the delivery address)
  - But: does not hold when usefulness depends on data content (very common case)
    E.g., what data is useful to decide to grant a loan to an applicant?
    income=$30.000 and age<25 may be enough, or income=$50.000 regardless of age
• Trust negotiation and credential-based access control (interactions between strangers)
  - Credentials are exchanged while preserving privacy guarantees (including limited collection)
  - Simple queries, few credentials
  - Techniques do not scale
[Figure: Bob (credentials, disclosure policies) and Alice (resource R, access control policy, credentials, disclosure policies); an SMC computes the minimum set]

The Heuristic Minimum Exposure algorithm
• For each assertion in the list
  - Remove the assertion
  - Compute the number of assertions that must be exposed so as not to lose any class
  - Restore the assertion
• Remove the assertion with the lowest score, and repeat the process while an assertion can be deleted
• Example (each predicate pi is proven by assertion di): |{ d3, d4, d2, d10, d5, d6, d7 }| = 7

Adaptation of HME to attribute θ value documents
• Consider the same problem, but with assertions of the form attribute θ value instead of attribute = value, and with the rule set encoded in the model at the end of this slide
• The problem is more difficult (several assertions can prove the same predicate)
• The exposure metric must be adapted to tackle attribute θ value assertions
  - The exposure of an assertion a θ v is 1 – ( |{x ∈ Da : x θ v = true}| – 1 ) / |Da|
    E.g., if Dsalary = [0; 100.000], Exp(p1)=1 and Exp(p11)=0.1
  - The exposure of a set of assertions is the SUM, over the attributes, of the MAX of the exposure of the predicates for that attribute
    E.g., Exp(p1, p11, p2) = MAX(1, 0.1) + 1
• With this new metric, HME gives: salary > $20.000 is exposed instead of salary = $35.000
  Model:
    minimize EX2: 0.1*AND(b11, AND(NOT(b12), NOT(b1))) + 0.2*AND(b12, NOT(b1)) + 0.3*b1 + b2+b3+b4+b5+b6+b7+b8+b9+b10;
    r1: b11*b2 + b3*b4 >= 1;
    r2: b5*b6*b7 + b4*b8*b9 >= 1;
    r3: b12*b6*b7 + b2*b4*b10 >= 1;
    r4: b2*b5*b6*b7 + b1*b4*b8*b9 >= 1;
    imp1: b12-b11 <= 0;
    imp2: b1-b12 <= 0;

PDS-ME algorithm: lower complexity
• For each atom in the list
  - Keep the atom
  - Compute Score 1: number of additional assertions to keep to prove the atom
  - In case of equality: for each atom where Score 1 is minimum, compute Score 2: number of additional predicates proven in the remaining atoms
  - Keep the atom with minimum score 1 || score 2
• Example:
[Table: step-by-step trace on the running example (score1[i] and score2[i] for atom[1] to atom[8] at steps 1 to 4, kept nodes, covered collection rules, final B[i]). Result: B={true,true,false,false,true,true,true,false,false,false}]
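
The exact model of the running example (minimize the number of exposed assertions subject to r1 to r4) is small enough to cross-check by exhaustive search. The sketch below is only an illustration under stated assumptions: it is not the COUENNE formulation, HME or PDS-ME, and the names RULES, proves_all and minimum_exposure are hypothetical. It assumes, as in the slides, that predicate pi is proven by assertion di and that EX is the number of exposed assertions.

```python
from itertools import combinations

# Collection rules of the running example. Each rule is a list of atomic
# rules; each atomic rule is the set of predicate indices it conjoins.
# Assumption taken from the slides: predicate pi is proven by assertion di.
RULES = {
    "r1": [{1, 2}, {3, 4}],
    "r2": [{5, 6, 7}, {4, 8, 9}],
    "r3": [{1, 6, 7}, {2, 4, 10}],
    "r4": [{2, 5, 6, 7}, {1, 4, 8, 9}],
}
N_ASSERTIONS = 10  # d1..d10


def proves_all(exposed, rules=RULES):
    """True if every collection rule has at least one fully proven atomic rule."""
    return all(any(atom <= exposed for atom in atoms) for atoms in rules.values())


def minimum_exposure(rules=RULES, n=N_ASSERTIONS):
    """Smallest set of assertion indices proving all rules.

    Exhaustive search by increasing size: exponential in n, so usable
    only on toy instances such as this one.
    """
    for size in range(n + 1):
        for subset in combinations(range(1, n + 1), size):
            if proves_all(set(subset), rules):
                return set(subset)
    return None  # reached only if even the full form fails some rule


best = minimum_exposure()
print(sorted(best), len(best))  # [1, 2, 5, 6, 7] 5
```

The result exposes d1, d2, d5, d6 and d7, which agrees with the vector B={true,true,false,false,true,true,true,false,false,false} shown on the slides. The exponential cost of this search is the reason the talk turns to a MINLP solver for exact solutions and to heuristics for larger instances.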