The Minimum Exposure Framework: Demonstrating How to Limit Data Collection in Application Forms

Nicolas Anciaux, INRIA Rocquencourt
Benjamin Nguyen, INRIA Rocquencourt & U. of Versailles St Quentin
Michalis Vazirgiannis, LIX & Athens U. of Economics and Business
With Wallid Bezza, Conseil General des Yvelines; Danae Boutara, Ecole Polytechnique; Bertrand Le Cun, U. of Paris-X Nanterre & PRiSM; Marouane Fazouane, ENSTA ParisTech

Rennes, May 23, 2013

Personal data kept by individuals
• Observation: individuals receive electronic official documents (from the hospital, the doctor's office, their bank, their employer, their telco, ...), and those documents are treasured by users: on their PC, in the cloud, etc.
• Application: services with complex decision processes, where these documents are subsequently used as evidence:
- E-administration: complex public decision processes, e.g., paying taxes, GEVA / CG78, ...
- Bank or insurance: e.g., calibrating the rate and duration of a consumer loan, insurance pricing, ...
- Access control decisions in open environments: e.g., evaluation of ABAC policies.

Current practices for user / service interaction
• Application evaluation process: services (online or offline) provide forms (1) that users must fill in (2) with their personal information in order to apply for services.
• This data is processed (3) automatically, semi-automatically, or manually in order to produce a customized decision for each user application (4).
• Data is stored for accountability reasons.
[Figure: the user fills the empty form from his own store (telco, bank, work, etc.) and sends the complete form to the SP; the SP's decision making system evaluates it, returns a service proposal, and keeps the form in the SP's store for audit and accountability. The collection step is where limited data collection must apply.]

Limited Data Collection principle: protect the user's privacy
• A well-known privacy principle:
- Its goal is to limit the dissemination of personal information with regard to a purpose.
- This principle is adopted worldwide.
• Limited Data Collection in privacy laws and directives:
- Australia: Privacy Act (1988)
- Canada: Personal Information Protection and Electronic Documents Act (PIPEDA), 2000
- EU: European directive 95/46/EC (and the 2012/0011 (COD) General Data Protection Regulation)
- Worldwide: OECD Guidelines on the Protection of Privacy and Transborder Flows of Personal Data, 1980
• A general feeling of over-disclosure: over 50% of Europeans feel they are asked for more data than necessary, and 70% are concerned.

Is limited collection also beneficial for service providers?
• Is there a financial cost when a company requests more information than required? (With companies, we measure in $.)
• According to recent studies (Ponemon Institute, Forrester Research, 2011 annual reports):
- Breaches are frequent: 90% of companies were hit in 2011.
- The cost per breached tuple is huge: about $200 per tuple on average.
- The cost depends on the volume and type of content (social, health, finance, credit card, etc.), mainly along two dimensions: ex-post response (20%), the actions taken to help the victims minimize the harm; and lost business (50%), the direct consequence of the negative publicity.
- The law makes the service provider responsible for the data breach.
• CG78: manual checking of application forms (a time cost per data item).
The more data collected, the greater the cost of a data breach.
The more data collected, the greater the cost of processing.
The more data collected, the lesser the privacy of the applicant.
The Minimum Exposure principle
This work introduces the following principle (a strict interpretation of LDC): "Only a minimum subset of the data required for any given purpose should be collected."
Goals:
- Minimality: a strict understanding of LDC.
- Accountability: comprehensible information is needed to verify the collected data (e.g., cross-checks with internal databases or with copies of other official documents).
- Broad spectrum: accommodate any kind of decision making technique (i.e., binary, multi-class and multi-label classifiers).
- Scalability: users must be able to hold an unlimited number of documents.
Main problem (minimality): it is (NP-)hard to determine a priori what data is necessary.

Outline: 1. Introduction  2. Minimum Exposure Architecture  3. Minimum Exposure Problem  4. Minimum Exposure Resolution  5. Experiments  6. Conclusion & Future Work

The Minimum Exposure architecture
• To protect the privacy of the individual, we introduce a Service Application Process on the user side, facing the SP's Application Evaluation Process.
• New modules:
- Collection Rule Generator (on the SP side)
- Minimum Exposure
- Form Scoring
• The user fills the complete form locally from his data providers (telco, bank, work, etc.), the Minimum Exposure module blanks out the unnecessary data with respect to the published collection rules, and only the blanked form is sent to the SP for evaluation and audit, instead of the complete form of the unlimited collection approach.
• NB: taking the decision on the client side and transmitting only the result to the SP would not comply with the accountability requirement.
[Figure: the unlimited collection approach (complete form sent to the SP) side by side with the Service Application Process (collection rules published by the SP, Minimum Exposure and Form Scoring on the user side, blanked form sent to the SP).]

Private vs public collection rules [EDBT 2013 (demo)]
• Decision processes must in general be comprehensible by humans, justifiable, and public, e.g., tax services, social services, the public health care system.
• Decision processes are sometimes part of the business model (and thus secret), or intrinsically private:
- Introduction of a Trusted Third Party (trusted by both the SP and the applicant) that runs Minimum Exposure and Form Scoring on the complete form.
- In general, this step can also be executed on the SP itself.

Outline: 1. Introduction  2. Minimum Exposure Architecture  3. Minimum Exposure Problem  4. Minimum Exposure Resolution  5. Experiments  6. Conclusion & Future Work

Ingredients: collection rules
• To be expressive enough, collection rules must cover classical classification scenarios (complex decision making processes are based on multi-label decision trees).
• We consider several collection rules ri, each leading to a certain label li.
• A collection rule ri is a disjunction of atomic rules aij.
• Each atomic rule aij is a conjunction of predicates pijk.
• Predicates are of the form (attribute θ value) with θ ∈ {<, ≤, =, ≥, >, ≠}.
• E.g., higher loans are offered to wealthy customers:
Collection rule r1: (year_income>$30K ∧ assets>$100K) ∨ (collateral>$50K ∧ life_insurance='yes') → higher_loan
(here p112 denotes the predicate assets>$100K, a12 the second atomic rule, and l1 the label higher_loan).

Ingredients: user assertions
• The user produces assertions (i.e., signed predicates, corresponding to values in a document, or produced and endorsed by the user), which are exposed to prove the rules (via proofs of predicates).
- Assertions validate predicates in collection rules.
- When one atomic rule is proven, the collection rule is also proven (and the associated benefit can be obtained).
• We focus on assertions of the form attribute=value:
- This makes sense for data producers (no technical problem to sign values individually).
- We can easily extend to assertions of the form attribute θ value.
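To make these ingredients concrete, here is a minimal Python sketch (our own illustration, not the framework's code; all class and function names are hypothetical) of collection rules as disjunctions of atomic rules over predicates, and of attribute=value assertions validating them:

# Minimal sketch of collection rules and assertions (illustrative only).
from dataclasses import dataclass
import operator

OPS = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
       ">=": operator.ge, ">": operator.gt, "!=": operator.ne}

@dataclass(frozen=True)
class Predicate:                     # attribute theta value, e.g. year_income > 30000
    attribute: str
    op: str
    value: object

    def satisfied_by(self, assertions):
        # An attribute=value assertion validates the predicate if its value implies it.
        v = assertions.get(self.attribute)
        return v is not None and OPS[self.op](v, self.value)

@dataclass
class CollectionRule:                # a disjunction of atomic rules leading to one label
    label: str
    atoms: list                      # each atomic rule is a list of Predicates (a conjunction)

    def proven_atoms(self, assertions):
        # Atomic rules whose predicates are all validated by the user's assertions.
        return [atom for atom in self.atoms
                if all(p.satisfied_by(assertions) for p in atom)]

# Rule r1 of the talk: higher loans for wealthy customers.
r1 = CollectionRule("higher_loan", [
    [Predicate("year_income", ">", 30_000), Predicate("assets", ">", 100_000)],
    [Predicate("collateral", ">", 50_000), Predicate("life_insurance", "=", "yes")],
])

bob = {"year_income": 35_000, "assets": 150_000, "collateral": 75_000,
       "life_insurance": "yes"}
print(len(r1.proven_atoms(bob)))     # 2: both atomic rules are proven, so r1 is granted

A form is then simply a set of signed attribute=value pairs, and a collection rule (and its benefit) is obtained as soon as one of its atomic rules is fully proven.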
Ingredients: exposure metrics
• From the user privacy perspective:
- Exposure can be measured by information loss metrics, e.g., minimal distortion [Samarati & Sweeney], ILoss [Xiao et al.].
• From the service provider perspective (cost of a data breach):
- Information quantity and type are what matters (Forrester, Ponemon); again, an information loss metric is enough.
• Supported metrics:
- Any metric computing the cost of each data item independently.
- Not captured: e.g., metrics based on exposure history, associations between items, ...
• For simplicity, we take a basic data exposure measure: EX = |{exposed assertions}|.
• Other metrics are possible as long as they can be modeled by an objective function using only the assertions as inputs.

Running example: consumer loan scenario
• Unconditionally: $5.000, 10% rate, 1-year duration, $50/month job loss protection.
• Wealthy customers get a higher loan of $10.000: income>$30.000 and assets>$100.000, or collateral>$50.000 and life_insurance='yes'.
• Families and honest youngsters: part of the loan granted at a 0% rate.
• High-revenue families and low-risk people: longer duration of 2 years.
• Rich families and promising young workers: 30% discount on the job loss protection.
Rules:
r1: (p1 ∧ p2) ∨ (p3 ∧ p4) → c1
r2: (p5 ∧ p6 ∧ p7) ∨ (p4 ∧ p8 ∧ p9) → c2
r3: (p1 ∧ p6 ∧ p7) ∨ (p2 ∧ p4 ∧ p10) → c3
r4: (p2 ∧ p5 ∧ p6 ∧ p7) ∨ (p1 ∧ p4 ∧ p8 ∧ p9) → c4
Predicates:
p1: year_income>$30.000   p2: assets>$100.000   p3: collateral>$50.000   p4: life_insurance=yes   p5: tax_rate>10%
p6: married=true   p7: children>0   p8: education_level=univ   p9: age<30   p10: insurance_claims<$5.000
Decisions: c1=higher_loan   c2=lower_rate   c3=longer_duration   c4=lower_insurance
Assertions:
d1: year_income=$35.000   d2: assets=$150.000   d3: collateral=$75.000   d4: life_insurance=yes   d5: tax_rate=11.5%
d6: married=true   d7: children=1   d8: education_level=univ   d9: age=25   d10: insurance_claims=$250

The Minimum Exposure problem
• Form is the set of assertions (attribute=value) owned by the user.
• R is the set of atomic rules, belonging to collection rules, that are validated by Form.
- A predicate p is validated by Form iff ∃ d ∈ Form : d ⇒ p.
• We build a Boolean formula as follows: we simplify the collection rules by removing the atomic rules that are not in R, we take the conjunction of those simplified collection rules, and we replace each predicate pijk by a Boolean variable B(pijk):
ER = ∧_i ( ∨_j ( ∧_k B(pijk) ) )
with B(pijk) = true if a data item d ∈ Form such that d ⇒ pijk is exposed, and false otherwise. This makes the minimization a Min (Weighted) SAT problem.
• Computing the minimum exposure of Form means finding a truth assignment T of the variables B(pijk) such that ER = true and EX(Form | T) is minimum.
• This optimization problem is NP-hard... and we have bad complexity results (it is not in APX, and its differential approximation ratio puts it in 0-DAPX).
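The following brute-force sketch (illustrative only; real instances are handled by the solver and heuristics of the next section) makes the Min-SAT formulation tangible on the running example: every predicate pi is validated by assertion di, so ER reduces to requiring one fully exposed atomic rule per collection rule, and minimizing EX means finding the smallest such set of assertions.

# Brute-force minimum exposure on the running example (10 predicates, 2^10 assignments).
from itertools import product

# One collection rule = a disjunction of atomic rules; each atomic rule = the set of
# assertion indices (1..10) it needs. Every predicate pi is validated by di, so the
# simplification step keeps all atomic rules here.
rules = [
    [{1, 2}, {3, 4}],              # r1 -> c1 (higher_loan)
    [{5, 6, 7}, {4, 8, 9}],        # r2 -> c2 (lower_rate)
    [{1, 6, 7}, {2, 4, 10}],       # r3 -> c3 (longer_duration)
    [{2, 5, 6, 7}, {1, 4, 8, 9}],  # r4 -> c4 (lower_insurance)
]

best = None
for bits in product([False, True], repeat=10):           # all truth assignments of B(p1)..B(p10)
    exposed = {i + 1 for i, b in enumerate(bits) if b}   # indices of the exposed assertions
    er = all(any(atom <= exposed for atom in rule) for rule in rules)  # ER: every rule proven
    if er and (best is None or len(exposed) < len(best)):              # EX = |exposed|
        best = exposed

print(sorted(best), len(best))   # [1, 2, 5, 6, 7] 5 : expose d1, d2, d5, d6, d7 only

On this instance the unique optimum exposes d1, d2, d5, d6 and d7: five assertions instead of ten.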
Outline: 1. Introduction  2. Minimum Exposure Architecture  3. Minimum Exposure Problem  4. Minimum Exposure Resolution  5. Experiments  6. Conclusion & Future Work

Algorithms (see paper)
• Computing exact solutions:
- We need to solve a Boolean, integer, non-linearly constrained problem.
- We use a MINLP solver for that (COUENNE).
- But computation becomes too long when the problem instance grows.
MINLP model of the running example:
var b1 binary; ... var b10 binary;
minimize EX: b1+b2+b3+b4+b5+b6+b7+b8+b9+b10;
subject to
r1: b1*b2 + b3*b4 >= 1;
r2: b5*b6*b7 + b4*b8*b9 >= 1;
r3: b1*b6*b7 + b2*b4*b10 >= 1;
r4: b2*b5*b6*b7 + b1*b4*b8*b9 >= 1;
On this instance, the minimum-exposure assignment is B = {true, true, false, false, true, true, true, false, false, false}, i.e., only d1, d2, d5, d6 and d7 are exposed.
• Computing approximate solutions:
- Purely random solutions (RAND*): take at random one atomic rule per collection rule, repeat the process up to a time limit, and return the best result.
- Random solutions improved with meta-heuristics (SA*, using simulated annealing as a representative): take at random one atomic rule per simplified collection rule, try to improve that solution with simulated annealing, restart the process up to a time limit, and return the best result. It produces the best approximate results.
- Specific heuristics: HME, PDS-ME, ... For each atom in the list: 1. keep the atom; 2. compute score 1, the number of additional assertions to keep in order to prove the atom; 3. in case of equality, for each atom where score 1 is minimum, compute score 2, the number of additional predicates proven in the remaining atoms; 4. keep the atom minimizing score 1, breaking ties with (|atom| - score 2). (Details in the backup slides.)
- NB: finding good heuristics requires knowledge of the problem topology.
[Figure: worked example of the heuristic on the running rule set, showing score1/score2 per atom, the kept atoms, and the covered collection rules at each step.]

Outline: 1. Introduction  2. Minimum Exposure Architecture  3. Minimum Exposure Problem  4. Minimum Exposure Resolution  5. Experiments  6. Conclusion & Future Work

Results on synthetic graphs [PST 2012]
[Charts: exposure reduction (%) as a function of |R|, the number of collection rules (up to 700), and of |D|, the number of documents (10 to 10,000, log scale), for COUENNE, HME, SA* and RAND*; and execution time (sec.) as a function of |D| for COUENNE versus HME, RAND* and SA*.]
Conclusion:
- The privacy gain is (almost always) important.
- The scope of the exact solution is limited.
- HME is a good approximation algorithm.

Framework to obtain rule sets from real data [Fund. Info. 2013]
Experimental framework:
1) Problem transformation: multi-label dataset & PT3/PT4 classification algorithms => single-class datasets.
2) Single-label classification: single-class datasets & JRIP => association rules (dumped into CSV files).
3) Collection rules generator: rules & graph generator => multi-label rule set (full graph); see the sketch below.
4) Application instantiation: total graph & data instances => local graphs.
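As an illustration of step 3, the sketch below (the CSV layout, one row per learned rule with its label followed by its predicates, and the function name are assumptions of ours, not the format of the actual generator) groups the single-label rules produced by the learner into one collection rule per label, i.e., a disjunction of atomic rules:

# Sketch of a collection-rules generator: group learned single-label rules by label.
import csv
from collections import defaultdict

def load_collection_rules(path):
    # Returns {label: [atomic_rule, ...]}, where an atomic rule is the list of
    # predicate strings of one learned rule (e.g. 'year_income>30000').
    collection_rules = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.reader(f):
            label, predicates = row[0], [p for p in row[1:] if p]
            collection_rules[label].append(predicates)   # one learned rule = one atomic rule
    return collection_rules

# With a file containing, e.g.:
#   higher_loan,year_income>30000,assets>100000
#   higher_loan,collateral>50000,life_insurance=yes
#   lower_rate,tax_rate>10,married=true,children>0
# the two higher_loan rows become the two atomic rules (a disjunction) of rule r1.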
Results on real data [Fund. Info. 2013]
• ENRON: 1702 e-mails and 1001 nominal attributes, categorized into 53 different labels.
• MEDICAL: a sampling of patients' chest X-ray and renal reports; it contains 978 instances and 1449 nominal attributes, which fall into 45 different labels (diseases).

Results on the social scenario [EDBT 2013 (demo)]
• Dataset built with the CG78: 63 labels, 258 atoms, 440 predicates.
• The rules must be hidden, due to the discretionary nature of the decision and to avoid applicants gaming the system.

Ongoing work
• Study appropriate metrics to compute the "exposure" value:
- The collection rules, the minimization algorithm, and the privacy/utility metrics used to perform the minimization (as well as potential background knowledge) can be exploited to infer more data than what is exposed.
- Our current goals: 1) capture the impact of such knowledge on the exposure; 2) adapt or design new algorithms to take it into consideration.
- Recently started with Marouane (initial considerations are presented next).
• Linearization of the problem (with Bertrand Le Cun).

Thanks. Questions? (Rennes, 2013)

Current implementations of Limited Data Collection
• Transposition to web sites: the pioneering work on P3P.
- Transposes the need-to-know and consent principles to web sites.
- Highlights conflicting policies, but offers no way to calibrate the data exposed by a user.
• Representative work: Hippocratic Databases (IBM).
- (Personal) attribute values are collected to achieve purposes the user consents to.
- Assumption: the data required for each purpose can be distinguished at collection time. This holds for simple cases (ordering online => collecting the delivery address).
- But it does not hold when usefulness depends on the data content (a very common case). E.g., what data is useful to decide whether to grant a loan to an applicant? income=$30.000 and age<25 may be enough, or income=$50.000 regardless of age.
• Trust negotiation and credential-based access control (interactions between strangers).
- Credentials are exchanged while preserving privacy guarantees (including limited collection).
- Simple queries, few credentials: the techniques do not scale.
[Figure: Bob and Alice each hold credentials and disclosure policies; a secure multiparty computation (SMC) computes the minimum credential set granting Bob access to Alice's resource R under her access control policy.]

The Heuristic Minimum Exposure (HME) algorithm
• For each assertion in the list:
- Remove the assertion.
- Compute the number of assertions that remain exposed without losing any class.
- Restore the assertion.
• Remove the assertion with the lowest score, and repeat the process while an assertion can still be deleted.
• Example (each predicate pi is proven by assertion di): removing d1 leaves |{d3, d4, d2, d10, d5, d6, d7}| = 7 assertions exposed.
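Below is a small Python rendering of the HME loop just described (the scoring function is our reading of the slide: the score of a tentative removal is the size of the union of the atomic rules that remain provable; the paper's exact computation may differ):

# A plausible rendering of the HME greedy loop on the running example.
rules = [
    [{1, 2}, {3, 4}],              # r1: each atomic rule = set of assertion indices it needs
    [{5, 6, 7}, {4, 8, 9}],        # r2
    [{1, 6, 7}, {2, 4, 10}],       # r3
    [{2, 5, 6, 7}, {1, 4, 8, 9}],  # r4
]

def still_needed(exposed):
    # Union of the atomic rules still provable from `exposed`, or None if a class is lost.
    union = set()
    for rule in rules:
        provable = [atom for atom in rule if atom <= exposed]
        if not provable:
            return None
        union |= set().union(*provable)
    return union

exposed = set(range(1, 11))                      # start from the complete form d1..d10
while True:
    scores = {}
    for d in exposed:                            # tentatively remove each assertion
        kept = still_needed(exposed - {d})
        if kept is not None:                     # no class is lost without d
            scores[d] = len(kept)
    if not scores:                               # every remaining assertion is mandatory
        break
    exposed.remove(min(scores, key=scores.get))  # drop the least useful assertion

print(sorted(exposed))   # [1, 2, 5, 6, 7]

On the running example this loop first drops d4, then the now useless d3, d8, d9 and d10 (in some order), and stops with {d1, d2, d5, d6, d7} exposed, matching the exact MINLP solution.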
Adaptation of HME to attribute-value documents
• Consider the same problem, but instead of having one signed assertion per predicate, we have attribute=value documents, with the adapted rule set shown in the model below.
• The problem is more difficult (several assertions can prove the same predicate).
• The exposure metric must be adapted to handle attribute-value assertions (a small numeric sketch of this metric is given after the PDS-ME slide below):
- The exposure of an assertion a θ v is 1 - ( |{x ∈ Da : a θ v = true}| - 1 ) / |Da|. E.g., if Dsalary = [0; 100.000], the exposure of the exact value salary=$35.000 is 1, while the exposure of a coarse predicate such as salary>$10.000 is about 0.1.
- The exposure of a set of assertions is the SUM, over each attribute, of the MAX of the exposures of the predicates exposed for that attribute, e.g., Exp(p1, p11, p2) = MAX(1, 0.1) + 1.
• On the running example, the adapted problem becomes:
minimize EX2: 0.1*AND(b11, AND(NOT(b12), NOT(b1))) + 0.2*AND(b12, NOT(b1)) + 0.3*b1 + b2+b3+b4+b5+b6+b7+b8+b9+b10;
subject to
r1: b11*b2 + b3*b4 >= 1;
r2: b5*b6*b7 + b4*b8*b9 >= 1;
r3: b12*b6*b7 + b2*b4*b10 >= 1;
r4: b2*b5*b6*b7 + b1*b4*b8*b9 >= 1;
imp1: b12 - b11 <= 0;
imp2: b1 - b12 <= 0;
• With this new metric, HME exposes salary > $20.000 instead of salary = $35.000.

PDS-ME algorithm: lower complexity
• For each atom in the list:
- Keep the atom.
- Compute score 1: the number of additional assertions to keep in order to prove the atom.
- In case of equality, for each atom where score 1 is minimum, compute score 2: the number of additional predicates proven in the remaining atoms.
- Keep the atom with minimum score 1, breaking ties with score 2.
• Example: on the running rule set, the resulting assignment is B = {true, true, false, false, true, true, true, false, false, false}, i.e., d1, d2, d5, d6 and d7 are exposed.
[Figure: step-by-step execution showing score1/score2 per atom, the kept atoms, and the covered collection rules at each step.]
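To make the adapted metric concrete, here is a small numeric sketch (our illustration only; treating Dsalary as 100,000 integer values, and using salary>$10.000 as a guess for the predicate behind the 0.1 value quoted above):

# Exposure of attribute-theta-value predicates over a finite domain, following
# Exp(a theta v) = 1 - ( |{x in Da : a theta v}| - 1 ) / |Da|.
from collections import defaultdict

def predicate_exposure(domain_size, satisfying_count):
    # Exposure of one predicate, given how many domain values satisfy it.
    return 1 - (satisfying_count - 1) / domain_size

def set_exposure(predicates):
    # Sum over attributes of the maximum exposure of the predicates exposed for
    # that attribute; `predicates` is a list of (attribute, exposure) pairs.
    per_attribute = defaultdict(float)
    for attribute, exposure in predicates:
        per_attribute[attribute] = max(per_attribute[attribute], exposure)
    return sum(per_attribute.values())

D_SALARY = 100_000                               # Dsalary = [0; 100.000]
exact  = predicate_exposure(D_SALARY, 1)         # salary = $35.000  -> 1.0
coarse = predicate_exposure(D_SALARY, 90_000)    # salary > $10.000  -> ~0.1
print(round(exact, 3), round(coarse, 3))         # 1.0 0.1

# Exposing both salary predicates plus the exact assets value counts the salary
# attribute only once, at its most revealing (maximum exposure) predicate.
print(set_exposure([("salary", exact), ("salary", coarse), ("assets", 1.0)]))   # 2.0

Exposing a coarse predicate that many domain values satisfy costs little, whereas an exact value always costs 1; summing the per-attribute maxima counts each attribute once, at its most revealing exposed predicate.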