Content-Based Access Control

Wenrong Zeng
wrzeng@ittc.ku.edu
Dept. of Electrical Engineering and Computer Science
Advisor: Dr. Luo
Committee Members: Dr. Agah, Dr. Grzymala-Busse, Dr. Kulkarni, Dr. Ho

Acknowledgements
I owe my thanks to my committee members:
• Dr. Luo
• Dr. Agah
• Dr. Grzymala-Busse
• Dr. Kulkarni
• Dr. Ho

Outline
• Introduction
• The Content-Based Access Control Model
• Content-Based Access Control Enforcement
• Content-Based Blocking and Tagging
• Multi-Label Learning
• Discussion and Conclusion

Introduction
• Education Background & Publication
• Motivation & Goal
• Background & Related Work

Education Background
• B.S., Peking University, 2006 – Major: Electrical Engineering
• M.E., Chinese Academy of Sciences, 2009 – Major: Computer Science
• PhD student, University of Kansas – Major: Computer Science

Publications
– Wenrong Zeng, Yuhao Yang, Bo Luo: Using Data Content to Assist Access Control for Large-Scale Content-Centric Databases. In IEEE International Conference on Big Data (IEEE BigData), 2014 (Acceptance rate: 18.5%)
– Wenrong Zeng, Xuewen Chen, Hong Cheng: Pseudo labels for imbalanced multi-label learning. DSAA 2014: 25-31.
– Wenrong Zeng, Yuhao Yang, Bo Luo: Access control for big data using data content. In IEEE International Conference on Big Data (IEEE BigData), 2014: 45-47 (Poster).
– Wenrong Zeng, Xuewen Chen, Hong Cheng and Jing Hua: Multi-Space Learning for Image Classification Using AdaBoost and Markov Random Fields. Solving Complex Machine Learning Problems with Ensemble Methods Workshop, 2013.
– Yi Jia, Wenrong Zeng, Jun Huan: Non-stationary Bayesian networks based on perfect simulation. In ACM Conference on Information and Knowledge Management, 2012: 1095-1104.
  (Acceptance rate: 13.4%)
– Weiming Hu, Dan Xie, Zhouyu Fu, Wenrong Zeng, Stephen J. Maybank: Semantic-Based Surveillance Video Retrieval. IEEE Transactions on Image Processing 16(4): 1168-1181 (2007).

Motivation
• Data tends to be content-centric.
  – Healthcare: 500 million patient databases nationwide.
  – Telecom: the largest single database by volume, AT&T's calling records, comprises 312 TB.
  – Business: by 2020, business transactions on the internet will reach 450 billion per day (IDC).
• Nearly every field faces big data issues: volume, variety, velocity.
  Image: http://stablekernel.com/blog/wp-content/uploads/2015/02/Big-Data.jpg
• With big data, conventional database access control mechanisms may be insufficient.
• Long-term goal: smart access control decisions for big data without extensive labor by the DBA.
  Image: http://blog.varonis.com/big-data-security/
• Example
  – A law enforcement agency (e.g. the FBI) holds a database of highly sensitive case records.
    • Large number of records
    • Unstructured content
  – A supervisor assigns a case to agent Alice for investigation.
  – Naturally, the supervisor also needs to grant Alice access to all related or similar cases.
• Manual assignment
  – The supervisor manually selects "related cases".
  – Extremely labor intensive, practically impossible.
• Multi-level security
  – Alice can access all cases with equal or lower security levels.
  – Over-privileged users!
• Attribute-based access control
  – E.g. Alice can access all robbery records from the last 5 years, in Area X, in which the suspect is 6 feet tall.
  – Attributes require manual input and are usually not available.

Goals
Smart access control decisions.
• Develop a content-based access control model that is data-driven.
• Enforce the content-based access control model efficiently.
• Assumptions:
  – Basic privileges: users are authenticated with basic trust (e.g. with MLS).
  – Data-driven: with large amounts of content-centric data, the access control model must be data-driven.
  – Lack of explicit authorization
  – Approximation is allowed

Conventional Methods
• Role-Based Access Control: Bob (subject), an adult (role), can drink wine (object).
• Attribute-Based Access Control: Bob (subject), who is 24 years old (age attribute), can drink wine (object).

Current Issues
• It is difficult to define granular access controls.
• Existing models lack the ability to implement abstract access control policies (e.g. "similar documents").
• Access control models are NOT content-driven.
"A truly comprehensive approach for data protection must include mechanisms for enforcing access control policies based on data contents …."
E. Bertino et al. Access Control for Databases: Concepts and Systems. Vol. 3. No. 1-2. Now Publishers Inc, 2011.

Text Feature Extraction
– TF-IDF: Term Frequency Inverse Document Frequency
  $\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \log \frac{N}{|\{d \in D : t \in d\}|}$
– Topic Modeling: Non-negative Matrix Factorization based on TF-IDF
Both are inherently term-distributed features.

Text Feature Extraction
– Where term-distributed features fall short:
  D1: privacy preserving similarity assessment for semi-structured data
  D2: private XML document matching
  According to TF-IDF, the cosine similarity of D1 and D2 is 0, because the two titles share no terms.

Text Feature Extraction
– TAGME: topic modeling with Wikipedia curated annotation

  Doc No.  Word(s)               Topic Annotation            Weight
  D1       privacy               Privacy                     0.1279
           preserving            Historic preservation       0.0017
           similarity            Homology (biology)          0.0354
           assessment            Homology (biology)          0.0521
           for                   –                           –
           semi-structured data  Semi-structured data        0.2727
  D2       private               Privacy                     0.1256
           XML                   XML                         0.5375
           document              Document management system  0.1475
           matching              Matching principle          0.0509

The Content-Based Access Control Model
• Two-phase model
  – Initial authorization
  – Content-based authorization
• Initial authorization
  – Conventional access control policy: ACR = [subject, object, action, sign]
  – Each CBAC user is explicitly granted access to a small set of records: the seed set (base set).
    • Manual selection
    • Attribute-based rules
    • Requested by the user
• Content-based authorization
  – Content-based access control policy: ACR = [subject, object, action, $f(s, d_i)$]
  – Dynamic "sign" function: $f(s, d_i) = \left(\max_j \mathrm{SIM}_d(s_j, d_i) > T\right)$
  – Evaluated on the fly
  – Returns true or false based on the content similarity between the base set and the object record
  – Similarity function: $\mathrm{SIM}_d(d_i, d_j) = \sum_{x=1}^{M} w_x \times \mathrm{sim}_{a_x}(d_{i,x}, d_{j,x})$
• Content modeling
  – In $\mathrm{SIM}_d(d_i, d_j) = \sum_{x=1}^{M} w_x \times \mathrm{sim}_{a_x}(d_{i,x}, d_{j,x})$, $\mathrm{sim}_{a_x}$ is the similarity function defined for term or topic $a_x$.
  – Unstructured text attributes (CLOB, Text)
    • Any text modeling approach could be used.
    • We utilized the vector space model (TF/IDF) in Oracle.
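
The sign function f(s, d_i) above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Oracle-based implementation used in the experiments: it assumes a single text attribute (M = 1, w_1 = 1), whitespace tokenization, and cosine similarity over TF-IDF vectors. The function names `tfidf_vectors`, `cosine`, and `cbac_allows` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cbac_allows(base_vectors, record_vector, threshold):
    """f(s, d_i): true when max_j SIM(s_j, d_i) exceeds the threshold T."""
    best = max((cosine(s_j, record_vector) for s_j in base_vectors), default=0.0)
    return best > threshold


# Hypothetical data: one base-set record and two candidate records.
docs = [
    "access control for content centric databases",  # base set record
    "content based access control enforcement",      # shares terms with the base set
    "protein folding simulation on clusters",        # shares no terms
]
base, similar, unrelated = tfidf_vectors(docs)
print(cbac_allows([base], similar, threshold=0.1))
print(cbac_allows([base], unrelated, threshold=0.1))
```

Applied to D1 and D2 from the earlier slide, `cosine` returns 0 because the two titles have no term in common, which is the shortcoming that motivates topic-level annotation.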

Content-Based Access Control Enforcement
• Settings
  – UCI KDD NSF award data: abstracts represent the content-rich information
  – Use MIT's SCIgen to add approximately 20x noise data: 2.7M records
  – Base set: awards PI-ed by the user
  – CBAC enforced with Oracle Virtual Private Database
  – Tables as follows: [figure: table schemas]
• Experiment
  – The database runs on a 64-bit Windows 7 system, with an Intel(R) Core(TM) 2 Duo CPU E8500 @ 3.16GHz and 4.0GB RAM
  – Log in as 60 randomly selected users to issue the following queries via PL/SQL: [figure: query listings]
  – Three different scenarios for access control:
    • (R1) an attribute-based access control (ABAC) rule: the user is allowed to access records in a division where he/she has PI-ed an award
    • (R2) a content-based access control (CBAC) rule: the user is only allowed to access awards whose abstracts are similar to those of the awards in his/her base set
    • (R3) a combined (ABAC+CBAC) rule: R1 AND R2
• Basic On-the-Fly CBAC Threshold Results
  [Figure panels: ABAC Query1; ABAC Query2; CBAC Threshold Query1; CBAC Threshold Query2; ABAC+CBAC Threshold Query1; ABAC+CBAC Threshold Query2]
• Basic On-the-Fly Top-10 CBAC Results; Offline CBAC Results
  [Figure panels]
• Issues with CBAC:
  – Efficiency: content-based similarity assessment is slow
  – Accuracy: the vector space model suffers from lexical ambiguity, especially for short text snippets (e.g. tweet messages)

Content-Based Blocking and Tagging
• Content-based blocking:
  – Pre-partition records into semantically similar clusters
  – Base set s is first compared with the cluster centroids
  – The query is only evaluated against the top x clusters
  [Figures: Before Blocking; After Blocking]
  [Figure panels: CBAC Threshold Query1 with Blocking; ABAC+CBAC Threshold Query1 with Blocking; CBAC Threshold Query2 with Blocking; ABAC+CBAC Threshold Query2 with Blocking; CBAC Top-10 with Blocking]
• Data annotation is performed offline, so its efficiency is not an issue.
• We use:
  – Non-negative Matrix Factorization with 10, 20, 50 and 100 "topics"
  – TAGME: Wikipedia annotation of text
• Tag quality is further ensured by removing noisy tags with a threshold cut-off.
  [Figure panels: TAGME; NMF with 100 topics]
  [Figure panels: CBAC Threshold Query1 with Tagging; ABAC+CBAC Threshold Query1 with Tagging; CBAC Threshold Query2 with Tagging; ABAC+CBAC Threshold Query2 with Tagging; CBAC Top-10 with Tagging; CBAC Top-10 with Blocking + Tagging]
• Soundness of CBAC
  [Figure]

Multi-Label Learning
• Motivation
  – We learned that curated annotation with domain knowledge is accurate and boosts efficiency.
  – This raises a question: if the domain is not covered by the Wikipedia database (e.g. "ylw banana" mapped to food, fruit, snack), where do we get the topic annotation?
  – Multi-label learning can learn from a labeled subset of samples and predict the labels of the remaining samples, which facilitates topic (label) annotation.
  – We will dive into multi-label learning.

Multi-Label Learning
[Figure: three kinds of labels: Multi-Component Labels, Multi-Facet Labels, Hierarchical Labels. Text in figure: Women's Apparel; Tops; Silk Tops; Silk; Short Sleeve; Painter, Sculptor, Architect, Musician, Mathematician, Engineer, Inventor, Anatomist, Geologist, Cartographer, Botanist, Writer]
• Ambiguity of labels
  – Label correlation helps to eliminate the ambiguity of labels.
• Uneven label distribution leads to the imbalance problem.
  [Figure labels: Fruit; Grape]
• Pros & cons of current problem transformations for MLL
  – Binary relevance treats MLL as a set of independent binary classification problems.
    • Pros: simple, easy to parallelize
    • Cons: completely neglects the dependencies among labels
  – Label power-set adds a preprocessing step that constructs power-set labels to serve as the new labels.
    • Pros: encodes the correlation among labels
    • Cons: introduces false correlations; worsens the imbalance problem
• Observations & Assumptions
  – Observations:
    • Error-correction coding methods are able to boost the accuracy of detecting binary strings.
  – Assumptions:
    • Error-correction coding brings in calibration to correct misclassified samples by adding new digits to the end of the binary strings.
    • The added digits should be as far apart as possible for different messages, to remove ambiguity.
    • The added digits should be balanced across the entire set of binary strings, to correct errors due to the imbalance problem.
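
The error-correction idea in the assumptions above can be illustrated with nearest-codeword decoding. This is a generic sketch, not the pseudo-label algorithm from the talk: the codewords below are hand-picked hypothetical examples for K = 2 real labels plus 3 appended digits, and `hamming` and `calibrate` are illustrative names.

```python
def hamming(a, b):
    """Number of positions at which two equal-length binary tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def calibrate(predicted, codewords, n_labels):
    """Snap a predicted binary string to its nearest known codeword
    (ties go to the first codeword listed) and return only the first
    n_labels digits, i.e. the real labels without the appended digits."""
    nearest = min(codewords, key=lambda c: hamming(predicted, c))
    return nearest[:n_labels]


# Hypothetical codewords: 2 real label digits followed by 3 appended digits.
# The appended digits push every pair of codewords at least 3 apart,
# whereas the bare label vectors (1,0) and (1,1) differ in only one digit.
codewords = [
    (1, 0, 1, 1, 0),
    (0, 1, 0, 1, 1),
    (1, 1, 1, 0, 1),
]

# The per-label classifiers wrongly switch on the second label of a (1, 0) sample.
noisy = (1, 1, 1, 1, 0)
print(calibrate(noisy, codewords, n_labels=2))
```

With a minimum pairwise distance of 3, any single flipped digit still decodes to the original codeword, which is why the added digits should be placed as far apart as possible.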

Multi-Label Learning
• Preliminary
  – Training Data: $T = \{X, Y\}$, $X \subset \mathbb{R}^{d}$, $Y \subset \mathbb{B}^{K}$
  – Unique Label Vectors: $U \subset \mathbb{B}^{m \times K}$
  – Occurrence Weight Vector: $w \subset \mathbb{R}^{m}$
  – Pseudo Label Set: $Z \subset \mathbb{B}^{m \times p}$
  – Objective Function: let $b = \mathbf{1}_{1 \times m} \times ((w \times \mathbf{1}_{1 \times p}) \cdot Z)$, then
    $Q = \sum\sum Z Z^{T} + \sum\sum Z^{T} Z + \lambda \cdot b b^{T}$
• Algorithms
  – Generate pseudo labels for the training data
  – Perform the binary relevance transformation
  – Make individual predictions on the different labels and pseudo labels for the testing data
  – Calibrate the predictions with the pseudo labels
• Data Sets
  – Multi-label data sets from diverse domains are selected for the experiments. In general, they are all imbalanced.
• Experiment
  – SVMs with linear and radial basis function (RBF) kernels, and Random Forest, are chosen as our binary relevance classifiers.
  – BPL versions of binary relevance MLL outperform naïve binary relevance methods.
  – BPL outperforms other state-of-the-art methods in macro-averaged F1. [Figure]
  – BPL outperforms other state-of-the-art methods in micro-averaged F1. [Figure]
  – BPL outperforms other state-of-the-art methods in subset accuracy. [Figure]

Discussion and Conclusion
• Computational complexity:
  – CBAC: O(N*m*t). N: number of records; m: size of the base set; t: size of the dictionary (TF/IDF)
  – CBAC with multi-level blocking: O(m*t*log(N*x)). x: number of clusters
• Parallelization of CBAC:
  – The CBAC, blocking and labeling processes could be parallelized.
  – May work with map-reduce.
• Overall, CBAC incurs a reasonable overhead and is scalable.
• Security Analysis
  – CBAC != relaxed security: content-based access control does not imply weakened or relaxed security.
  – Rather, it enforces an additional layer of access control on top of existing "precise" access control methods.
  – Security guarantee: when CBAC is correctly enforced and managed, a malicious user cannot obtain access to sensitive information by manipulating his/her accessible records, creating spoofing records, or gaining (non-base-set) access to similar insensitive information.
• Conclusion
  – CBAC is an access control model that focuses on protecting data content.
  – CBAC makes access control decisions based on the semantic similarity between the requester's credentials and the content of the remaining data in the database.
  – Offline CBAC is efficient for databases that are not updated frequently.
  – With optimized CBAC enforcement, there is only a small overhead compared with queries without CBAC.
  – Multi-label learning is useful when curated database annotation is not available.

Questions