Content-Based Access Control
Wenrong Zeng
wrzeng@ittc.ku.edu
Dept. of Electrical Engineering and Computer Science
Advisor: Dr. Luo
Committee Members:
Dr. Agah
Dr. Grzymala-Busse
Dr. Kulkarni
Dr. Ho
Acknowledgements
I owe my thanks to my committee members:
• Dr. Luo
• Dr. Agah
• Dr. Grzymala-Busse
• Dr. Kulkarni
• Dr. Ho
1
Outline
• Introduction
• The Content-Based Access Control Model
• Content-Based Access Control Enforcement
• Content-Based Blocking and Tagging
• Multi-Label Learning
• Discussion and Conclusion
2
Introduction
• Education Background & Publication
• Motivation & Goal
• Background & Related Work
4
Education Background
• B. S., 2006
– Major: Electrical Engineering, Peking University
• M. E., 2009
– Major: Computer Science, Chinese Academy of Sciences
• PhD student, Present
– Major: Computer Science, University of Kansas
5
Publication
– Wenrong Zeng, Yuhao Yang, Bo Luo: Using Data Content to Assist Access
Control for Large-Scale Content-Centric Databases. In IEEE International
Conference on Big Data (IEEE BigData), 2014 (Acceptance rate: 18.5%)
– Wenrong Zeng, Xuewen Chen, Hong Cheng: Pseudo labels for imbalanced
multi-label learning. DSAA 2014: 25-31.
– Wenrong Zeng, Yuhao Yang, Bo Luo: Access control for big data using data
content. In IEEE International Conference on Big Data (IEEE BigData),
2014: 45-47 (Poster).
– Wenrong Zeng, Xuewen Chen, Hong Cheng and Jing Hua, Multi-Space
Learning for Image Classification Using AdaBoost and Markov Random
Fields, Solving Complex Machine Learning Problems with Ensemble
Methods Workshop, 2013.
– Yi Jia, Wenrong Zeng, Jun Huan: Non-stationary bayesian networks based
on perfect simulation. In ACM Conference on Information and Knowledge
Management, 2012: 1095-1104. (Acceptance rate: 13.4%)
– Weiming Hu, Dan Xie, Zhouyu Fu, Wenrong Zeng, Stephen J. Maybank:
Semantic-Based Surveillance Video Retrieval. IEEE Transactions on Image
Processing 16(4): 1168-1181 (2007).
6
Motivation
• Data tends to be content-centric.
– Healthcare: 500 million patient databases nationwide.
– Telecom: the largest single database, AT&T’s calling records, holds
312 TB.
– Business: by 2020, business transaction data on the internet will reach 450
billion per day. (IDC)
• Nearly every field faces the big data issue: volume, variety, velocity.
http://stablekernel.com/blog/wp-content/uploads/2015/02/Big-Data.jpg
7
Motivation
• With big data, conventional database access control
mechanisms may be insufficient.
• Long term goal: smart access control decisions for big data
without extensive labor of the DBA.
http://blog.varonis.com/big-data-security/
8
Motivation
• Example
– A law enforcement agency (e.g. FBI) holds a database of
highly sensitive case records.
• Large amount of records
• Unstructured content
– A supervisor assigns a case to agent Alice for investigation.
– Naturally, the supervisor also needs to grant Alice access to
all related or similar cases.
9
Motivation
• Manual assignment
– The supervisor manually selects “related cases”.
– Extremely labor intensive, practically impossible
• Multi-level security
– Alice can access all the cases with equal or lower security
levels.
– Over-privileged users!
• Attribute based access control
– E.g. Alice can access all the robbery records within 5 years,
in Area X, in which the suspect is 6-foot tall.
– Attributes require manual input, usually not available.
10
Goals
Smart access control decisions.
• Develop a content-based access control model that
is data-driven.
• Enforce the content-based access control model
efficiently.
• Assumptions:
– Basic privileges: users are authenticated with basic trust
(e.g. with MLS)
– Data-driven: large amounts of content-centric data, access
control model must be data-driven.
– Lack of explicit authorization
– Approximation is allowed
11
Conventional Methods
• Role-Based Access Control:
Bob (subject), an adult (role), can drink wine (object).
• Attribute-Based Access Control:
Bob (subject), who is 24 years old (age attribute), can drink
wine (object).
12
Current Issues
• Difficult to define granular access controls.
• Lack the ability to implement abstract access control
policies (e.g. Similar documents)
• Access control models are NOT content-driven.
“A truly comprehensive approach for data protection
must include mechanisms for enforcing access control
policies based on data contents ….”
E. Bertino et al. Access Control for Databases: Concepts and Systems. Vol. 3, No. 1-2.
Now Publishers Inc, 2011.
13
Text Feature Extraction
– TF-IDF: Term Frequency-Inverse Document Frequency
$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \log \frac{N}{|\{d \in D : t \in d\}|}$$
– Topic Modeling: Non-negative Matrix Factorization Based on
TF-IDF
They are both innately term-distributed features
14
Text Feature Extraction
– Where Term-distributed Features Fall Short!
D1: privacy preserving similarity assessment
for semi-structured data
D2: private XML document matching
According to TF-IDF, the cosine similarity of D1 and D2 is 0.
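The zero similarity can be checked with a minimal sketch. It assumes whitespace tokenization, lowercasing, and a corpus holding only D1 and D2; the helper names `tfidf_vectors` and `cosine` are illustrative and not taken from the slides.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

D1 = "privacy preserving similarity assessment for semi-structured data"
D2 = "private XML document matching"

v1, v2 = tfidf_vectors([D1, D2])
similarity = cosine(v1, v2)  # no shared term, so the dot product is zero
print(similarity)
```

D1 and D2 share no surface term ("privacy" and "private" are different tokens), so every product in the dot product vanishes even though the two titles are about closely related topics.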
15
Text Feature Extraction
– TAGME: Topic Modeling with Wikipedia Curated Annotation
Doc No.  Word(s)               Topic Annotation            Weight
D1       privacy               Privacy                     0.1279
D1       preserving            Historic preservation       0.0017
D1       similarity            Homology (biology)          0.0354
D1       assessment            Homology (biology)          0.0521
D1       for                   (none)
D1       semi-structured data  Semi-structured data        0.2727
D2       private               Privacy                     0.1256
D2       XML                   XML                         0.5375
D2       document              Document management system  0.1475
D2       matching              Matching principle          0.0509
16
Outline
• Introduction
• The Content-Based Access Control Model
• Content-Based Access Control Enforcement
• Content-Based Blocking and Tagging
• Multi-Label Learning
• Discussion and Conclusion
17
The Content-Based
Access Control Model
• Two-phase model
– Initial authorization
– Content-based authorization
• Initial authorization
– Conventional access control policy
ACR = [subject, object, action, sign]
– Each CBAC user is explicitly granted access to a small set of
records: the seed set.
• Manual selection
• Attribute-based rules
• Requested by the user
18
The Content-Based
Access Control Model
• Content-based authorization
– Content-based access control policy
ACR = [subject, object, action, f(s, d_i)]
– Dynamic “sign” function
$$f(s, d_i) = \left( \max_j \mathrm{SIM}_d(s_j, d_i) > T \right)$$
– To be evaluated on-the-fly
– {true, false} based on content similarity between the base set
and the object record
– Similarity function
$$\mathrm{SIM}_d(d_i, d_j) = \sum_{x=1}^{M} w_x \times \mathrm{sim}_{a_x}(d_{i,x}, d_{j,x})$$
19
The Content-Based
Access Control Model
• Content modeling
– In $\mathrm{SIM}_d(d_i, d_j) = \sum_{x=1}^{M} w_x \times \mathrm{sim}_{a_x}(d_{i,x}, d_{j,x})$,
$\mathrm{sim}_{a_x}$ is the similarity function defined for term or topic $a_x$
– Unstructured text attributes (CLOB, Text)
• Any text modeling approach could be used
• We utilized the vector space model (TF/IDF) in Oracle.
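The content-based sign function can be illustrated with a short sketch. Everything below is an assumption made for illustration: records are tuples of M text attributes, the per-attribute similarity is a toy token-overlap (Jaccard) measure standing in for the TF/IDF scoring done in Oracle, and the names `sim_d`, `authorize`, `WEIGHTS` and `T` are hypothetical.

```python
def jaccard(a, b):
    """Toy per-attribute similarity on whitespace tokens (stand-in for TF/IDF)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0

def sim_d(d_i, d_j, weights, sims):
    """Weighted sum over the M attributes: sum_x w_x * sim_ax(d_i[x], d_j[x])."""
    return sum(w * sim(a, b) for w, sim, a, b in zip(weights, sims, d_i, d_j))

def authorize(seed_set, record, weights, sims, threshold):
    """Sign function f(s, d_i): True if the best-matching seed record exceeds T."""
    best = max((sim_d(s_j, record, weights, sims) for s_j in seed_set), default=0.0)
    return best > threshold

# Each record has M = 2 attributes: (title, abstract). Values are made up.
WEIGHTS = [0.5, 0.5]
SIMS = [jaccard, jaccard]
T = 0.5

seed_set = [("robbery case downtown", "armed robbery at a bank downtown")]
related = ("bank robbery", "armed robbery at a downtown bank")
unrelated = ("tax fraud", "false filings by a shell company")

print(authorize(seed_set, related, WEIGHTS, SIMS, T))    # similar content
print(authorize(seed_set, unrelated, WEIGHTS, SIMS, T))  # dissimilar content
```

The decision is computed at request time from the seed set and the requested record only, so no per-record authorization has to be stored in advance.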
20
Outline
• Introduction
• The Content-Based Access Control Model
• Content-Based Access Control Enforcement
• Content-Based Blocking and Tagging
• Multi-Label Learning
• Discussion and Conclusion
21
Content-Based
Access Control Enforcement
• Settings
– UCI KDD NSF award data: abstracts represent the content-rich
information
– Use MIT’s SCIgen to add approximately 20x noise data:
2.7M records
– Base set: awards PI-ed by the user
– CBAC enforced with Oracle Virtual Private Database
– Tables as follows
22
Content-Based
Access Control Enforcement
• Experiment
– The database runs on a 64-bit Windows 7 system, with an Intel(R)
Core(TM) 2 Duo CPU E8500 @ 3.16GHz and 4.0GB RAM
– Log in as 60 randomly selected users to issue the following
queries via PL/SQL:
23
Content-Based
Access Control Enforcement
• Experiment
– Three different scenarios for access control:
• (R1) an attribute-based access control (ABAC) rule: the user is allowed
to access records in a division where he/she has PI-ed an award
• (R2) a content-based access control (CBAC) rule: the user is only
allowed to access awards whose abstracts are similar to those of the awards in
his/her base set
• (R3) a combined (ABAC+CBAC) rule: R1 AND R2.
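The three rules can be sketched as row-level predicates. This is an illustration only: the `Award` fields, the sample rows, the token-overlap similarity and the 0.3 threshold are all hypothetical, and the actual experiments enforce the rules through Oracle Virtual Private Database rather than Python.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Award:
    award_id: int
    pi: str
    division: str
    abstract: str

def token_overlap(a, b):
    """Toy abstract similarity (Jaccard on tokens), a stand-in for TF/IDF scoring."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0

def r1_abac(user, award, awards):
    """R1: the award is in a division where the user has PI-ed an award."""
    return award.division in {a.division for a in awards if a.pi == user}

def r2_cbac(user, award, awards, threshold=0.3):
    """R2: the abstract is similar to an abstract in the user's base set."""
    base_set = [a for a in awards if a.pi == user]
    return any(token_overlap(b.abstract, award.abstract) > threshold for b in base_set)

def r3_combined(user, award, awards):
    """R3: R1 AND R2."""
    return r1_abac(user, award, awards) and r2_cbac(user, award, awards)

AWARDS = [
    Award(1, "alice", "CISE", "secure access control for large databases"),
    Award(2, "bob", "CISE", "access control policies for medical databases"),
    Award(3, "carol", "BIO", "protein folding in yeast cells"),
    Award(4, "dave", "BIO", "content based access control for databases"),
    Award(5, "erin", "CISE", "compiler optimization for embedded processors"),
]

def visible(user, rule):
    """Award ids the user may read under the given rule."""
    return [a.award_id for a in AWARDS if rule(user, a, AWARDS)]

print(visible("alice", r1_abac))      # same division
print(visible("alice", r2_cbac))      # similar abstracts
print(visible("alice", r3_combined))  # both conditions
```

In this toy data R1 admits a same-division award on an unrelated topic, R2 admits a similar award from another division, and R3 keeps only the awards that satisfy both.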
24
Content-Based
Access Control Enforcement
• Basic On-the-Fly CBAC Threshold Results
[Figure panels: ABAC Query1; ABAC Query2; CBAC Threshold Query1; CBAC Threshold Query2]
25
[Figure panels: ABAC+CBAC Threshold Query1; ABAC+CBAC Threshold Query2]
Content-Based
Access Control Enforcement
• Basic On-the-Fly Top-10 CBAC Results
Offline CBAC Results
26
Content-Based
Access Control Enforcement
• Issues with CBAC:
– Efficiency: content-based similarity assessment is slow
– Accuracy: vector space model suffers from lexical ambiguity,
especially for short text snippets (e.g. tweet messages)
27
Outline
• Introduction
• The Content-Based Access Control Model
• Content-Based Access Control Enforcement
• Content-Based Blocking and Tagging
• Multi-Label Learning
• Discussion and Conclusion
• Your Input
28
Content-Based
Blocking and Tagging
• Content-based Blocking:
– Pre-partition records into semantically similar clusters
– Base set s is first compared with class centroids
– Query is only evaluated against top x clusters
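The blocking steps above can be sketched as follows. The sketch assumes records are small dense vectors, clusters have already been built offline, and cosine similarity is the content measure; the function names, the toy vectors and the parameter values are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def blocked_cbac(base_set, clusters, top_x, threshold):
    """Evaluate the CBAC threshold only inside the top_x closest clusters.

    clusters: list of {record_id: vector} dicts produced by offline partitioning.
    base_set: non-empty list of vectors for the user's base set.
    """
    # Step 1: compare the base set with each cluster centroid.
    # (Centroids would normally be precomputed offline; recomputed here for brevity.)
    scored = []
    for index, members in enumerate(clusters):
        center = centroid(list(members.values()))
        scored.append((max(cosine(s, center) for s in base_set), index))
    scored.sort(reverse=True)

    # Step 2: evaluate the similarity threshold only within the top_x clusters.
    granted = set()
    for _, index in scored[:top_x]:
        for record_id, vector in clusters[index].items():
            if max(cosine(s, vector) for s in base_set) > threshold:
                granted.add(record_id)
    return granted

clusters = [
    {"a": [1.0, 0.0], "b": [0.9, 0.1]},
    {"c": [0.0, 1.0], "d": [0.1, 0.9]},
    {"e": [0.7, 0.7]},
]
base_set = [[1.0, 0.05]]

print(blocked_cbac(base_set, clusters, top_x=1, threshold=0.7))
print(blocked_cbac(base_set, clusters, top_x=3, threshold=0.7))
```

With `top_x=1` the record `e` is never examined although it would pass the threshold, which shows the trade-off: blocking skips most similarity computations at the cost of an approximate answer, consistent with the assumption that approximation is allowed.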
Before Blocking
29
Content-Based
Blocking and Tagging
After Blocking
30
Content-Based
Blocking and Tagging
[Figure panels: CBAC Threshold Query1 with Blocking; ABAC+CBAC Threshold Query1 with Blocking; CBAC Threshold Query2 with Blocking; ABAC+CBAC Threshold Query2 with Blocking]
31
CBAC Top-10 with Blocking
Content-Based
Blocking and Tagging
• Data annotation is performed off-line. Efficiency is not
an issue
• We use:
– Non-negative Matrix Factorization with 10, 20, 50 and 100
“topics.”
– TAGME: Wikipedia annotation to text
32
Content-Based
Blocking and Tagging
• Tag quality is further improved by removing noisy
tags with a threshold cut-off.
[Figure panels: TAGME; NMF with 100 topics]
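A threshold cut-off on tag weights can be sketched with the TAGME weights listed for D1 and D2 in the Text Feature Extraction slides. The cut-off value 0.1 and the helper name `filter_tags` are assumptions chosen for illustration; the slides do not state the threshold that was used.

```python
# (topic annotation, weight) pairs as listed in the TAGME example for D1 and D2.
D1_TAGS = [
    ("Privacy", 0.1279),
    ("Historic preservation", 0.0017),
    ("Homology (biology)", 0.0354),
    ("Homology (biology)", 0.0521),
    ("Semi-structured data", 0.2727),
]
D2_TAGS = [
    ("Privacy", 0.1256),
    ("XML", 0.5375),
    ("Document management system", 0.1475),
    ("Matching principle", 0.0509),
]

def filter_tags(tags, cutoff):
    """Keep tags whose weight reaches the cut-off; keep the best weight per topic."""
    kept = {}
    for topic, weight in tags:
        if weight >= cutoff:
            kept[topic] = max(weight, kept.get(topic, 0.0))
    return kept

CUTOFF = 0.1  # hypothetical value

t1 = filter_tags(D1_TAGS, CUTOFF)
t2 = filter_tags(D2_TAGS, CUTOFF)
shared = set(t1) & set(t2)
tag_jaccard = len(shared) / len(set(t1) | set(t2))

print(sorted(t1))
print(sorted(t2))
print(shared, tag_jaccard)
```

The cut-off drops the spurious annotations ("Historic preservation", "Homology (biology)", "Matching principle"), and the two documents that had zero TF-IDF similarity now share the tag "Privacy".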
33
Content-Based
Blocking and Tagging
[Figure panels: CBAC Threshold Query1 with Tagging; ABAC+CBAC Threshold Query1 with Tagging; CBAC Threshold Query2 with Tagging; ABAC+CBAC Threshold Query2 with Tagging]
34
CBAC Top-10 with Tagging
Content-Based
Blocking and Tagging
CBAC Top-10 with Blocking + Tagging
35
Content-Based
Blocking and Tagging
• Soundness of CBAC
36
Outline
• Introduction
• The Content-Based Access Control Model
• Content-Based Access Control Enforcement
• Content-Based Blocking and Tagging
• Multi-Label Learning
• Discussion and Conclusion
37
Multi-Label Learning
• Motivation
– We learned that curated annotation backed by domain knowledge
is accurate and boosts efficiency.
– The question arises: if the domain is not covered by the
Wikipedia database (e.g. "ylw banana" mapped to food, fruit,
snack), where do we get the topic annotation?
– Multi-label learning can learn from a labeled subset of
samples and predict the labels of the remaining samples, which
facilitates topic (label) annotation.
– We will dive into multi-label learning.
38
Multi-Label Learning
• Label types: Multi-Component Labels, Multi-Facet Labels, Hierarchical Labels
• Example labels shown in the figures:
– Women’s Apparel, Tops, Silk Tops, Silk Short Sleeve
– Painter, Sculptor, Architect, Musician, Mathematician, Engineer,
Inventor, Anatomist, Geologist, Cartographer, Botanist, Writer
39
Multi-Label Learning
• Ambiguity of Labels
40
Multi-Label Learning
• Ambiguity of Labels
• Label correlation helps to eliminate the ambiguity of labels
41
Multi-Label Learning
• Uneven Label Distribution leads to Imbalance Problem
[Figure labels: Fruit; Grape]
Multi-Label Learning
• Pros & cons of current problem transformations for MLL
– Binary relevance treats MLL as a set of independent binary
classification problems, one per label
• Pros: Simple, easy to parallelize
• Cons: Ignores the dependencies among labels
– Label power-set adds a preprocessing step that turns each distinct
label combination into a new single label
• Pros: Encodes the correlation among labels
• Cons: Introduces false correlations; worsens the imbalance problem
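The two transformations can be compared on a toy label matrix. The label names and the six label vectors below are made up for illustration; only the transformations themselves follow the descriptions above.

```python
from collections import Counter

LABELS = ["fruit", "grape", "snack"]

# One binary label vector per sample (toy data).
Y = [
    (1, 0, 0),
    (1, 0, 1),
    (1, 1, 0),
    (1, 0, 0),
    (0, 0, 1),
    (1, 0, 0),
]

def binary_relevance(label_vectors):
    """One independent binary target column per label."""
    return [list(column) for column in zip(*label_vectors)]

def label_powerset(label_vectors):
    """Map every distinct label vector to one class id of a multi-class problem."""
    class_ids = {}
    targets = [class_ids.setdefault(y, len(class_ids)) for y in label_vectors]
    return targets, class_ids

br_targets = binary_relevance(Y)
lp_targets, lp_classes = label_powerset(Y)
lp_counts = Counter(lp_targets)

for name, column in zip(LABELS, br_targets):
    print(name, column)   # each column is trained on its own
print(lp_targets)         # one class id per sample
print(lp_counts)          # samples per power-set class
```

Binary relevance trains the rare "grape" column without knowing that it only occurs together with "fruit". The power-set view keeps that co-occurrence, but three of its four classes have a single training sample, which is the worsened imbalance noted above.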
43
Multi-Label Learning
• Observations & Assumptions
– Observations:
• Error-correction coding methods are able to boost the accuracy of
detecting binary strings
– Assumptions:
• Error-correction coding brings in calibration that corrects misclassified
samples by appending new digits to the end of the binary strings.
• The appended digits should place different messages as far apart as
possible to remove the ambiguity.
• The appended digits should be balanced across the entire set of binary
strings to correct the errors caused by the imbalance problem.
44
Multi-Label Learning
• Preliminary
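– Training Data: $T = \{X, Y\},\ X \subset \mathbb{R}^d,\ Y \subset \mathbb{B}^K$
– Unique Label Vectors: $U \in \mathbb{B}^{m \times K}$
– Occurrence Weight Vector: $w \in \mathbb{R}^m$
– Pseudo Label Set: $Z \in \mathbb{B}^{m \times p}$
– Objective Function:
Let $b = \mathbf{1}_{1 \times m} \times \left( (w \times \mathbf{1}_{1 \times p}) \odot Z \right)$
$$Q = \sum\sum Z Z^T + \sum\sum Z^T Z + \lambda \cdot b b^T$$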
45
Multi-Label Learning
• Algorithms
– Generate pseudo labels for training data
– Perform binary relevance transformation
– Make individual prediction on different labels and pseudo
labels for testing data
– Calibrate the prediction with pseudo labels
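The slides do not spell out the calibration rule. One plausible reading, in the spirit of the error-correction coding observation, is nearest-codeword decoding: each unique label vector is extended with its pseudo labels to form a codeword, and the concatenated per-label predictions are snapped to the closest codeword by Hamming distance. The sketch below follows that assumption; the codebook values and the function names are hypothetical and are not taken from the BPL paper.

```python
def hamming(a, b):
    """Number of positions where two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def calibrate(predicted_bits, codebook, n_labels):
    """Snap a raw prediction to the nearest codeword and return its true-label part."""
    nearest = min(codebook, key=lambda code: hamming(code, predicted_bits))
    return nearest[:n_labels]

# K = 3 original labels followed by p = 2 pseudo labels (made-up codebook).
# Without the pseudo labels, (1, 0, 0) and (1, 1, 0) differ in a single bit.
N_LABELS = 3
CODEBOOK = [
    (1, 0, 0, 0, 0),
    (1, 1, 0, 1, 1),
    (0, 0, 1, 1, 0),
]

# Binary relevance predicts each of the 5 bits independently;
# here the second label is wrongly predicted as 1.
raw_prediction = (1, 1, 0, 0, 0)

print(calibrate(raw_prediction, CODEBOOK, N_LABELS))
```

Because the pseudo labels push the codewords at least three bits apart in this toy codebook, a single wrong bit from the individual classifiers can still be mapped back to the intended label vector.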
46
Multi-Label Learning
• Data Sets
Multi-label data sets from diverse domains are used in the experiments.
In general, they are all imbalanced.
47
Multi-Label Learning
• Experiment
– SVMs with linear and radial basis function (RBF) kernels and Random
Forest are chosen as our binary relevance classifiers.
– BPL versions of binary relevance MLL outperform naïve
binary relevance methods.
48
Multi-Label Learning
BPL outperforms other state-of-the-art methods in macro-averaged F1
49
Multi-Label Learning
BPL outperforms other state-of-the-art methods in micro-averaged F1
50
Multi-Label Learning
BPL outperforms other state-of-the-art methods in subset accuracy
51
Outline
• Introduction
• The Content-Based Access Control Model
• Content-Based Access Control Enforcement
• Content-Based Blocking and Tagging
• Multi-Label Learning
• Discussion and Conclusion
52
Discussion and Conclusion
• Computational complexity:
• CBAC: O(N*m*t). N: number of records; m: size of base
set; t: size of the dictionary (TF/IDF)
• CBAC with multi-level blocking: O(m*t*log(N*x)). x:
number of clusters
53
Discussion and Conclusion
• Parallelization of CBAC:
– CBAC, blocking and labeling processes could be
parallelized.
– May work with map-reduce.
• Overall, CBAC requires a reasonable overhead. It is
scalable.
54
Discussion and Conclusion
• Security Analysis
– CBAC != Relaxed security: content-based access control
does not imply weakened or relaxed security.
– Rather, it enforces an additional layer of access control on
top of existing “precise” access control methods.
Security Guarantee:
When CBAC is correctly enforced and managed, a
malicious user cannot obtain access to sensitive
information by manipulating his/her accessible records,
creating spoofing records, or gaining (non-base-set)
access to similar insensitive information.
55
Discussion and Conclusion
• Conclusion
– CBAC is an access control model that focuses on protecting
data content.
– CBAC makes access control decisions based on the semantic
similarity between the requester’s credentials and the content of
the rest of the data in the database.
– Offline CBAC is efficient for databases that are not updated
frequently.
– With optimized CBAC enforcement, there is little
overhead compared to queries without CBAC.
– Multi-label learning is useful when curated database
annotation is not available.
56
Questions
57