Content-Based Access Control

Wenrong Zeng
wrzeng@ittc.ku.edu
Dept. of Electrical Engineering and Computer Science
Advisor: Dr. Luo
Committee Members:
Dr. Agah
Dr. Grzymala-Busse
Dr. Kulkarni
Dr. Ho
Acknowledgements
I owe my thanks to my committee members:
• Dr. Luo
• Dr. Agah
• Dr. Grzymala-Busse
• Dr. Kulkarni
• Dr. Ho
1
Outline
• Introduction
• The Content-Based Access Control Model
• Content-Based Access Control Enforcement
• Content-Based Blocking and Tagging
• Multi-Label Learning
• Discussion and Conclusion
2
Introduction
• Education Background & Publication
• Motivation & Goal
• Background & Related Work
4
Education Background
• B. S.: Peking University, 2006
– Major: Electrical Engineering
• M. E.: Chinese Academy of Sciences, 2009
– Major: Computer Science
• PhD student: University of Kansas, Present
– Major: Computer Science
5
Publication
– Wenrong Zeng, Yuhao Yang, Bo Luo: Using Data Content to Assist Access
Control for Large-Scale Content-Centric Databases. In IEEE International
Conference on Big Data (IEEE BigData), 2014 (Acceptance rate: 18.5%)
– Wenrong Zeng, Xuewen Chen, Hong Cheng: Pseudo labels for imbalanced
multi-label learning. DSAA 2014: 25-31.
– Wenrong Zeng, Yuhao Yang, Bo Luo: Access control for big data using data
content. In IEEE International Conference on Big Data (IEEE BigData),
2014: 45-47 (Poster).
– Wenrong Zeng, Xuewen Chen, Hong Cheng and Jing Hua: Multi-Space
Learning for Image Classification Using AdaBoost and Markov Random
Fields. Solving Complex Machine Learning Problems with Ensemble
Methods Workshop, 2013.
– Yi Jia, Wenrong Zeng, Jun Huan: Non-stationary bayesian networks based
on perfect simulation. In ACM Conference on Information and Knowledge
Management, 2012: 1095-1104. (Acceptance rate: 13.4%)
– Weiming Hu, Dan Xie, Zhouyu Fu, Wenrong Zeng, Stephen J. Maybank:
Semantic-Based Surveillance Video Retrieval. IEEE Transactions on Image
Processing 16(4): 1168-1181 (2007).
6
Motivation
• Data tends to be content-centric.
– Healthcare: 500 million patient databases nationwide.
– Telecom: the largest single database, at 312 TB, holds AT&T’s calling
records.
– Business: by 2020, business transactions on the Internet will reach 450
billion per day. (IDC)
• Nearly every field faces big data issues: Volume, Variety, Velocity
http://stablekernel.com/blog/wp-content/uploads/2015/02/Big-Data.jpg
7
Motivation
• With big data, conventional database access control
mechanisms may be insufficient.
• Long term goal: smart access control decisions for big data
without extensive labor of the DBA.
http://blog.varonis.com/big-data-security/
8
Motivation
• Example
– A law enforcement agency (e.g. FBI) holds a database of
highly sensitive case records.
• Large amount of records
• Unstructured content
– A supervisor assigns a case to agent Alice for investigation.
– Naturally, the supervisor also needs to grant Alice access to
all related or similar cases.
9
Motivation
• Manual assignment
– The supervisor manually selects “related cases”.
– Extremely labor intensive, practically impossible
• Multi-level security
– Alice can access all the cases with equal or lower security
levels.
– Over-privileged users!
• Attribute based access control
– E.g. Alice can access all the robbery records within 5 years,
in Area X, in which the suspect is 6-foot tall.
– Attributes require manual input, usually not available.
10
Goals
Smart access control decisions.
• Develop content-based access control model, which
is data-driven.
• Enforce content-based access control model
efficiently.
• Assumptions:
– Basic privileges: users are authenticated with basic trust
(e.g. with MLS)
– Data-driven: large amounts of content-centric data, access
control model must be data-driven.
– Lack of explicit authorization
– Approximation is allowed
11
Conventional Methods
• Role-Based Access Control:
Bob (subject), an adult (role), can drink wine (object).
• Attribute-Based Access Control:
Bob (subject), who is 24 years old (age attribute), can drink wine (object).
12
Current Issues
• Difficult to define granular access controls.
• Lack the ability to implement abstract access control
policies (e.g. Similar documents)
• Access control models are NOT content-driven.
“A truly comprehensive approach for data protection
must include mechanisms for enforcing access control
policies based on data contents ….”
E. Bertino, et al. Access Control for Databases: Concepts and Systems. Vol. 3. No. 1-2.
Now Publishers Inc, 2011.
13
Text Feature Extraction
– TF-IDF: Term Frequency Inverse Document Frequency
$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \log \frac{N}{|\{d \in D : t \in d\}|}$$
– Topic Modeling: Non-negative Matrix Factorization Based on TF-IDF
They are both innately term-distributed features
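The TF-IDF weight above can be computed directly from token counts. The following is a minimal sketch, assuming whitespace tokenization and raw term counts for tf; the three-document corpus is hypothetical and only illustrates the formula.

```python
import math
from collections import Counter

def tfidf(term, doc, corpus):
    """tf(t, d) * log(N / |{d in D : t in d}|); documents are lists of tokens."""
    tf = Counter(doc)[term]                    # raw term frequency in this document
    df = sum(1 for d in corpus if term in d)   # number of documents containing the term
    if df == 0:
        return 0.0                             # term never occurs: define the weight as 0
    return tf * math.log(len(corpus) / df)

# Hypothetical corpus (not from the slides), tokenized by whitespace.
corpus = [
    "access control for databases".split(),
    "content based access control".split(),
    "multi label learning".split(),
]
print(tfidf("content", corpus[1], corpus))  # term in 1 of 3 documents: log(3)
print(tfidf("access", corpus[1], corpus))   # term in 2 of 3 documents: log(1.5)
```

A term that occurs in fewer documents receives a higher weight, which is why the feature depends entirely on the term distribution.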
14
Text Feature Extraction
– Where Term-distributed Features Fall Short!
D1: privacy preserving similarity assessment
for semi-structured data
D2: private XML document matching
According to TF-IDF, the cosine similarity of D1 and D2 is 0.
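The zero similarity of D1 and D2 can be checked with a small sketch. It assumes lowercase whitespace tokens and uses raw term counts instead of TF-IDF weights; since the two documents share no term, any per-term weighting gives the same result.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

d1 = Counter("privacy preserving similarity assessment for semi-structured data".split())
d2 = Counter("private xml document matching".split())

# "privacy" and "private" are different terms, so the vectors are orthogonal.
print(cosine(d1, d2))  # 0.0
```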
15
Text Feature Extraction
– TAGME: Topic Modeling with Wikipedia Curated Annotation
Doc No. | Word(s)              | Topic Annotation           | Weight
D1      | privacy              | Privacy                    | 0.1279
D1      | preserving           | Historic preservation      | 0.0017
D1      | similarity           | Homology (biology)         | 0.0354
D1      | assessment           | Homology (biology)         | 0.0521
D1      | for                  |                            |
D1      | semi-structured data | Semi-structured data       | 0.2727
D2      | private              | Privacy                    | 0.1256
D2      | XML                  | XML                        | 0.5375
D2      | document             | Document management system | 0.1475
D2      | matching             | Matching principle         | 0.0509
16
Outline
• Introduction
• The Content-Based Access Control Model
• Content-Based Access Control Enforcement
• Content-Based Blocking and Tagging
• Multi-Label Learning
• Discussion and Conclusion
17
The Content-Based
Access Control Model
• Two-phase model
– Initial authorization
– Content-based authorization
• Initial authorization
– Conventional access control policy
ACR = [subject, object, action, sign]
– Each CBAC user is explicitly granted access to a small set of
records: the seed set.
• Manual selection
• Attribute-based rules
• Requested by the user
18
The Content-Based
Access Control Model
• Content-based authorization
– Content-based access control policy
ACR =[subject, object, action, f (s, di )]
– Dynamic “sign” function
$$f(s, d_i) = \left(\max_j \mathrm{SIM}_d(s_j, d_i) > T\right)$$
– To be evaluated on-the-fly
– {true, false} based on content similarity between the base set
and the object record
– Similarity function
$$\mathrm{SIM}_d(d_i, d_j) = \sum_{x=1}^{M} w_x \times \mathrm{sim}_{a_x}(d_{i,x}, d_{j,x})$$
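A minimal sketch of the dynamic sign function and the weighted similarity follows. Records are modeled as tuples of M text attributes; the Jaccard token overlap, the weights and the sample records are illustrative assumptions, not the similarity functions used in the experiments.

```python
def jaccard(a, b):
    """Hypothetical per-attribute similarity: overlap of token sets."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def sim_d(di, dj, weights, sims):
    """SIM_d(d_i, d_j): weighted sum of per-attribute similarities."""
    return sum(w * sim(di[x], dj[x]) for x, (w, sim) in enumerate(zip(weights, sims)))

def f(seed_set, di, threshold, weights, sims):
    """Dynamic sign: True if the best match in the seed set exceeds T."""
    best = max((sim_d(sj, di, weights, sims) for sj in seed_set), default=0.0)
    return best > threshold

# Two attributes (title, body) with assumed weights 0.3 and 0.7.
weights, sims = [0.3, 0.7], [jaccard, jaccard]
seed_set = [("bank robbery", "armed suspect fled in a van")]
similar = ("bank robbery", "armed suspect fled on foot")
unrelated = ("tax fraud", "offshore accounts audit")

print(f(seed_set, similar, 0.5, weights, sims))    # True  (SIM_d = 0.5625)
print(f(seed_set, unrelated, 0.5, weights, sims))  # False (SIM_d = 0.0)
```

Because the decision is a function of the seed set and the requested record, it can be evaluated at query time without storing an explicit rule per record.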
19
The Content-Based
Access Control Model
• Content modeling
– In $\mathrm{SIM}_d(d_i, d_j) = \sum_{x=1}^{M} w_x \times \mathrm{sim}_{a_x}(d_{i,x}, d_{j,x})$,
$\mathrm{sim}_{a_x}$ is the similarity function defined for term or topic $a_x$
– Unstructured text attributes (CLOB, Text)
• Any text modeling approach could be used
• We utilized the vector space model (TF/IDF) in Oracle.
20
Outline
• Introduction
• The Content-Based Access Control Model
• Content-Based Access Control Enforcement
• Content-Based Blocking and Tagging
• Multi-Label Learning
• Discussion and Conclusion
21
Content-Based
Access Control Enforcement
• Settings
– UCI KDD NSF award data: abstracts represent the content-rich information
– Use MIT’s SCIgen to add approximately 20x noise data: 2.7M records
– Base set: awards PI-ed by the user
– CBAC enforced with Oracle Virtual Private Database
– Tables as follows
22
Content-Based
Access Control Enforcement
• Experiment
– The database runs on a 64-bit Windows 7 system, with an Intel(R)
Core(TM) 2 Duo CPU E8500 @ 3.16GHz and 4.0GB RAM
– Log in as 60 randomly selected users to issue the following
queries via PL/SQL:
23
Content-Based
Access Control Enforcement
• Experiment
– Three different scenarios for access control:
• (R1) an attribute-based access control (ABAC) rule: the user is allowed
to access records in a division where he/she has PI-ed an award
• (R2) a content-based access control (CBAC) rule: the user is only
allowed to access awards whose abstracts are similar to those of the awards in
his/her base set
• (R3) a combined (ABAC+CBAC) rule: R1 AND R2.
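The three rules can be read as predicates over a user and an award. This is a sketch only: the dictionary fields, the Jaccard similarity and the 0.5 threshold are hypothetical, and the experiments enforce the rules inside Oracle rather than in application code.

```python
def jaccard(a, b):
    """Hypothetical abstract similarity: overlap of token sets."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def r1_abac(user, award):
    """R1: the award is in a division where the user has PI-ed an award."""
    return award["division"] in user["pi_divisions"]

def r2_cbac(user, award, threshold=0.5):
    """R2: the abstract is similar to at least one abstract in the base set."""
    return any(jaccard(seed, award["abstract"]) > threshold for seed in user["base_set"])

def r3_combined(user, award, threshold=0.5):
    """R3: R1 AND R2."""
    return r1_abac(user, award) and r2_cbac(user, award, threshold)

user = {"pi_divisions": {"IIS"}, "base_set": ["secure access control for databases"]}
same_division_similar = {"division": "IIS", "abstract": "access control for large databases"}
other_division_similar = {"division": "CHE", "abstract": "access control for large databases"}
same_division_unrelated = {"division": "IIS", "abstract": "protein folding simulation"}

for award in (same_division_similar, other_division_similar, same_division_unrelated):
    print(r1_abac(user, award), r2_cbac(user, award), r3_combined(user, award))
```

R3 is the most restrictive of the three, since a record must pass both the attribute check and the content check.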
24
Content-Based
Access Control Enforcement
• Basic On-the-Fly CBAC Threshold Results
[Figure panels: ABAC Query1; ABAC Query2; CBAC Threshold Query1; CBAC Threshold Query2; ABAC+CBAC Threshold Query1; ABAC+CBAC Threshold Query2]
25
Content-Based
Access Control Enforcement
• Basic On-the-Fly Top-10 CBAC Results
Offline CBAC Results
26
Content-Based
Access Control Enforcement
• Issues with CBAC:
– Efficiency: content-based similarity assessment is slow
– Accuracy: vector space model suffers from lexical ambiguity,
especially for short text snippets (e.g. tweet messages)
27
Outline
• Introduction
• The Content-Based Access Control Model
• Content-Based Access Control Enforcement
• Content-Based Blocking and Tagging
• Multi-Label Learning
• Discussion and Conclusion
• Your Input
28
Content-Based
Blocking and Tagging
• Content-based Blocking:
– Pre-partition records into semantically similar clusters
– Base set s is first compared with cluster centroids
– Query is only evaluated against top x clusters
Before Blocking
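The blocking step can be sketched as follows, assuming the clusters and their centroids have already been computed offline. How the base set is compared with a centroid is not specified on the slide; this sketch takes the best cosine match over the base-set records, and the cluster contents are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def blocked_candidates(base_set, clusters, top_x):
    """Return the record ids of the top_x clusters closest to the base set."""
    def score(cluster):
        # Assumption: a cluster is as close as its best-matching base-set record.
        return max((cosine(s, cluster["centroid"]) for s in base_set), default=0.0)
    ranked = sorted(clusters, key=score, reverse=True)
    return [rid for cluster in ranked[:top_x] for rid in cluster["records"]]

clusters = [
    {"centroid": {"access": 1.0, "control": 1.0, "security": 1.0}, "records": [1, 2]},
    {"centroid": {"protein": 1.0, "cell": 1.0}, "records": [3, 4]},
    {"centroid": {"network": 1.0, "routing": 1.0, "control": 0.5}, "records": [5]},
]
base_set = [{"access": 1.0, "control": 1.0}]

print(blocked_candidates(base_set, clusters, top_x=1))  # [1, 2]
print(blocked_candidates(base_set, clusters, top_x=2))  # [1, 2, 5]
```

Only the returned records need the full similarity evaluation, which is where the speed-up comes from; records in skipped clusters are never considered, so blocking trades some recall for efficiency.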
29
Content-Based
Blocking and Tagging
After Blocking
30
Content-Based
Blocking and Tagging
[Figure panels: CBAC Threshold Query1 with Blocking; ABAC+CBAC Threshold Query1 with Blocking; CBAC Threshold Query2 with Blocking; ABAC+CBAC Threshold Query2 with Blocking]
31
CBAC Top-10 with Blocking
Content-Based
Blocking and Tagging
• Data annotation is performed offline, so its efficiency is not
an issue.
• We use:
– Non-negative Matrix Factorization with 10, 20, 50 and 100
“topics.”
– TAGME: Wikipedia-based annotation of text
32
Content-Based
Blocking and Tagging
• Tag quality is further improved by removing noisy
tags with a threshold cut-off.
TAGME
NMF with 100 topics
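The threshold cut-off can be illustrated with the TAGME weights listed earlier for D1. The cut-off value 0.1 is an assumption chosen for the example; the slides do not state the value used.

```python
def prune_tags(tags, cutoff):
    """Keep only the (topic, weight) pairs whose weight reaches the cut-off."""
    return [(topic, weight) for topic, weight in tags if weight >= cutoff]

# TAGME annotations of D1 as listed in the TAGME table.
d1_tags = [
    ("Privacy", 0.1279),
    ("Historic preservation", 0.0017),
    ("Homology (biology)", 0.0354),
    ("Homology (biology)", 0.0521),
    ("Semi-structured data", 0.2727),
]

print(prune_tags(d1_tags, cutoff=0.1))
# [('Privacy', 0.1279), ('Semi-structured data', 0.2727)]
```

With this cut-off the off-topic annotations ("Historic preservation", "Homology (biology)") are dropped while the two relevant topics remain.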
33
Content-Based
Blocking and Tagging
[Figure panels: CBAC Threshold Query1 with Tagging; ABAC+CBAC Threshold Query1 with Tagging; CBAC Threshold Query2 with Tagging; ABAC+CBAC Threshold Query2 with Tagging]
34
CBAC Top-10 with Tagging
Content-Based
Blocking and Tagging
CBAC Top-10 with Blocking + Tagging
35
Content-Based
Blocking and Tagging
• Soundness of CBAC
36
Outline
• Introduction
• The Content-Based Access Control Model
• Content-Based Access Control Enforcement
• Content-Based Blocking and Tagging
• Multi-Label Learning
• Discussion and Conclusion
37
Multi-Label Learning
• Motivation
– We learned that curated annotation with domain knowledge
provides accurate annotations and boosts efficiency.
– The question arises: if the domain is not covered by the
Wikipedia database (e.g. "ylw banana" mapped to food, fruit,
snack), where do we get the topic annotations?
– Multi-label learning can learn from a labeled subset of
samples and predict the labels of the remaining samples, which
facilitates topic (label) annotation.
– We will dive into multi-label learning.
38
Multi-Label Learning
• Multi-Component Labels
• Multi-Facet Labels: Painter, Sculptor, Architect, Musician, Mathematician, Engineer, Inventor, Anatomist, Geologist, Cartographer, Botanist, Writer
• Hierarchical Labels: Women’s Apparel > Tops > Silk Tops > Silk Short Sleeve
39
Multi-Label Learning
• Ambiguity of Labels
40
Multi-Label Learning
• Ambiguity of Labels
• Label correlation helps to eliminate ambiguity of labels
41
Multi-Label Learning
• Uneven Label Distribution leads to Imbalance Problem
[Figure labels: Fruit; Grape]
Multi-Label Learning
• Pros & Cons of current problem transformations for
MLL
– Binary relevance decomposes MLL into one independent binary
classifier per label
• Pros: Simple, easy to parallelize
• Cons: Ignores the dependencies among labels
– Label power-set adds a preprocessing step that treats each distinct
label combination as a new single label
• Pros: Encodes the correlation among labels
• Cons: Introduces false correlations; worsens the imbalance problem
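The two problem transformations can be sketched on a toy label matrix. The matrix Y below is hypothetical (4 samples, 3 labels); only the target construction is shown, not the classifiers trained on the targets.

```python
def binary_relevance(Y):
    """Split an n x K label matrix into K independent binary target columns."""
    return [[row[k] for row in Y] for k in range(len(Y[0]))]

def label_powerset(Y):
    """Map each distinct label vector to one class id of a multi-class problem."""
    class_of = {}
    targets = [class_of.setdefault(tuple(row), len(class_of)) for row in Y]
    return targets, class_of

Y = [
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
]

print(binary_relevance(Y))   # [[1, 1, 0, 1], [0, 0, 1, 1], [1, 1, 0, 0]]
targets, class_of = label_powerset(Y)
print(targets)               # [0, 0, 1, 2]
print(class_of)              # {(1, 0, 1): 0, (0, 1, 0): 1, (1, 1, 0): 2}
```

Binary relevance yields three unrelated columns, so co-occurrence of labels is invisible to each classifier. The power-set targets keep the co-occurrence but spread four samples over three classes, which shows how the transformation can worsen imbalance.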
43
Multi-Label Learning
• Observations & Assumptions
– Observations:
• Error-correction coding methods are able to boost the accuracy in
detecting binary strings
– Assumptions:
• Error-correction coding brings in calibration to correct the misclassified
samples by adding new digits to the end of the binary strings.
• The added new digits should be as far apart as possible for different
messages to remove the ambiguity.
• The added new digits should be balanced across the entire set of binary
strings to correct the errors due to the imbalance problem.
44
Multi-Label Learning
• Preliminary
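– Training Data: $T = \{X, Y\}$, $X \subset \mathbb{R}^d$, $Y \subset \mathbb{B}^K$
– Unique Label Vectors: $U \subset \mathbb{B}^{m \times K}$
– Occurrence Weight Vector: $w \subset \mathbb{R}^m$
– Pseudo Label Set: $Z \subset \mathbb{B}^{m \times p}$
– Objective Function: let $b = \mathbf{1}_{1 \times m} \times ((w \times \mathbf{1}_{1 \times p}) \cdot Z)$, then
$$Q = \sum\sum Z Z^T + \sum\sum Z^T Z + \lambda \cdot b b^T$$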
45
Multi-Label Learning
• Algorithms
– Generate pseudo labels for training data
– Perform binary relevance transformation
– Make individual prediction on different labels and pseudo
labels for testing data
– Calibrate the prediction with pseudo labels
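The slides do not spell out the calibration step. One plausible reading, borrowed from standard error-correcting codes, is nearest-codeword decoding: each unique label vector is extended with its pseudo labels to form a codeword, and a predicted bit string is snapped to the closest codeword in Hamming distance. The sketch below is written under that assumption, with a hypothetical codebook (K = 3 labels, p = 3 pseudo labels); it does not show how the pseudo labels are generated.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

def calibrate(predicted_bits, codebook, num_labels):
    """Snap the predicted (labels + pseudo labels) bits to the nearest codeword
    and return only the original label part of that codeword."""
    nearest = min(codebook, key=lambda code: hamming(predicted_bits, code))
    return nearest[:num_labels]

# Hypothetical codebook: first 3 bits are labels, last 3 bits are pseudo labels.
codebook = [
    (1, 0, 0, 0, 1, 1),
    (1, 1, 0, 1, 0, 1),
    (0, 0, 1, 1, 1, 0),
]

# Per-bit predictions from the binary relevance classifiers for one test sample.
predicted = (1, 1, 0, 0, 1, 1)

print(calibrate(predicted, codebook, num_labels=3))  # (1, 0, 0)
```

In this example the label bits alone read (1, 1, 0), but the pseudo-label bits agree with the first codeword, so the decoded label vector is (1, 0, 0).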
46
Multi-Label Learning
• Data Sets
Multi-label data sets from diverse domains are used in the experiments.
In general, they are all imbalanced.
47
Multi-Label Learning
• Experiment
– SVMs with linear and radial (RBF) kernels, and Random Forest, are
chosen as our binary relevance classifiers.
– BPL versions of binary relevance MLL outperform naïve
binary relevance methods.
48
Multi-Label Learning
BPL outperforms other state-of-the-art methods in macro-averaged F1
49
Multi-Label Learning
BPL outperforms other state-of-the-art methods in micro-averaged F1
50
Multi-Label Learning
BPL outperforms other state-of-the-art methods in subset accuracy
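The three evaluation measures can be sketched with their standard definitions. The toy predictions are hypothetical and chosen to show how a rare label affects each measure; F1 is defined as 0 when a label has no true or predicted positives, which is an assumed convention.

```python
def f1(tp, fp, fn):
    """F1 from counts; defined as 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def counts(y_true, y_pred, k):
    """True positives, false positives and false negatives for label k."""
    pairs = [(t[k], p[k]) for t, p in zip(y_true, y_pred)]
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, fp, fn

def macro_f1(y_true, y_pred):
    """Average of the per-label F1 scores: every label counts equally."""
    K = len(y_true[0])
    return sum(f1(*counts(y_true, y_pred, k)) for k in range(K)) / K

def micro_f1(y_true, y_pred):
    """F1 over counts pooled across labels: frequent labels dominate."""
    K = len(y_true[0])
    pooled = [sum(c) for c in zip(*(counts(y_true, y_pred, k) for k in range(K)))]
    return f1(*pooled)

def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose whole label vector is predicted exactly."""
    return sum(list(t) == list(p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [[1, 0], [1, 0], [1, 1], [1, 0]]   # label 1 is rare
y_pred = [[1, 0], [1, 0], [1, 0], [1, 0]]   # the rare label is never predicted

print(macro_f1(y_true, y_pred))         # 0.5
print(micro_f1(y_true, y_pred))         # 0.888...
print(subset_accuracy(y_true, y_pred))  # 0.75
```

Missing the rare label halves the macro-averaged F1 but barely moves the micro-averaged F1, which is why macro-averaging is the more sensitive measure on imbalanced data.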
51
Outline
• Introduction
• The Content-Based Access Control Model
• Content-Based Access Control Enforcement
• Content-Based Blocking and Tagging
• Multi-Label Learning
• Discussion and Conclusion
52
Discussion and Conclusion
• Computational complexity:
• CBAC: O(N*m*t). N: number of records; m: size of base
set; t: size of the dictionary (TF/IDF)
• CBAC with multi-level blocking: O(m*t*log(N*x)). x:
number of clusters
53
Discussion and Conclusion
• Parallelization of CBAC:
– CBAC, blocking and labeling processes could be
parallelized.
– May work with MapReduce.
• Overall, CBAC incurs reasonable overhead and is
scalable.
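A map-reduce style evaluation could look like the sketch below: each partition of records is checked independently (map), and the per-partition results are merged by set union (reduce). Local threads stand in for cluster workers here, and the partitioning and the toy access predicate are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def map_partition(args):
    """Map step: ids of the records in one partition that pass the access check."""
    partition, allowed = args
    return {rid for rid, record in partition if allowed(record)}

def parallel_cbac(partitions, allowed, workers=4):
    """Evaluate the access check on all partitions in parallel, then merge."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(map_partition, [(p, allowed) for p in partitions]))
    # Reduce step: union of the per-partition result sets.
    return reduce(set.union, partial, set())

# Hypothetical partitions of (record id, text) pairs and a toy access predicate.
partitions = [
    [(1, "access control"), (2, "protein folding")],
    [(3, "database access"), (4, "cell biology")],
]
allowed = lambda text: "access" in text

print(sorted(parallel_cbac(partitions, allowed)))  # [1, 3]
```

The per-record decision depends only on the record and the user's base set, so partitions share no state and the merge is a plain union.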
54
Discussion and Conclusion
• Security Analysis
– CBAC != Relaxed security: content-based access control
does not imply weakened or relaxed security.
– Rather, it enforces an additional layer of access control on
top of existing “precise” access control methods.
Security Guarantee
When CBAC is correctly enforced and managed, a
malicious user cannot obtain access to sensitive
information by manipulating his/her accessible records,
creating spoofing records, or gaining (non-base-set)
access to similar insensitive information.
55
Discussion and Conclusion
• Conclusion
– CBAC is an access control model focusing on protecting
data content.
– CBAC makes access control decisions based on the semantic
similarity between the requester’s credentials and the content of
the remaining data in the database.
– Offline CBAC is efficient for databases that are not updated
frequently.
– With optimization of CBAC enforcement, there is only a little
overhead compared to queries without CBAC.
– Multi-label learning will be functional when curated database
annotation is not available.
56
Questions
57