Journal of Computer Security 22 (2014) 1–31
DOI 10.3233/JCS-130484
IOS Press
1
An optimization framework for role mining
Haibing Lu a,∗ , Jaideep Vaidya b and Vijayalakshmi Atluri b
a OMIS, Santa Clara University, Santa Clara, CA, USA
E-mail: hlu@scu.edu
b MSIS, Rutgers University, Newark, NJ, USA
E-mails: {jsvaidya, atluri}@cimic.rutgers.edu
Role Based Access Control (RBAC) is accepted as the de facto access control model for organizations
of all sizes. However, engineering the right set of roles is crucial to enable the correct deployment of RBAC
within an organization. Indeed, discovering an optimal and correct set of roles from existing permission
assignments, referred to as the role mining problem (RMP), has gained significant attention in recent
years. Role Mining is itself an instantiation of Boolean matrix decomposition – wherein a Boolean matrix
is decomposed into two Boolean matrices giving a set of basis vectors and their appropriate combination. In fact, such decompositions are useful in a number of application domains beyond role engineering,
including text mining as well as knowledge discovery. While a Boolean matrix can be decomposed in
many ways, however, certain decompositions better characterize the semantics associated with the original matrix in a succinct but comprehensive way. Indeed, one can find different decompositions that are
optimal with respect to different criteria that may match various semantics. In this paper, we first present
a number of variants of the optimal Boolean matrix decomposition problem, including usage RMP, basic
RMP, δ-approximate RMP, and edge RMP, that have pragmatic implications in the context of role mining. We then present a unified framework for modeling the optimal Boolean matrix decomposition and
its variants using integer linear programming (ILP). Such modeling allows us to directly adopt the huge
body of heuristic solutions and tools developed for integer linear programming. We also develop efficient heuristics and solutions for each RMP variant, and validate them by a comprehensive experimental
evaluation.
Keywords: Role based access control, role mining, Boolean matrix decomposition, integer linear
programming
1. Introduction
Role based Access Control (RBAC) is widely considered as the de-facto standard
used for advanced access control. Unlike traditional access control wherein permissions are directly assigned to users, in RBAC permissions are assigned to roles and
in turn roles are assigned to users. The main benefit of RBAC is to reduce the load
of administration, since the number of required roles is often orders of magnitude
lower than the number of users/permissions. However, to realize the full potential of
RBAC, it is necessary to define the complete and correct set of roles. This process
of defining roles and user-role assignments for access control is known as Role en* Corresponding author. E-mail: hlu@scu.edu.
0926-227X/14/$27.50 © 2014 – IOS Press and the authors. All rights reserved
2
H. Lu et al. / An optimization framework for role mining
gineering [2], and has been identified as the costliest component in realizing RBAC
by National Institute of Standard and Technology (NIST) [6]. Recently, automated
techniques have been developed for bottom-up role discovery and assignment. This
process of identifying the roles from the existing user-to-permission assignments is
referred to as the role mining problem (RMP) [34], and has been the focus of significant research.
Note that user-to-permission assignments can be formulated as a Boolean matrix,
with 1 at entry (i, j) denoting user i is assigned permission j. Similarly user-to-role
assignments and role-to-permission mappings can also be formulated as Boolean
matrices. Thus, Role Mining can be viewed as an instantiation of Boolean matrix
decomposition (BMD). In BMD, a Boolean matrix Am×n is decomposed into two
Boolean matrices Bm×k and Ck×n . However, instead of ordinary matrix product,
Am×n is decomposed as a Boolean matrix product of Bm×k and Ck×n . Now, the
rows of Ck×n can be interpreted as a set of k basic concepts and Bm×k as the combination matrix, which describes each row of Am×n as a subset of basic concepts.
In the context of role mining, the concepts correspond to roles, and the combinations
correspond to user assignments. Indeed, such decomposition solutions are useful in a
number of other application domains including text mining [7], knowledge discovery
in databases [21], sampling, compression and clustering [10,31].
Since BMD can be viewed as a generalization of RMP, we first discuss its characteristics in the general sense, before focusing on role mining. The goal of the BMD
problem is to find a Boolean matrix decomposition solution for an input Boolean
matrix. While, there exist many decompositions of a Boolean matrix, certain decompositions better characterize the semantics associated with the original matrix.
Indeed, one can find decomposition solutions that are optimal with respect to different criteria that match various semantics. Based on this, we can identify the following
four important variants of the BMD problem:
• Find a BMD solution that minimizes the number of basic describing concepts
(which corresponds to the number of rows in the decomposed basis matrix).
• Find a BMD solution that minimizes the number of basic describing concepts,
while allowing a certain amount of inexactness in the decomposition. (In other
words, by trading off certain amount of accuracy, the goal is to express the input
data in a much more succinct form.)
• Find a BMD solution that minimizes the density of the decomposed matrices,
where density corresponds to the number of elements with the value of 1 (density can be viewed as an indicator of problem complexity).
• Given one of the decomposed matrices, find the other one. (This can be interpreted as the problem of describing individual records by combinations of the
basic concepts. In effect, the fourth variant is a subproblem of the other three
variants – however, a specific interesting case.)
In this paper, we study all four BMD problem variants. To provide more meaningful discussions and results, all problems are studied in the context of role engineering. We consider several variants of RMP, including basic RMP, δ-approximate
H. Lu et al. / An optimization framework for role mining
3
RMP, edge RMP, and usage RMP, that map to the above four decomposition criteria
for BMD. The solution to basic RMP gives an optimal set of roles that fully covers
the existing permissions of users. Minimizing the number of required roles would
help fully exploit the benefits of RBAC. Thus, basic RMP is the primary RMP variant. Instead of complete coverage on existing permissions required by basic RMP,
δ-approximate RMP allows up to δ errors in the reconstruction. There are two benefits to this. First, collected existing user-to-permission assignments may contain
noise. Enforcing complete coverage in the role mining process would then produce
biased roles. By allowing a small number of reconstruction errors, an intelligent algorithm might be able to omit matching the noise. Second, it usually takes much more
roles for full coverage than the number required to cover the majority of the permissions. Thus, if a small set of roles covers most permissions, those roles can be interpreted as being more fundamental (or real). Finally, when the role mining process
only aims to find promising candidate roles instead of finalizing roles, δ-approximate
RMP would again be preferred. Apart from the number of roles, administration load
is another way of evaluating the goodness of a role set, which can be formulated as
the edge-RMP problem. Edge-RMP does not necessarily minimize the number of
roles but gives optimal sum of user-to-role and role-to-permission assignments. In
reality some role sets may be available from prior experience. The system engineer
needs to investigate how well they fit existing user-to-permission assignments. This
leads to the usage RMP problem. All of the four variants have significance in solving
the role engineering problem and help the security administrators to pick and choose
the roles that perfectly suit the organizational needs.
The Boolean matrix decomposition problems can also be mapped to other areas, since data is Boolean in nature in many other applications as well, and can
be modeled as a high-dimensional Boolean matrix. For example, market basket data
(whether a transaction includes a product item), text document data (whether a document includes a keyword), and movie reviews (whether a user likes a movie) can
all be modeled as a Boolean matrix. Therefore, BMD is useful in these applications
as well. For example, consider text mining. Assume the m × n Boolean matrix A
represents the documents-words mapping, with rows corresponding to documents,
columns corresponding to words, and the value of 1 indicating the corresponding document includes the corresponding word. Now, the BMD variant minimizing
the number of concepts would give two decomposed Boolean matrices with much
smaller sizes. The basis matrix can be interpreted as topics, each of which consists
of a subset of words. The combination matrix can be used to describe what topics
are included in each document. So it successfully extracts “topics” from numbers
without domain knowledge and gives a succinct representation of the original data
with those topics. The δ-approx BMD variant is also applicable in this context, since
it can find a more succinct representation of the original data, though at the cost of
not being able to exactly recreate the original matrix. BMD variants studied in this
paper can find many applications in the data mining and machine learning fields.
4
H. Lu et al. / An optimization framework for role mining
In reality, in addition to the four major BMD variants we introduced, there could
be many other variants with different constraints and objectives. In light of this, we
present a unified integer linear programming framework for formulating all BMD
variants. Such a framework allows us to easily formulate any new BMD variant
based on a prior BMD variant that has already been studied. Additionally, it enables
us to directly adopt the huge body of heuristic solutions and tools developed for integer linear programming. Furthermore, approximation algorithms might be designed
from their linear programming relaxation. However, since Boolean data problems
are usually NP-hard, to solve large-scale problems, specifically designed efficient
heuristics are required. In this paper, we propose efficient heuristics for each of the
four presented BMD variants. Experimental results show that our algorithms have
very good performance.
The rest of the paper is organized as follows. Section 2 reviews related work in
the literature on role mining and Boolean matrix decomposition. Section 3 presents
preliminary background, defines all RMP variants, and discusses their computational
complexity. Section 4 presents ILP formulations for all RMP variants. In Section 5,
we provide efficient heuristics for solving these RMP variants. Section 6 presents a
comprehensive experimental evaluation using real data. Finally, Section 7 concludes
the paper and discusses future work.
2. Related work
The role engineering problem was first proposed by Coyne [2]. Since then, it has
received much attention from the access control community. The role engineering
approaches generally include top-down and bottom-up. Top-down approaches [25,
28] mine roles by investigating semantics of business processes and user responsibilities, which is impossible for large-scale problems. Bottom-up approaches mine
roles purely from existing user-to-permission assignments, which make it possible to
automatically discover promising roles. During the past years several algorithms explicitly designed for bottom-up role engineering have been proposed [3,4,15,34–36,
38].
Kuhlmann et al. [11] present a clustering technique similar to the well known
k-means clustering, which requires pre-defining the number of clusters. In [30],
Schlegelmilch and Steffens propose an agglomerative clustering based approach to
role mining (called ORCA). Vaidya et al. [37] propose an approach based on subset
enumeration, called RoleMiner. Molloy et al. [24] propose to clean noisy data before
executing role mining. Streich et al. [33] present a multi-assignment clustering approach for Boolean data. It is a probabilistic approach and can be applied to the role
mining problem. Some role mining approaches like [5] suggest to discover roles by
taking into account of available business information on permissions and users.
Several criteria or metrics have been used to evaluate the goodness of a candidate role set in the literature. In [37], Vaidya et al. suggest three criteria. The first
one is known as the basic role mining problem, which is to minimize the number of
H. Lu et al. / An optimization framework for role mining
5
roles required to cover the whole existing permission assignments. The second one
relaxes the first one by allowing certain amount of errors. The third one is given the
number of roles to minimize the number of roles. In [37], Vaidya et al. introduce
another criteria, which is to minimize the administrative cost of the resultant RBAC
system. In [22], Molloy et al. propose a notion called weighted structural complexity
that combines some of the above metrics and assigns them some weights. In [15],
Lu et al. formulate some role mining variants through mixed integer program, which
greatly simplifies the role mining work. Several works including [4], [37] and [15]
allow errors in the role mining solutions. In those works, different error types, overassignments and under-assignments are treated equally. Some recent works such as
[23] point out that it is safer to under assign permissions than over assign permissions.
It is now well known that the role mining problem is NP-hard [34] and that the
solutions to many other well-known NP-hard problems, such as graph vertex coloring, clique partition, binary matrix factorization, bi-clustering can be used to develop
a good solution [1,3]. A good overview of different role mining solutions and their
evaluation can be found in [23].
As stated before, role mining can be viewed as an instantiation of Boolean matrix decomposition. Traditional matrix factorization techniques such as the Singular
Value Decomposition (SVD) [8] employ the ordinary matrix product with addition
and multiplication operations. However, the ordinary matrix product is not able to
model logical set operations such as set union. On the contrary, Boolean matrix product can perfectly formulate the set union operation. That is why BMD has received
increasing attention recently. Miettinen et al. [19,21] employ BMD to formulate the
discrete basis problem. The goal of the discrete basis problem is to find a minimum
subcollection from a given collection of element subsets such that every element subset can be described as the union of some subsets in that subcollection. It has important implications in knowledge discovery in databases. The tiling database problems
[7] can be formulated as a BMD problem variant as well. Vaidya et al. [34] prove the
equivalence of the RMP, the discrete database problem, and the tiling database problem. One-dimensional BMD is studied by Lu et al. in [18]. In [16,17], an extended
Boolean matrix decomposition model is presented, which enables the formulation
of negative permission assignments and can summarize a Boolean dataset in a more
succinct fashion.
This paper is an extension of our prior work [15]. However, the majority of the
article is new. First, we introduce and investigate a new RMP variant, the usage RMP
problem. In addition to being an independent problem that is quite useful in real life
situations, the usage RMP problem can be viewed as a subproblem of other RMP
variants, which provides us with a new perspective to study other RMP variants. We
also provide an integer linear programming formulation and fast heuristics for the
usage RMP problem, which have not been developed before. Additionally, building
upon the solutions to the usage RMP problem, we develop effective and efficient
algorithms for the other RMP variants, like edge-RMP.
6
H. Lu et al. / An optimization framework for role mining
Our approach has several benefits. As discussed before, the basic role mining problem aims to decompose a Boolean matrix into two Boolean matrices. If the size of
the input Boolean matrix is m by n, then the solution space is of size 2m∗k+n∗k ,
where k denotes the number of roles. Conventional role mining approaches try to
discover both decomposed matrices at the same time. They do this by picking an
initial solution from the huge solution pool and using some heuristic to find a local
optimum. This has two drawbacks. First, given the enormous size of the solution
pool, it is difficult to find a good initial solution, which also makes it unlikely to
reach a good local optimum. Second, the procedures are computationally very expensive, since many iterations may be necessary to reach the local optimum, and
every step requires updating both matrices. Instead, our approach in this paper is
to utilize the divide-and-conquer strategy to split the role mining problem into two
sub-problems: (i) generate one decomposed Boolean matrix (role-permission) and
(ii) then determine the other one, by formulating and solving the usage role mining problem. By doing so, the algorithm randomness is significantly reduced, as we
only need to sample a significantly smaller Boolean matrix. Furthermore, practical
knowledge of the likely structure of roles can be utilized to pick the initial roles.
Both of these increase the chance of finding a good initial solution. After solving the
corresponding usage role mining problem, probabilistically we may obtain a better
decomposition solution. Also, since the whole procedure is less computationally expensive than the conventional role mining approaches, we can repeat the procedure
with different initial solutions, which again increase the chancing of finding a better
final solution. Effectively, this significantly reduces the computational complexity
and also increases the chances of obtaining a better solution.
All of the algorithmic solutions are completely new and comprehensive evaluations of the solutions are conducted on both synthetic and real data. Overall, we
extend the work in [15] by introducing a new problem, new algorithmic solution
techniques, and comprehensive experimental evaluation. Thus, the article provides
significant technical content that is of additional value well beyond the work in [15].
3. Boolean matrix decomposition and role mining
We start out by defining Boolean matrix multiplication, and Boolean matrix decomposition, and then briefly overview Role Based Access Control nomenclature
before actually discussing the role mining problem variants.
Definition 1 (Boolean matrix multiplication [34]). A Boolean matrix multiplication
between Boolean matrices A ∈ {0, 1}m×k and B ∈ {0, 1}k×n is A ⊗ B = C where
C is in space {0, 1}m×n and
Cij =
k
l=1
(Ail ∧ Blj ).
H. Lu et al. / An optimization framework for role mining
7
Boolean matrix multiplication thus essentially uses the logical set union operation to combine separate elements. Boolean matrix decomposition looks for a matrix
decomposition that satisfies Boolean multiplication.
Definition 2 (Boolean Matrix Decomposition (BMD)). If A ⊗ B = C, where A, B
and C are Boolean matrices, A ⊗ B is called a BMD solution of C.
Before discussing the role mining problems, we quickly review the basic nomenclature in Role Based Access Control. We adopt the NIST standard of the role based
access control model [6]. Note that in this paper we do not consider separate sessions,
i.e., all accesses are assumed to belong to a single session.
Definition 3 (RBAC0 ).
•
•
•
•
U , R, P and S denote users, roles, permissions and sessions, respectively;
PA ⊆ P × R, a many-to-many permission-to-role assignment relation;
UA ⊆ U × R, a many-to-many user-to-role assignment relation;
user : S → U , a function mapping each session si to the single user user(si )
(constant for the session’s lifetime);
• roles : S → 2R , a function mapping each session si to a set of roles roles(si ) ⊆
{r|(user(si ), r) ∈ UA} (which can change with time) and session si has the
permissions Ur ∈ roles(si ){p|(p, r) ∈ PA}.
In the role mining problem, the input is essentially the user-to-permission assignments UPA, and the goal is to find a good set of roles PA and appropriate user-to-role
assignments UA. Since all users should get the exact same permission sets after role
assignment as they originally had, the relation UPA = UA ⊗ PA holds. Mathematically, the role mining problem can be viewed as looking for a BMD solution of UPA.
Given UPA, there are many possible BMD solutions. However, only those optimizing
a specific criterion are desired.
One common criteria is the number of roles. In the BMD terminology, it corresponds to the number of rows in the decomposed Boolean matrix PA. As mentioned
earlier, the main reason why RBAC is preferred to traditional access control mechanism is that the required roles in RBAC are usually much less than permissions. As a
result, granting roles instead of permissions to users can significantly reduce administrative cost. Thus minimizing the number of roles can help reveal the full potential
of adopting RBAC. The basic RMP problem is formulated for this purpose. It was
first introduced by Vaidya et al. [34]. Its formal definition is as follows:
Problem 1 (Basic RMP [34]). Given user-to-permission assignments UPA, find
user-to-role assignments UA, and role-to-permission assignments PA, such that
UA ⊗ PA = UPA and the number of roles is minimized.
8
H. Lu et al. / An optimization framework for role mining
In some cases, it is not necessary to enforce the exact equivalence of UA ⊗ PA
and UPA. One illustrative case is that the given UPA contains noise, because of the
transfer of employees during the RBAC system implementation process. The exact
equivalence will cause the resulting solution to overfit the noise. The other case is
that some hybrid role mining approaches only employ bottom-up role mining approaches to find interesting or potentially valuable roles before finalizing the role
set. Relaxing the constraint of UA ⊗ PA = UPA would certainly help find more interesting roles. Such relaxation leads to another RMP variant, called δ-approximate
RMP. Its definition is as follows:
Problem 2 (δ-approximate RMP [34]). Given user-to-permission assignments UPA,
and a threshold δ, find user-to-role assignments UA, and role-to-permission assignments PA, such that UA ⊗ PA − UPA1 δ and the number of roles is minimized.
The · 1 operator is defined as the following.
Definition 4. X1 =
ij
|xij |.
The number of employed roles is not the only way of evaluating the goodness of
a set of roles. Strictly speaking, the administrative cost of a RBAC system can also
be measured by the total number of elements in UA and PA with the value of 1 (thus
representing the administrative burden). To record a role in a RBAC system, only its
contained permissions need to be maintained. Similarly, only granted roles need to
be maintained for a user assignment. This objective leads to another RMP variant,
called the edge RMP.
Problem 3 (Edge RMP [36]). Given user-to-permission assignments UPA, and a
threshold δ, find user-to-role assignments UA, and role-to-permission assignments
PA, such that UA ⊗ PA = UPA and UA1 + PA1 is minimized.
In addition to three above role mining variants, we present a new variant, called
the usage RMP. Consider the following scenario. A company desires to migrate from
the traditional access control system to RBAC. A consultant is hired to do this – the
consultant may have several past experiences in doing such work. The convenient
way to deal with the role mining problem is to borrow successful experiences from a
peer company, and to take the role set employed by the peer and check if the role set
fits the companies’ purpose. In the BMD terminology, assuming you are also given
PA, find the UA to reconstruct the existing UPA. This is formally defined as follows.
Problem 4 (Usage RMP). Given user-to-permission assignments UPA and role-topermission assignments PA, find user-to-role assignments UA, such that UA ⊗ PA −
UPA1 is minimized.
H. Lu et al. / An optimization framework for role mining
9
The usage RMP has two implications. First, when the optimal objective value of
min UPA − UA ⊗ PA1 is very small, a satisfactory role set can be obtained by
modifying such obtained PA to a small extent. In fact, the problem of migrating to
desired RBAC with minimal perturbation was studied in [35]. We can simply adopt
their solution to accomplish our goal. In this sense, the usage RMP serves as a subproblem of the other role mining variants. For instance, consider the δ-approximate
RMP problem. This can be broken down into two steps: finding a promising role
set and then checking if the role set can reconstruct UPA within the acceptable error
amount – now the second sub-problem corresponds to the usage RMP.
3.1. Complexity analysis
We now discuss the computational complexity of the above discussed BMD/RMP
problems. The basic RMP can be viewed as a special case of the δ-approx RMP with
δ of 0. In [34], the decision version of the basic RMP is proven to be NP-complete.
So is the decision version of the δ-approx RMP. The difference between the basic
RMP problem and the edge RMP problems is only in terms of the objective. The
decision version of the edge RMP is also proven to be NP-complete in [36]. As all
of the RMP variants are optimization problems, we have the following statement.
Statement 1. The basic RMP, the δ-approx RMP, and the edge RMP are all NPhard.
Theorem 1. The role usage problem is NP-hard and it is even NP-hard to approx1−ε
imate the role usage problem within a factor of Ω(2log P ), where P is the least
number of permissions that a user has in UPA.
Proof. First, we prove the hardness of the role usage problem. An instance of the
decision role usage problem can be represented by {UPAm×n , PAk×n , δ }, where
δ is a non-negative integer. Given a solution UAm×n , it takes polynomial time to
check whether UPA − UA ⊗ PA1 δ . So the role usage problem belongs to NP.
Furthermore, the basis usage problem [19], studied in the data mining field, can be
mapped to the role usage problem as follows. The basis usage problem aims to find
a binary matrix Xk×n such that Am×n − Cm×k ⊗ Xk×n 1 is minimized, where
Am×n and Cm×k are given and · 1 denotes the L-1 norm. For each instance
{Am×n , Cm×k , δ} of the decision basis usage problem, we can find a corresponding equivalent instance {UPA, PA, δ } of the decision role usage problem with the
following mapping: UPA = A, PA = C, and δ = δ. Thus, there exists a solution
Xk×n such that A − C ⊗ X1 δ if and only if there exists a solution UA such
that UPA − UA ⊗ PA1 δ. Since the basis usage problem is known to be NP-hard,
the role usage problem is also NP-hard.
Second, we prove the hardness-of-approximation of the role usage problem (the
study for the basis usage problem can be found in [19]). The role usage problem
10
H. Lu et al. / An optimization framework for role mining
can be related to the ±PSC (positive-negative partial set cover) problem. The ±PSC
problem is defined as follows: given a collection S of subsets of positive elements P
and negative elements N , find a subcollection C minimizing |P \ (∪C)| + |N ∩ (∪C)|,
where the |·| operator returns the count of elements in a set. Miettinen [20] shows that
1−ε
it is NP-hard to approximate the ±PSC problem to within a factor of Ω(2log |P | )
for any ε > 0, where |P | is the number of positive elements. For each instance
{P , N , S} of the ±PSC problem, we can create a corresponding equivalent instance
{UPAm×n , PAk×n } of the role usage problem as follows:
• let m (the number of users) be 1, n (the number of permissions) be |P | + |N |;
• let the first |P | permissions correspond to the |P | positive elements and the
remaining |N | permissions correspond to the |N | negative elements;
• let UPA be a 1 × (|P | + |N |) Boolean vector with the first m elements of being
1 and the rest being 0 (i.e. UPA corresponds to all positive elements);
• let k be |S| (the size of the subcollection of subsets);
• create a row vector for PAk×n corresponding to each subset in S such that if
the subset contains element i (positive or negative), make the vector component
corresponding to the element as 1 and the rest vector components as 0.
Given the mapping, a subcollection C of the solution to the ±PSC problem instance
can be mapped to UA1×k , the solution to the role usage problem instance, such that
its k vector components correspond to the subsets in S respectively, i.e. the ith component being 1 if its corresponding subset is in C; otherwise 0. Therefore, the constructed role usage problem instance is equivalent to the ±PSC problem instance and
both instances share the hardness-of-approximation result. Approximating each row
vector of UPA by PA can be viewed as a ±PSC problem instance. So it is NP-hard to
1−ε
approximate the role usage problem within a factor of Ω(2log P ), where P is the
least number of permissions that a user has in UPA (i.e number of 1’s elements in a
row vector of UPA). 2
4. Unified framework using integer linear programming
Role mining variants share much commonality. For example, the only difference
between the basic RMP and the edge RMP is in their objective function. The difference between the δ-approx RMP and the basic RMP is that the δ-approx RMP allows
inexactness. A naturally arising thought is that if we can build a unified framework
for all RMP variants, we then do not need to deal with each problem individually.
Additionally when new RMP variants appear, system engineers do not need to start
from scratch and can take advantage of algorithms that have been developed for existing RMP variants. As all RMP variants are essentially optimization problems, we
propose to formulate them through integer linear programming.
There are many benefits by connecting RMP variants with integer linear programming. First, optimization has been studied for more than 50 years. There are several
H. Lu et al. / An optimization framework for role mining
11
good exact optimization algorithms, even for integer linear programming, such as
branch-and-bound [13]. In addition, successful optimization software packages are
easily obtainable, such as Matlab and the Neos server.1 Even though those RMP variants are proven NP-hard, small or medium size problems can still be solved through
traditional optimization techniques. In addition to exact algorithms, approximation
algorithms may be developed through LP-based techniques, such as dual-fitting [14],
randomized rounding [9], and primal-dual schema [12]. The linear programming
framework we will propose in fact can not only incorporate RMP variants, but also
problems in other application domains, such as tiling database problems [7] and discrete basis problems [21].
4.1. Usage RMP
We start from the usage RMP as it is essentially a subproblem of the other RMP
variants. In the usage RMP, the input is UPA and PA, and the goal is to find UA
minimizing the reconstruction error. For simplicity of notation, we denote PA by the
Boolean matrix R and UA by the Boolean matrix variable X. Then the role usage
problem can be roughly represented as follows:
minimize UPA − X ⊗ R1 .
To formulate it as an explicit integer linear programming problem, we first formulate UPA − X ⊗ R = 0 and then relax it by tolerating errors. To do so, we let Ri
denote role i and UPAi denote permissions assigned to user i. UPA − X ⊗ R = 0
implies that every user’s permission set should be represented as a union of some
candidate roles. This can be modeled as follows:
UPAi =
Rt ,
(1)
t∈si
where si denotes the role subset assigned to user i. {si } can convert to a Boolean
matrix X such that Xit = 1 if role t belongs to si , otherwise Xit = 0.
The constraint essentially says that if some user has a particular permission, at
least one role having that permission has to be assigned to that user. In turn, if that
user does not have some permission, none of the roles having that permission can be
assigned to it. So X ⊗ R = UPA can be transformed to the following equation set
⎧ q
⎪
⎪
⎪
Xit Rtj 1, if UPAij = 1,
⎪
⎨
t=1
q
⎪
⎪
⎪
⎪
Xit Rtj = 0, if UPAij = 0.
⎩
t=1
1 http://www-neos.mcs.anl.gov/.
(2)
12
H. Lu et al. / An optimization framework for role mining
To formulate inexactness, we introduce a non-negative slack variable {Vij } to each
constraint and have the following modified constraints.
⎧ q
⎪
⎪
⎪
Xit Rtj + Vij 1, if UPAij = 1,
⎪
⎨
t=1
(3)
q
⎪
⎪
⎪
⎪
Xit Rtj − Vij = 0, if UPAij = 0.
⎩
t=1
q
q
For UPAij =
t=1 Xit Rtj + Vij 1allows
t=1 Xit Rtj to be 0. For
q1,
q
UPAij = 0, t=1 Xit Rtj − Vij = 0 allows t=1 Xit Rtj to be greater than 1. In
other words, if Vij > 0, the constraint enforcing whether UPAij = 1 or UPAij = 0
is not satisfied. The objective function UPA − X ⊗ R1 is then the count of unsatisfied constraints for UPA = X ⊗ R. To formulate the objective function, we need to
count the positive variables {Vij }. To do so, we introduce another Boolean variable
set {Uit } and enforce the following constraints
M Uij Vij ,
Uij Vij ,
where the big M is a large constant greater than q. The above two inequalities ensure
that
if Vij 1, Uij = 1 and if Vij = 0, Uij = 0. Thus, the count of positive {Vij } is
ij Uij . Therefore, the integer linear programming formulation for δ-approx RMP
is as follows:
Uij
minimize
ij
⎧ q
⎪
⎪
⎪
Xit Rtj + Vij 1,
⎪
⎪
⎪
⎪
t=
1
⎪
⎪
q
⎪
⎨
Xit Rtj − Vij = 0,
⎪
⎪
t=
1
⎪
⎪
⎪
M Uij − Vij 0,
⎪
⎪
⎪
⎪
Vij ,
U
⎪
⎩ ij
Xit , Uij ∈ {0, 1}, Vij 0.
if UPAij = 1,
if UPAij = 0,
(4)
∀i, j,
∀i, j,
4.2. Basic RMP
With the ILP formulation for the usage RMP, it is easy to formulate others. Consider the basic RMP, where the input is UPA, and the goal is to find UA and PA. It
can be succinctly modeled as an optimization problem as follows:
minimize |Roles|
s.t. UA ⊗ PA = UPA.
H. Lu et al. / An optimization framework for role mining
13
For simplicity, we break up the basic RMP into two subproblems: (i) find a candidate role set {R1 , R2 , . . . , Rq }; (ii) locate a minimum candidate role subset to form
PA and then determine UA.
A role is nothing, but a permission subset. Given n permissions, there are 2n possible roles. But most of them can be easily eliminated. For example, if none of users
had both permissions i and j, any permission subset containing both permissions i
and j can never be a candidate role. We will explicitly discuss the generation of candidate roles in the next section. Here assume we have already produced a candidate
role set {R1 , R2 , . . . , Rq }. Then consider the second subproblem, which is similar to
the usage RMP.
Every user’s permission set should be able to be represented as a union of some
candidate roles. This can be phrased as the following:
UPAi =
Rt ,
(5)
t∈si
where si denotes the candidate role subset assigned to user i and UPAi denotes the
permission subset assigned to user i. {si } can convert to a Boolean matrix X such
that Xit = 1 if candidate role t belongs to si , otherwise Xit = 0. As both X and
R are Boolean matrices, their relation can be represented by the following Boolean
matrix multiplication,
Xn×q ⊗ Rq×m = UPAn×n ,
(6)
where R and UPA are given and X is unknown. As Xit indicates whether Rj is
assigned to user i, so if ∀t Xit = 0, it means that Rt is never employed. So the basic
RMP is essentially to minimize the number of non-zero columns in X.
Same as the usage RMP, the constraint X ⊗ R = UPA can be enforced by
⎧ q
⎪
⎪
⎪
Xit Rtj 1, if UPAij = 1,
⎪
⎨
t=1
q
⎪
⎪
⎪
⎪
Xit Rtj = 0, if UPAij = 0.
⎩
(7)
t=1
Now we need to count the number of non-zero columns in X. To do so, we introduce a set of Boolean indicator variables {d
1 , . . . , dq }, where dt = 1 indicates role
q
t is present, otherwise not. Then |Roles| = t=1 dt . Since dt indicates the presence
of a role, dt should only be 1 when at least one user is assigned role t. Thus, dt = 1
when at least one of {X1t , . . . , Xmt } is 1. We can formulate this by adding in the
constraints
dt Xit ,
∀i, t.
(8)
14
H. Lu et al. / An optimization framework for role mining
Finally, putting every thing together, the basic RMP is formulated as follows:
minimize
dt
t
⎧ q
⎪
⎪
⎪
Xit Rtj 1, if UPAij = 1,
⎪
⎪
⎪
⎪
⎪
⎨ t=q 1
Xit Rtj = 0, if UPAij = 0,
⎪
⎪
⎪
⎪
t=1
⎪
⎪
⎪
d Xij ,
∀i, j,
⎪
⎩ t
dt , Xij ∈ {0, 1}.
(9)
4.3. δ-approximate RMP
The goal of the δ-approx RMP problem is to minimize the number of required
roles
amount is formulated
within a tolerable amount of error. In Eq. (4), the error as ij Uij . In Eq. (9), the number of roles is formulated as t dt . By combining
these two equation systems together, we easily obtain the formulation for the δapproximate RMP problem as Eq. (10):
minimize
dt
t
⎧ q
⎪
⎪
⎪
Xit Rtj + Vij 1,
⎪
⎪
⎪
⎪ t=1
⎪
⎪
⎪
q
⎪
⎪
⎪
⎪
Xit Rtj − Vij = 0,
⎪
⎪
⎪
⎨ t=1
MU ij − Vij 0,
⎪
⎪
⎪
Uij Vij ,
⎪
⎪
⎪
⎪
Xit ,
d
t ⎪
⎪
⎪
⎪
⎪
Uij δ,
⎪
⎪
⎪
⎪
i
j
⎪
⎩
dj , Xit , Uij ∈ {0, 1}, Vij 0,
if UPAij = 1,
if UPAij = 0,
∀i, j,
∀i, j,
∀i, t,
(10)
m
where n
i=1
j=
1 Uij δ ensures the total deviating cells less than δ and the
q
objective function t=1 dt is the number of mined roles.
4.4. Edge RMP
Unlike the basic RMP, the goal of the edge RMP is to minimize the administrative
cost. As a RBAC system only needs to maintain explicit user-to-role assignments and
H. Lu et al. / An optimization framework for role mining
15
explicit role-to-permission assignments, the administrative cost is hence evaluated by
positive cells in UA and PA. By employing the same notations in the ILP formulation
for the basic RMP as in Eq. (9), the administrative cost can be formulated as the
following
UA1 + PA1 =
i
t
Xit +
t
dt
Rtj ,
(11)
j
where i t Xit is the count of positive cells in UA and t (dt j Rtj ) counts
positive cells in selected candidate roles, which is PA1 . Thus, the ILP formulation
for the edge RMP can be obtained by replacing the objective function in Eq. (9) with
the formula in Eq. (11). Finally, in cases where one evaluates the administrative cost
as the number of assignments in UA, PA, and the number of roles, the objective is to
minimize |UA| + |PA| + |UPA| [3,38]. This can also be easily obtained by combining
the objective function in Eqs (9) and (11).
4.5. Discussion
From the ILP formulating processes for those role mining variants, we observe
that once one role mining variant is successfully formulated, it is easy to formulate
other variants. This fact illustrates the benefit of building a unified ILP framework of
the role mining problem. Such an ILP framework is broad and flexible to incorporate
new variants, which may appear in practice. For instance, a system engineer may not
want to see 0-becoming-1 errors in the δ-approx RMP problem. In the RBAC context,
0-becoming-1 errors mean that a user obtains unauthorized permissions, which may
harm the system security severely. To reflect this concern
the ILP formulation, we
in
q
could simplyreplace the constraint for UPAij = 0 of t=1 Xit Rtj − Vij = 0 in
q
Eq. (10) by t=1 Xit Rtj = 0. Consider another instance that all roles are expected
to be
highly repetitively used.
To realize this expectation, we could add a constraint
that i Xit LB, where i Xit is the number of users granted role t and LB is a
specified lower bound. This will ensure that every picked role is employed more than
LB times. Thus, other variants of the role mining problems can be created simply by
adding more constraints or modifying the objective function. This is why we consider
the optimization framework to be a general framework capable of modeling various
matrix decomposition problems.
5. Heuristics
The ILP formulation for the usage RMP takes mn variables and for other RMP
variants takes about mq variables, where m is the number of users, n is the number
of permissions, and q is the number of candidate roles. The state-of-art ILP software
16
H. Lu et al. / An optimization framework for role mining
packages can deal with up to millions variables. So for problems of sizes with m
and q greater than 1000, we have to resort to efficient heuristics. In this section, we
will propose efficient heuristics for RMP variants. They are easy to implement and
run fast. Their effectiveness will be validated in the next section. Like the above ILP
formulations that require candidate roles to be given, our heuristics need candidate
roles generated beforehand as well. So before presenting our heuristics, we will first
discuss ways of generating candidate roles.
5.1. Candidate role set generation
We present three ways of generating candidate roles. All of these have been proposed in prior work either in the role mining context or in the discrete basis context.
Note that any other technique could also be used, since it is merely the first step of
our approach.
Intersection. We call the first method Intersection, which is proposed by Vaidya et
al. in [37]. This method includes every unique user’s permission set as a candidate
role. In addition the intersections of every two user permission sets are included as
candidate roles as well. This method is based on two observations. First in order
to make a permission subset be a role, it must be employed and assigned to some
user. In other words, a candidate role must be a subset of some user’s permission
set. The other observation is that to make a RBAC system succinct and efficient,
roles should be repetitively employed. So a role is desired to be assigned to multiple
users. Hence, a candidate role is expected to be the intersection of two or multiple
user’s permission sets. Given m users, candidate roles generated by the Intersection
method is O(m2 ).
Association. This method exploits the correlations between the columns of UPA
by employing the association rule mining concept in [26]. It was presented as a part
of the ASSO algorithm proposed for the discrete basis problem. The concrete generation process is as follows. Suppose the user-to-permission assignment UPAm×n is
given. Let UPA(:, i) denote the ith column. Then n candidate roles will be generated,
represented by a Boolean matrix Cn×n . In which, Cij = 1 if
UPA(:, i), UPA(:, j) \ UPA(:, i), UPA(:, i) τ ,
(12)
otherwise 0, where ·, · is the vector inner product operation, and τ is a predetermined threshold controlling the level of correlation.
Itself. This method is to simply treat unique user’s permission sets from UPA as
candidate roles. It was studied in [19], but in the scenario of the discrete basis problem. The problem is to find a discrete basis for an input Boolean matrix, such that
the discrete basis is a subset of columns (or rows) of the input Boolean matrix. In
the role mining context, if employing this method to generate candidate roles, it is
assumed that for every role there must be one user who is granted this role and only
this role. While this may not necessarily make sense in the role mining context, it
can indeed be meaningful in other domains such as text mining.
H. Lu et al. / An optimization framework for role mining
17
Algorithm 1. Greedy algorithm for usage RMP
Input: UPA1×n and PAr×n
Output: UA1×r
1: UA1×r ← (0, . . . , 0);
2: for i ← 1 : r do
3:
if PAi ⊆ UPA then
4:
UA[i] = 1;
5:
end if
6: end for
7: loop
8:
PAi = the best remaining role, j = the largest improvement;
9:
if j > 0 then
10:
UA[i] = 1;
11:
else
12:
break
13:
end if
14: end loop
5.2. Usage RMP
As we mentioned earlier, the usage RMP is equivalent to the basis usage problem. A heuristic called Loc & IterX was proposed for the basis usage problem in
[19]. Here we propose two more effective heuristics, one greedy approach and one
simulated annealing approach.
Without loss of generality, we assume that there is only one user. Hence UPA is
1 × n. A role set PAr×n is given. The usage RMP becomes to find a subset of r roles,
such that the union of permissions contained in those roles is closest to UPA.
Greedy. A greedy algorithm is any algorithm that makes the locally optimal choice
at each stage with the hope of finding the global optimum. It has many successful
applications, such as the traveling salesman problem and the knapsack problem. Our
greedy approach consists of a preliminary step. It is to include all roles in PA which
are subordinate to UPA into UA. A role is considered subordinate to UPA, if all
permissions in the role belong to UPA. After that, iteratively pick one remaining
role which reduces the reconstruction error by the most, till the reconstruction error
cannot be reduced any more.
Simulated annealing. Simulated annealing is a generic probabilistic heuristic for
the global optimization problem. It locates a good approximation to the global optimum of a given function in a large search space. Unlike greedy heuristics, simulated
annealing is a generalization of a Markov chain Monte Carlo method, which has a
solid theoretical foundation. We present a simulated annealing heuristic for the role
usage problem as follows: First, we let (0, . . . , 0) be the starting state of UA1×r .
18
H. Lu et al. / An optimization framework for role mining
Then at each stage, randomly select its neighboring value by randomly picking one
element of UA and flipping its value from 0 to 1 or from 1 to 0. If the new UA better reconstructs UPA, the next state is the new UA. If not, with a certain probability
less than 1, the next state is the new UA. In other words, with certain probability, it
remains its original state. This property reduces the chance of being stuck at a local
optimum. The procedure described above allows a solution state to move to another
solution state and hence produces a Markov chain. Denote the nth state by x and the
randomly selected neighboring value be y. If the next state is y with probability
exp{λV (x)/N (x)}
min 1,
exp{λV (y)/N (y)}
or it remains x, where λ is a constant, V (t) is the reconstruction error with the solution t, and N (t) is the number of neighboring values of t. Such a Markov chain
has a limiting probability of 1 for arriving at optimal minimization solutions when
λ → ∞ [29]. But it has been found to be more useful or efficient to allow the value
of λ to change with time. Simulated annealing is a popular variation of the preceding procedure. Here, we adopt the formula proposed by Besag et al. [27] and let the
transition probability be
exp{λn V (x)/N (x)}
,
min 1,
exp{λn V (y)/N (y)}
where λn = log(1 + n). In our case, N (x) and N (y) are equivalent and are canceled
out in the formula. As computing time is limited, we terminate the algorithm after
a certain number of iterations regardless of whether or not the global optimum is
reached. Our complete simulated annealing algorithm is as described in Algorithm 2.
5.3. Basic RMP
The heuristic proposed for the basic RMP also runs in a greedy fashion. Consider
the ILP formulation for the basic RMP given in Eq. (9). There are two main types of
constraints, one for
{UPAij = 1} and the other for {UPAij = 0}. In order to satisfy
q
the constraint set { t=1 Xit Rtj = 0, if UPAit = 0}, if Rjt = 1, Xit must be equal
to 0. Therefore,
many variables Xit can be determined directly in this way. Then the
q
constraint set { t=
1qXit Rtj = 0, if UPAit = 0} is all satisfied.
Next consider { t=1 Xit Rtj 1, if UPAit = 1}. To satisfy it, if the particular
cells Rt1 j , Rt2 ,j , . . . , Rtk ,j are 1, one of the associated cells Xit1 , Xit2 , . . . , Xitk has
to be 1.
q
Now, consider the objective function min t=1 dt . As dt is determined by the
constraint dt Xit . Hence once one of the cells {Xit , ∀t} is 1, dt is forced to
be 1. In other words, no matter how many variables in {Xit , ∀t} take the value
of 1, the objective function value is always increased by 1. Our greedy algorithm
H. Lu et al. / An optimization framework for role mining
19
Algorithm 2. Simulated annealing algorithm for usage RMP
Input: UPA1×n and PAr×n
Output: UA1×r
1: UA, x ← (0, . . . , 0); n ← 1;
2: while n limit and V (UA) > 0 do
3:
y = a random neighboring value of x;
4:
if V (y) < V (UA) then
5:
UA ← y;
6:
end if
7:
x = y with probability
exp{log(1+n)V (x)}
min{1, exp{
C log(1+n)V (y )} };
8:
n ← n + 1;
9: end while
utilizes this property. At each step, choose a candidate role Rt such that by setting
q {Xit , ∀t}, except those predetermined, be 1, the most remaining constraints
{ t=1 Xit Rtj = 1, if UPAij = 1} are satisfied. We call the count of such satisfied remaining constraints the basic-key.
Definition 5 (Basic-key).
For each column of the variable matrix {Xit }, the count
q
of the constraints { t=1 Xij Rtj = 1, if UPAij = 1} being satisfied by letting the
cells of such a variable column be 1 except the cells predetermined, is called its
basic-key.
If there are multiple columns with the same greatest basic-key, we simply choose
the column with the least index. Once a column is chosen, it means that the associated
candidate role is chosen and the associated dt is set to be 1. Then delete the satisfied
constraints and perform the same procedure repetitively till all the constraints are
satisfied. The full procedure of this greedy algorithm is given in Algorithm 3.
5.4. δ-approximate RMP
From the ILP formulation perspective, the δ-approximate RMP seems much more
complicated than the basic RMP. However an efficient greedy heuristic for the δapproximate RMP can be easily developed by modifying the greedy heuristic for
the basic RMP. As the δ-approximate RMP tolerates δ amount of errors, we can
terminate Algorithm 3 early once the remaining uncovered 1’s cells are less than δ.
5.5. Edge RMP
Remember that the goal of the edge RMP is to minimize the administrative cost
UA1 + PA1 , while the basic RMP only aims to minimize the number of roles.
20
H. Lu et al. / An optimization framework for role mining
Algorithm 3. Greedy algorithm for basic RMP
Input: Unique user-to-permission assignments UPAm×n and candidate roles
{R1 , . . . , Rq }.
Output: UA and PA
1: PA = ∅;
q
2: Investigate the constraints { t=1 Xit Rtj = 0} in Eq. (9) and determine some
of Xit to be 0;
3: Select the candidate role Rt with the greatest basic-key value and include it in
PA.
4: Let undetermined values in {Xit } be 1 and deleter hence satisfied constraints in
{ qt=1 Xit Rtj > 1};
q
5: Go back to step 3 till all constraints of { t=1 Xit Rtj
= 0} and
q
{ t=1 Xit Rtj > 0} are satisfied.
6: Set the remaining variables in X to be 0.
7: UA is the subset of rows in X corresponding to selected roles in PA.
So a basic RMP optimal solution is not necessarily an edge RMP solution. The edge
RMP was studied in [36], where a greedy heuristic was proposed. To distinguish it,
we call it Edge-Key. Here we propose a new heuristic, called Two-Stage.
The edge RMP objective consists of two parts UA1 and PA1 . Intuitively lower
number of roles leads to a lower value of PA. In this sense, an optimal solution of
basic RMP can be a reasonable solution for edge RMP. Our greedy heuristic for basic
RMP is able to produce a good PA. However, UA produced by the greedy heuristic
is just a byproduct and not in its optimal form. We can start a second phase and
reduce user-role assignments by reassigning obtained roles of PA to each user. It is
essentially another basic RMP problem that assigns the minimum roles from PA to
exactly cover all permissions for a user. Therefore, we can again adopt our greedy
heuristic. It is obvious that if a role contains permissions not originally possessed
by a user, the role can never be assigned to the user. With this rationale, we can
simplify the second-phase task by filtering out unlikely roles from PA in advance.
The complete algorithm is as described in Algorithm 4.
6. Experimental evaluation
In this section, we conduct extensive experiments on both synthetic data sets and
real data sets to evaluate the performance of our heuristics.
6.1. Synthetic data
The synthetic data sets are generated as follows. Firstly, generate a set of unique
roles PAr×n and user-to-permission assignments UAm×r in a random fashion. UPA
H. Lu et al. / An optimization framework for role mining
21
Algorithm 4. Two-stage algorithm for edge-RMP
Input: User-permission assignments UPAm×n and candidate roles {R1 , . . . , Rq }.
Output: UA1 + PA1
1: Run Algorithm 3 with {R1 , . . . , Rq } to obtain PA.
2: for each UPAi do
3:
Remove roles that contain permissions not belonging to UPAi from PA.
4:
Run Algorithm 3 with remaining roles in PA to obtain UAi .
5: end for
Table 1
Notations
m
n
r
ρ1
ρ2
# of users
# of permissions
# of roles
Density of PA
Density of UA
noise
limit
Noise percentage
Maximum number of iterations
is the Boolean product of PA and UA. Two parameters ρ1 and ρ2 are employed to
control the density of 1’s cells in PA and UA respectively, which then determine the
density of 1’s cells in UPA. Precisely, to create a role, generate a random number
Poissrnd(ρ1 ) from a Poisson distribution with the mean of ρ1 × n. If the generated
number is greater than n, perform it again. Then randomly generate a Boolean vector
with Poissrnd(ρ1 ) 1’s elements. We generate a user’s role assignment in a similar
way, except replacing ρ1 with ρ2 . To reflect real data sets, we also add in noise by
flipping the values for a portion of cells. noise is the noise percentage parameter. For
convenience of reference, we list all notations in Table 1. In which, the parameter
limit is used in Algorithm 2 to control the maximum number of iterations.
The first experiment is to study the usage RMP. We compare the Loc & IterX
algorithm proposed in [19] with our greedy heuristic and our simulated annealing
(SA) heuristic with limit of 500. Three algorithms are compared with respect to the
reconstruction error ratio, which is defined as the following.
Error Ratio = UPA − UPA 1 /UPA1 ,
where UPA denotes the reconstructed UPA.
The concrete experimental design is as follows. (i) Generate a number of data sets
{UA, PA, UPA} with m = 50, n = 50, ρ1 = 0.3, ρ2 = 0.3 and noise varying from
0 to 0.2; (ii) Run the Loc & IterX, the greedy algorithm, and the SA algorithm with
22
H. Lu et al. / An optimization framework for role mining
Fig. 1. Experimental results on usage RMP. (a) ρ1 = 0.3, ρ2 = 0.3, m = 50, n = 50, r = 10,
limit = 500. (b) ρ1 = 0.3, ρ2 = 0.3, m = 1, n = 50, r = 30, noise = 0.2. (c) ρ1 = 0.3, ρ2 = 0.3,
m = 1, n = 50, r = 30, noise = 0.2. (Colors are visible in the online version of the article; http://
dx.doi.org/10.3233/JCS-130484.)
limit of 500 respectively to obtain UA and UPA given PA and UPA. The experiment result is illustrated as shown in Fig. 1(a). It shows the greedy algorithm and
the SA algorithm are significantly better than the Loc & IterX algorithm. The other
observation is that the performances of the greedy heuristic and the SA heuristic are
comparable.
Unlike the greedy heuristic, which returns a local optimum in most cases, a SA
heuristic can with probability one reach a global optimum given enough computing
time; in other words unlimited number of iterations. However, the meaning of existence for a heuristic is that it can get a good solution in an acceptable time. So
we conduct another experiment to investigate how our SA heuristic performs with
respect to limit. We generate 100 sets of {UA, PA, UPA} with ρ1 = 0.3, ρ2 = 0.3,
m = 1, n = 50, r = 30 and noise = 0.2. limit varies from 50 to 2500. Run both
the greedy heuristic and the SA heuristic against them. Figure 1(b) shows that the
SA heuristic has better performance than the greedy heuristic when limit is greater
than 500. It also shows at the beginning small increase in limit can improve the performance by a lot for the SA heuristic. However, when limit reaches certain point,
the improvement becomes much less significant. It somehow shows that the greedy
heuristic has relatively satisfactory performance. Figure 1(c) plots computing times
for both methods. The computing time for the SA heuristic increases linearly with
limit. However, to achieve the same performance as the greedy heuristic, the SA
heuristic takes more time. So we conclude that the SA heuristic is recommended for
small-size problems while the greedy heuristic is suitable for large-size problems.
Next we study our greedy heuristic for basic RMP. The heuristic is closely dependent on candidate roles. Three ways of generating candidate roles were introduced:
Itself, Intersection and Association. As Association has a parameter of association
threshold τ , for a fair comparison, we consider two cases, τ = 0.9 and τ = 0.7. We
generate {UPA} with ρ1 = 0.3, ρ2 = 0.3, m = 50, n = 50, r = 10, noise = 0.
Then run Algorithm 3 with each candidate role set respectively to find UA and PA.
The results are illustrated as shown in Fig. 2(a). Both Intersection and Itself are able
to reconstruct UPA completely, but Association. With respect to the error reducing
H. Lu et al. / An optimization framework for role mining
23
Fig. 2. Experimental results on basic RMP and δ-approximate RMP. (a) ρ1 , ρ2 = 0.3, m, n = 50, r = 10,
noise = 0. (b) ρ1 , ρ2 = 0.3, m = 1, n = 50, r = 30, noise = 0.2. (c) ρ1 , ρ2 = 0.3, m, n = 50,
r = 10, noise = 0. (d) m, n = 50, r = 10, noise = 0. (e) ρ1 , ρ2 = 0.3, r = 10, noise = 0. (Colors are
visible in the online version of the article; http://dx.doi.org/10.3233/JCS-130484.)
speed, Association is inferior to two other approaches. Overall Intersection has the
best performance, while the performance of Association is far from satisfactory. So
ignoring Association, we then further study Itself and Intersection with respect to
computing time. Figure 2(b) shows that Intersection takes much more time than Itself. The underlying reason is that Intersection produces O(m2 ) candidate roles as
opposed to O(n) candidate roles produced by Itself. We conclude that for small size
problems Intersection is preferred while Itself is recommended for large size problems.
Our greedy heuristic for δ-approximate RMP is same as that for basic RMP, except
that it terminates early. Therefore, the previous experimental results are valid for
studying δ-approximate RMP. Figure 2(c) plots reconstruction error ratio values with
respect to required number of roles. If a small amount of errors are allowed, UPA
can be successfully reconstructed with much less number of roles. For example,
Itself needs only 10 roles to cover more than 80 percent of existing permissions,
while 20 extra roles are needed to cover the remaining permissions. Somehow it can
be interpreted that roles returned by δ-approximate RMP are more “fundamental”.
It suggests that if a bottom-up role engineering approach is only used to identify
promising roles to assist a top-down engineering approach, δ-approximate RMP is
more efficient. Figure 2(d) shows the relation between data density and the number
of required roles for full coverage. As the data density is indirectly determined by ρ1
and ρ2 . The experiment is to let ρ1 and ρ2 have the same value and vary them together
24
H. Lu et al. / An optimization framework for role mining
from 0.1 to 0.6. In Fig. 2(d), both lines suggest that for denser data sets more roles
are required. In other words, our greedy heuristic performs better for sparse data sets,
while real access control data sets are usually very sparse. We then study the relation
between data size and our greedy heuristic performance. We let m and n be same
and vary them from 40 to 100. Figure 2(e) illustrates that our heuristic performs good
for medium and small sizes.
Next we evaluate the Two-Stage algorithm proposed for edge RMP by comparing
it with the Edge-Key algorithm proposed in [36]. For convenience, we let default parameters for generating UPA be {m = 50, n = 50, ρ1 = 0.3, ρ2 = 0.3, noise = 0}.
We then vary one parameter each time, generating a set of UPAs, and run TwoStage and Edge-Key respectively on them. As both algorithms require input candidate roles, we consider both Itself and Intersection. Hence, we actually compare
four approaches: (Two-Stage, Itself ), (Two-Stage, Intersection), (Edge-Key, Itself )
and (Edge-Key, Intersection). Experimental results are illustrated as shown in Fig. 3.
All four graphs suggest that Two-Stage is better than Edge-Key.
We also compare our heuristics with the neural network algorithm and the genetic algorithm proposed in [32]. These two algorithms can only apply to the
δ-approximate RMP problem, as they do not guarantee returning an exact Boolean
matrix decomposition solution. So the comparison is only with respect to the
δ-approximate RMP problem.
Before we present the experiment results, we briefly introduce the two algorithms
to be compared with. The Neural Network algorithm is based on a Hopfield-like
neural network of n fully connected neurons, where n is corresponding to the number
of permissions in the role mining setting. The weights between neurons are learned
from the training dataset UPA based on the Hebbian rule. To reveal factors (roles),
Fig. 3. Reconstructed error ratio w.r.t. percentage of required roles for real data sets. (Colors are visible in
the online version of the article; http://dx.doi.org/10.3233/JCS-130484.)
H. Lu et al. / An optimization framework for role mining
25
a random pattern (a user’s assignment) is fed to the neural network and the resultant
attractor is a factor. The resultant attractors might be spurious. So as suggested by
[32], different random patterns with the number of 1’s elements ranging from 1 to
n are fed into the neural network. For each random pattern, attractors with different
number of 1’s elements are derived by adjusting the threshold function. At the end,
k attractors among all found attractors with the minimum Lyapunov function values
are picked as the final role set. The Genetic algorithm starts with an initial population
of solutions and lets them evolve with crossover and mutation operators towards
better solutions in terms of a fitness function. The initial population is randomly
generated. Our experiment will use the same algorithmic parameter values as in [32].
The experimental design is as follows. We randomly generate a set of Boolean
matrices UPA, UA and PA with the same procedure as described before. Then we
separately apply our heuristics, the neural network algorithm and the genetic algorithm to recover the original role set PA from the given UPA. We compare all the
discovered role sets in terms of the closeness to the real role set. The closeness between the original role set PAk×n and the derived role set PAk×n is defined as the
sum of the hamming distances between each real role and its closest derived role. It
is mathematically expressed as:
k i=1
PAij − PAtj .
min
1tk
(13)
j
To evaluate those algorithms, we conduct experiments on two sets of data. The
first set of {UA, PA, UPA} is generated by letting r = 10, ρ1 = 0.3, ρ2 = 0.3 and
noise = 0 while varying m and n from 30 to 90. The results are reported in Table 2,
where the best solutions are marked in bold. Among all algorithms, Intersection has
the best performance and discovers all the original roles for the four out of seven
cases. Both Intersection and Itself perform better than Neural Network and Genetic.
The experimental result demonstrates the ability of our algorithms in revealing the
underlying roles in the Boolean matrix of user-permission assignments. We did not
observe clear association between the algorithm performance and the data size. Intuitively, with the fixed number of roles (patterns), it would be relatively easy to
detect roles (patterns) for the large Boolean matrix (training data). However, as the
data size grows, the number of potential patterns grows exponentially, which would
Table 2
Comparison results w.r.t. closeness with varying data size and fixed number of roles
m, n
30
40
50
60
70
80
Intersection
0
0
0
220
10
240
0
Itself
Neural network
Genetic
0
0
20
100
60
50
0
50
30
220
220
260
230
190
210
240
260
322
0
120
230
90
26
H. Lu et al. / An optimization framework for role mining
Table 3
Comparison results w.r.t. closeness with fixed data size and varying number of roles
r
5
10
15
20
25
30
Intersection
100
160
30
260
350
480
Itself
Neural Network
Genetic
100
120
100
100
170
225
195
60
195
300
420
315
575
590
480
660
525
580
counter the positive effect brought by the growth of training data. The second set
of {UA, PA, UPA} is generated by letting m = 50, n = 50, ρ1 = 0.3, ρ2 = 0.3
and noise = 0 while varying r from 5 to 10. The result is reported in Table 3. Intersection has the best performance for the five out of six cases. When r being 10,
Itself performs the best. Both Intersection and Itself perform better than both Neural
Network and Genetic for all cases. This experimental result again demonstrates the
advantage of our heuristics in discovering roles. In Table 3, there is an approximate
trend showing the larger value of r leads the worse algorithm performance in terms
of closeness. One possible explanation is that when there are many underlying patterns more training data are required to discover them. In this experiment, since the
values of m and n are fixed, it becomes difficult for an algorithm to discover many
embedded roles.
The neural network approach for the Boolean matrix decomposition itself is very
interesting and provides a fresh perspective to the role mining problem. One possible explanation why it does not perform well in our experiment is that the Hopfield
neural network has limited ability in storing patterns. When there are many roles
to store, the derived Hopfield neural network may contain many spurious patterns
(attractors) and there is no effective way to distinguish real and spurious patterns.
However, the neural network algorithm might be very helpful for detecting noise
in a set of user-permission assignments with known roles. We plan to look into this
problem in the future. The genetic algorithm is a local-search heuristic, which is fundamentally same as the greedy heuristic and the simulated annealing heuristic used
in our approaches. One explanation why the genetic algorithm does not work well in
our setting is that there are too many algorithmic parameters in the genetic heuristic.
The genetic heuristic tries to discover the two decomposed Boolean matrices at the
same time. So given m users, n permission and k roles, there are m ∗ n ∗ p variables
to determine simultaneously. In that case, thousands iterations in a typical genetic
algorithm search process are not enough to discover a good solution. Differently, our
approach discovers a good set of roles first and then try to reconstruct role assignment for each user individually. It takes less computation and hence increases the
likelihood of finding a good solution in a limited amount of time.
6.2. Real data
We run our greedy heuristics on real data sets collected by Ene et al. [3]. They are
emea, healthcare, domino, firewall 1 and firewall 2.
H. Lu et al. / An optimization framework for role mining
27
Fig. 4. Reconstructed error ratio w.r.t. number of required roles for real data sets. (Colors are visible in
the online version of the article; http://dx.doi.org/10.3233/JCS-130484.)
The first experiment is to study the basic RMP. We run Algorithm 3 against each
data set with candidate roles generated by Itself and Intersection respectively. Numbers of roles required for full coverage are recorded in Fig. 4. The number of required roles is much less than the number of permissions for each case. It successfully demonstrates the power of RBAC and also the effectiveness of our heuristic
on discovering roles. Another observation is that the performance of Intersection is
not always better than Itself when working with our greedy heuristic. In theory, with
Intersection the optimal solution of a basic RMP is better than that with Itself, as
the feasible solution space is expanded. However, if the algorithm runs in a greedy
manner, it is not always true.
The second experiment is to study the δ-approximate RMP. We run Algorithm 3
and record remaining errors when a new role is identified. Figure 4 plots reconstruction error ratios with respect to number of required roles. The first observation is that
all lines drop fast at the beginning. It suggests that a few roles are usually able to reduce error ratio to a very low level. Roles early identified can hence be interpreted as
“more valuable”. Another observations is that much more roles are required to cover
the remaining few 1’s at the end. It suggests hat if a role mining approach is used
as a tool to assist a top-down role engineering approach, δ-approximate RMP might
be sufficient. The other important observation is that Intersection dose not always
perform better than Itself in our greedy heuristic. In fact, if only considering early
mined roles, their performances are comparable.
The third experiment is to study the edge RMP. We run (Two-Stage, Itself ), (TwoStage, Intersection), (Edge-Key, Itself ) and (Edge-Key, Intersection) respectively
against each data set. The results are reported in Table 4. There are three key ob-
28
Data set
emea
healthcare
domino
firewall 1
firewall 2
Users
35
46
79
365
325
Permissions
3,046
46
231
709
590
UPA1
7,220
1,486
730
31,951
36,428
Itself
Intersection
# of roles
Edge value
Edge-Key
Two-Stage
# of roles
Edge value
Edge-Key
Two-Stage
34
16
20
71
10
7,281
624
810
5,475
1,796
7,246
478
727
4,937
1,456
43
14
21
65
10
9,078
553
814
4,508
1,796
9,014
412
827
4,034
1,456
H. Lu et al. / An optimization framework for role mining
Table 4
Number of roles and edge value for full coverage for real data sets
H. Lu et al. / An optimization framework for role mining
29
servations. The first observation is that Intersection performs better than or equally
to Itself for four out of the five data sets in terms of the number of required roles
for complete coverage. This is to be expected since Intersection has a larger pool of
candidate roles and runs in a greedy fashion. Therefore, more 1’s elements would be
covered at each step than what would be covered by Itself. Thus, intuitively Intersection returns fewer roles for full coverage. However, it is not always true, as shown
by emea, which is an exceptional case. Here, Intersection returns 43 roles, which is
larger than the 34 roles returned by Itself. One reason for this is that both Itself and
Intersection are local-search algorithms and do not guarantee global optimum solutions. Therefore, better initial steps may lead to more iterations and more roles in the
end. However, the experience teaches us that when computing time is ample, an ensemble approach consisting of several different heuristics should be considered. The
second observation is that the fewer roles always are associated with the less edge
value. The reason for this is that the edge value consists of |UA| and |PA|. When the
roles are fewer, the value of |PA| would likely be less. Since the number of roles is
less, the size of user-role assignments UA would likely be less as well. Note that this
is just an intuition and might not always be true, although there is no counterexample
here. The third observation is that Two-Stage returns less edge values than Edge-Key
for all cases except for the domino dataset when Intersection is used. One explanation is that Two-Stage has two steps of reducing the edge value (i.e. the first step is to
find PA with small size and the second step is to find UA with small size by calling
the role usage problem), while Edge-Key finds UA and PA simultaneously. Therefore, intuitively Two-Stage would have a better chance to find a RBAC solution with
smaller edge value.
7. Conclusions and future work
This paper studies the Boolean matrix decomposition problem with more attention
on its application on the role mining problem. Four BMD variants with pragmatic
implications in real application domains are explicitly presented. The main technical
contribution of this paper is two-fold: First, it builds a unified framework for a variety
of BMD variants by formulating them through ILP; Second, it presents efficient and
effective heuristics for BMD variants, which are validated by extensive experiments.
There are still several research issues that need to be explored. First, the performance of our heuristics largely depends on candidate role sets. To obtain a better
candidate role set, it would be beneficial to incorporate expert knowledge on the
semantics of internal business processes and narrow down the pool of promising
candidate roles. Second, for large-scale problems, computation time is always an issue, even for heuristics. A data structure specialized for such computation may help
reduce the burden. Third, many different constraints, which are not considered in
this paper, may exist in practice. How to incorporate them in the heuristic designing process is worth investigating further. We plan to look at these problems in the
future.
30
H. Lu et al. / An optimization framework for role mining
References
[1] A. Colantonio, R. Di Pietro, A. Ocello and N.V. Verde, Taming role mining complexity in RBAC,
Computers & Security 29 (2010), 548–564. (Special Issue: Challenges for Security, Privacy &
Trust.)
[2] E. Coyne, Role engineering, in: Proceedings of the First ACM Symposium on Access Control Models
and Technologies, 1995.
[3] A. Ene, W. Horne, N. Milosavljevic, P. Rao, R. Schreiber and R.E. Tarjan, Fast exact and heuristic
methods for role minimization problems, in: SACMAT’08: Proceedings of the 13th ACM Symposium
on Access Control Models and Technologies, ACM, New York, NY, USA, 2008, pp. 1–10.
[4] M. Frank, D. Basin and J.M. Buhmann, A class of probabilistic models for role engineering, in:
Proceedings of the 15th ACM Conference on Computer and Communications Security, 2008.
[5] M. Frank, A.P. Streich, D.A. Basin and J.M. Buhmann, A probabilistic approach to hybrid role
mining, in: ACM Conference on Computer and Communications Security, 2009, pp. 101–111.
[6] M.P. Gallagher, A. O’Connor and B. Kropp, The economic impact of role-based access control,
Planning Report 02-1, National Institute of Standards and Technology, 2002.
[7] F. Geerts, B. Goethals and T. Mielikainen, Tiling databases, in: Discovery Science, Springer, 2004,
pp. 278–289.
[8] G. Golub and C.V. Loan, Matrix Computations, 3rd edn, The Johns Hopkins Univ. Press, Baltimore,
MD, USA, 1996.
[9] D. Hochbaum, Approximation algorithms for the set covering and vertex cover problems, SIAM
Journal on Computing 11(3) (1982), 555–556.
[10] M. Koyutürk and A. Grama, Proximus: a framework for analyzing very high dimensional discreteattributed datasets, in: KDD’03: Proceedings of the Ninth ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, 2003, pp. 147–156.
[11] M. Kuhlmann, D. Shohat and G. Schimpf, Role mining – revealing business roles for security administration using data mining technology, in: SACMAT ’03: Proceedings of the Eighth ACM Symposium on Access Control Models and Technologies, ACM, New York, NY, USA, 2003, pp. 179–186.
[12] K. Kuhn, The Hungarian method for the assignment problem, Navel Research Logistics Quarterly
2 (1955), 83–97.
[13] A.H. Land and A.G. Doig, An automatic method of solving discrete programming problems, Econometrica 28 (1960), 497–520.
[14] L. Lovasz, On the ratio of optimal integral and fractional covers, Discrete Mathematics 13 (1975),
383–390.
[15] H. Lu, J. Vaidya and V. Atluri, Optimal Boolean matrix decomposition: Application to role engineering, in: ICDE’08: Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, IEEE Computer Society, Washington, DC, USA, 2008, pp. 297–306.
[16] H. Lu, J. Vaidya, V. Atluri and Y. Hong, Extended Boolean matrix decomposition, in: ICDM, 2009,
pp. 317–326.
[17] H. Lu, J. Vaidya, V. Atluri and Y. Hong, Constraint-aware role mining via extended Boolean matrix
decomposition, IEEE Trans. Dependable Sec. Comput. 9(5) (2012), 655–669.
[18] H. Lu, J. Vaidya, V. Atluri, H. Shin and L. Jiang, Weighted rank-one binary matrix factorization, in:
SDM, 2011, pp. 283–294.
[19] P. Miettinen, The Boolean column and column-row matrix decompositions, Data Min. Knowl. Discov. 17(1) (2008), 39–56.
[20] P. Miettinen, On the positive–negative partial set cover problem, Inf. Process. Lett. 108(4) (2008),
219–221.
[21] P. Miettinen, T. Mielikainen, A. Gionis, G. Das and H. Mannila, The discrete basis problem, in:
Knowledge Discovery in Databases: PKDD 2006 – 10th European Conference on Principles and
Practice of Knowledge Discovery in Databases, Springer, 2006, pp. 335–346.
H. Lu et al. / An optimization framework for role mining
31
[22] I. Molloy, H. Chen, T. Li, Q. Wang, N. Li, E. Bertino, S.B. Calo and J. Lobo, Mining roles with
semantic meanings, in: SACMAT, 2008, pp. 21–30.
[23] I. Molloy, N. Li, T. Li, Z. Mao, Q. Wang and J. Lobo, Evaluating role mining algorithms, in:
SACMAT, 2009, pp. 95–104.
[24] I. Molloy, N. Li, Y.A. Qi, J. Lobo and L. Dickens, Mining roles with noisy data, in: Proceedings of
the 15th ACM Symposium on Access Control Models and Technologies, SACMAT’10, ACM, New
York, NY, USA, 2010, pp. 45–54.
[25] G. Neumann and M. Strembeck, A scenario-driven role engineering process for functional RBAC
roles, in: SACMAT’02: Proceedings of the Seventh ACM Symposium on Access Control Models and
Technologies, ACM, New York, NY, USA, 2002, pp. 33–42.
[26] M. Pauli, M. Taneli, G. Aristides, D. Gautam and M. Heikki, The discrete basis problem, IEEE
Transactions on Knowledge and Data Engineering 20(10) (2008), 1348–1362.
[27] J.B. Peter, P. Green, D. Higdon and K. Mengersen, Bayesian computation and stochastic systems,
Statistical Science 10(1) (1995), 3–41.
[28] H. Roeckle, G. Schimpf and R. Weidinger, Process-oriented approach for role-finding to implement
role-based security administration in a large industrial organization, in: RBAC’00: Proceedings of
the Fifth ACM Workshop on Role-Based Access Control, ACM, New York, NY, USA, 2000, pp. 103–
110.
[29] S.M. Ross, Simulation (Statistical Modeling and Decision Science), 3rd edn, Academic Press, 2002.
[30] J. Schlegelmilch and U. Steffens, Role mining with ORCA, in: SACMAT’05: Proceedings of the
Tenth ACM Symposium on Access Control Models and Technologies, ACM, New York, NY, USA,
2005, pp. 168–176.
[31] B.-H. Shen, S. Ji and J. Ye, Mining discrete patterns via binary matrix factorization, in: KDD’09:
Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, ACM, New York, NY, USA, 2009, pp. 757–766.
[32] V. Snasel, J. Platos, P. Kromer, D. Husek, R. Neruda and A.A. Frolov, Investigating Boolean matrix
factorization, in: Proceeding of the Data Mining Using Matrices and Tensors Workshop, DMMT’08,
2008.
[33] A.P. Streich, M. Frank, D.A. Basin and J.M. Buhmann, Multi-assignment clustering for Boolean
data, in: ICML, 2009, p. 122.
[34] J. Vaidya, V. Atluri and Q. Guo, The role mining problem: finding a minimal descriptive set of
roles, in: SACMAT’07: Proceedings of the 12th ACM Symposium on Access Control Models and
Technologies, ACM, New York, NY, USA, 2007, pp. 175–184.
[35] J. Vaidya, V. Atluri, Q. Guo and N. Adam, Migrating to optimal RBAC with minimal perturbation,
in: SACMAT’08: Proceedings of the 13th ACM Symposium on Access Control Models and Technologies, ACM, New York, NY, USA, 2008, pp. 11–20.
[36] J. Vaidya, V. Atluri, Q. Guo and H. Lu, Edge-RMP: Minimizing administrative assignments for
role-based access control, J. Comput. Secur. 17(2) (2009), 211–235.
[37] J. Vaidya, V. Atluri and J. Warner, Roleminer: mining roles using subset enumeration, in: CCS’06:
Proceedings of the 13th ACM Conference on Computer and Communications Security, ACM, New
York, NY, USA, 2006, pp. 144–153.
[38] D. Zhang, K. Ramamohanarao and T. Ebringer, Role engineering using graph optimisation, in: Proceedings of the 12th ACM Symposium on Access Control Models and Technologies, SACMAT’07,
ACM, New York, NY, USA, 2007, pp. 139–144.