Journal of Computer Security 22 (2014) 1–31 DOI 10.3233/JCS-130484 IOS Press 1 An optimization framework for role mining Haibing Lu a,∗ , Jaideep Vaidya b and Vijayalakshmi Atluri b a OMIS, Santa Clara University, Santa Clara, CA, USA E-mail: hlu@scu.edu b MSIS, Rutgers University, Newark, NJ, USA E-mails: {jsvaidya, atluri}@cimic.rutgers.edu Role Based Access Control (RBAC) is accepted as the de facto access control model for organizations of all sizes. However, engineering the right set of roles is crucial to enable the correct deployment of RBAC within an organization. Indeed, discovering an optimal and correct set of roles from existing permission assignments, referred to as the role mining problem (RMP), has gained significant attention in recent years. Role Mining is itself an instantiation of Boolean matrix decomposition – wherein a Boolean matrix is decomposed into two Boolean matrices giving a set of basis vectors and their appropriate combination. In fact, such decompositions are useful in a number of application domains beyond role engineering, including text mining as well as knowledge discovery. While a Boolean matrix can be decomposed in many ways, however, certain decompositions better characterize the semantics associated with the original matrix in a succinct but comprehensive way. Indeed, one can find different decompositions that are optimal with respect to different criteria that may match various semantics. In this paper, we first present a number of variants of the optimal Boolean matrix decomposition problem, including usage RMP, basic RMP, δ-approximate RMP, and edge RMP, that have pragmatic implications in the context of role mining. We then present a unified framework for modeling the optimal Boolean matrix decomposition and its variants using integer linear programming (ILP). Such modeling allows us to directly adopt the huge body of heuristic solutions and tools developed for integer linear programming. We also develop efficient heuristics and solutions for each RMP variant, and validate them by a comprehensive experimental evaluation. Keywords: Role based access control, role mining, Boolean matrix decomposition, integer linear programming 1. Introduction Role based Access Control (RBAC) is widely considered as the de-facto standard used for advanced access control. Unlike traditional access control wherein permissions are directly assigned to users, in RBAC permissions are assigned to roles and in turn roles are assigned to users. The main benefit of RBAC is to reduce the load of administration, since the number of required roles is often orders of magnitude lower than the number of users/permissions. However, to realize the full potential of RBAC, it is necessary to define the complete and correct set of roles. This process of defining roles and user-role assignments for access control is known as Role en* Corresponding author. E-mail: hlu@scu.edu. 0926-227X/14/$27.50 © 2014 – IOS Press and the authors. All rights reserved 2 H. Lu et al. / An optimization framework for role mining gineering [2], and has been identified as the costliest component in realizing RBAC by National Institute of Standard and Technology (NIST) [6]. Recently, automated techniques have been developed for bottom-up role discovery and assignment. This process of identifying the roles from the existing user-to-permission assignments is referred to as the role mining problem (RMP) [34], and has been the focus of significant research. Note that user-to-permission assignments can be formulated as a Boolean matrix, with 1 at entry (i, j) denoting user i is assigned permission j. Similarly user-to-role assignments and role-to-permission mappings can also be formulated as Boolean matrices. Thus, Role Mining can be viewed as an instantiation of Boolean matrix decomposition (BMD). In BMD, a Boolean matrix Am×n is decomposed into two Boolean matrices Bm×k and Ck×n . However, instead of ordinary matrix product, Am×n is decomposed as a Boolean matrix product of Bm×k and Ck×n . Now, the rows of Ck×n can be interpreted as a set of k basic concepts and Bm×k as the combination matrix, which describes each row of Am×n as a subset of basic concepts. In the context of role mining, the concepts correspond to roles, and the combinations correspond to user assignments. Indeed, such decomposition solutions are useful in a number of other application domains including text mining [7], knowledge discovery in databases [21], sampling, compression and clustering [10,31]. Since BMD can be viewed as a generalization of RMP, we first discuss its characteristics in the general sense, before focusing on role mining. The goal of the BMD problem is to find a Boolean matrix decomposition solution for an input Boolean matrix. While, there exist many decompositions of a Boolean matrix, certain decompositions better characterize the semantics associated with the original matrix. Indeed, one can find decomposition solutions that are optimal with respect to different criteria that match various semantics. Based on this, we can identify the following four important variants of the BMD problem: • Find a BMD solution that minimizes the number of basic describing concepts (which corresponds to the number of rows in the decomposed basis matrix). • Find a BMD solution that minimizes the number of basic describing concepts, while allowing a certain amount of inexactness in the decomposition. (In other words, by trading off certain amount of accuracy, the goal is to express the input data in a much more succinct form.) • Find a BMD solution that minimizes the density of the decomposed matrices, where density corresponds to the number of elements with the value of 1 (density can be viewed as an indicator of problem complexity). • Given one of the decomposed matrices, find the other one. (This can be interpreted as the problem of describing individual records by combinations of the basic concepts. In effect, the fourth variant is a subproblem of the other three variants – however, a specific interesting case.) In this paper, we study all four BMD problem variants. To provide more meaningful discussions and results, all problems are studied in the context of role engineering. We consider several variants of RMP, including basic RMP, δ-approximate H. Lu et al. / An optimization framework for role mining 3 RMP, edge RMP, and usage RMP, that map to the above four decomposition criteria for BMD. The solution to basic RMP gives an optimal set of roles that fully covers the existing permissions of users. Minimizing the number of required roles would help fully exploit the benefits of RBAC. Thus, basic RMP is the primary RMP variant. Instead of complete coverage on existing permissions required by basic RMP, δ-approximate RMP allows up to δ errors in the reconstruction. There are two benefits to this. First, collected existing user-to-permission assignments may contain noise. Enforcing complete coverage in the role mining process would then produce biased roles. By allowing a small number of reconstruction errors, an intelligent algorithm might be able to omit matching the noise. Second, it usually takes much more roles for full coverage than the number required to cover the majority of the permissions. Thus, if a small set of roles covers most permissions, those roles can be interpreted as being more fundamental (or real). Finally, when the role mining process only aims to find promising candidate roles instead of finalizing roles, δ-approximate RMP would again be preferred. Apart from the number of roles, administration load is another way of evaluating the goodness of a role set, which can be formulated as the edge-RMP problem. Edge-RMP does not necessarily minimize the number of roles but gives optimal sum of user-to-role and role-to-permission assignments. In reality some role sets may be available from prior experience. The system engineer needs to investigate how well they fit existing user-to-permission assignments. This leads to the usage RMP problem. All of the four variants have significance in solving the role engineering problem and help the security administrators to pick and choose the roles that perfectly suit the organizational needs. The Boolean matrix decomposition problems can also be mapped to other areas, since data is Boolean in nature in many other applications as well, and can be modeled as a high-dimensional Boolean matrix. For example, market basket data (whether a transaction includes a product item), text document data (whether a document includes a keyword), and movie reviews (whether a user likes a movie) can all be modeled as a Boolean matrix. Therefore, BMD is useful in these applications as well. For example, consider text mining. Assume the m × n Boolean matrix A represents the documents-words mapping, with rows corresponding to documents, columns corresponding to words, and the value of 1 indicating the corresponding document includes the corresponding word. Now, the BMD variant minimizing the number of concepts would give two decomposed Boolean matrices with much smaller sizes. The basis matrix can be interpreted as topics, each of which consists of a subset of words. The combination matrix can be used to describe what topics are included in each document. So it successfully extracts “topics” from numbers without domain knowledge and gives a succinct representation of the original data with those topics. The δ-approx BMD variant is also applicable in this context, since it can find a more succinct representation of the original data, though at the cost of not being able to exactly recreate the original matrix. BMD variants studied in this paper can find many applications in the data mining and machine learning fields. 4 H. Lu et al. / An optimization framework for role mining In reality, in addition to the four major BMD variants we introduced, there could be many other variants with different constraints and objectives. In light of this, we present a unified integer linear programming framework for formulating all BMD variants. Such a framework allows us to easily formulate any new BMD variant based on a prior BMD variant that has already been studied. Additionally, it enables us to directly adopt the huge body of heuristic solutions and tools developed for integer linear programming. Furthermore, approximation algorithms might be designed from their linear programming relaxation. However, since Boolean data problems are usually NP-hard, to solve large-scale problems, specifically designed efficient heuristics are required. In this paper, we propose efficient heuristics for each of the four presented BMD variants. Experimental results show that our algorithms have very good performance. The rest of the paper is organized as follows. Section 2 reviews related work in the literature on role mining and Boolean matrix decomposition. Section 3 presents preliminary background, defines all RMP variants, and discusses their computational complexity. Section 4 presents ILP formulations for all RMP variants. In Section 5, we provide efficient heuristics for solving these RMP variants. Section 6 presents a comprehensive experimental evaluation using real data. Finally, Section 7 concludes the paper and discusses future work. 2. Related work The role engineering problem was first proposed by Coyne [2]. Since then, it has received much attention from the access control community. The role engineering approaches generally include top-down and bottom-up. Top-down approaches [25, 28] mine roles by investigating semantics of business processes and user responsibilities, which is impossible for large-scale problems. Bottom-up approaches mine roles purely from existing user-to-permission assignments, which make it possible to automatically discover promising roles. During the past years several algorithms explicitly designed for bottom-up role engineering have been proposed [3,4,15,34–36, 38]. Kuhlmann et al. [11] present a clustering technique similar to the well known k-means clustering, which requires pre-defining the number of clusters. In [30], Schlegelmilch and Steffens propose an agglomerative clustering based approach to role mining (called ORCA). Vaidya et al. [37] propose an approach based on subset enumeration, called RoleMiner. Molloy et al. [24] propose to clean noisy data before executing role mining. Streich et al. [33] present a multi-assignment clustering approach for Boolean data. It is a probabilistic approach and can be applied to the role mining problem. Some role mining approaches like [5] suggest to discover roles by taking into account of available business information on permissions and users. Several criteria or metrics have been used to evaluate the goodness of a candidate role set in the literature. In [37], Vaidya et al. suggest three criteria. The first one is known as the basic role mining problem, which is to minimize the number of H. Lu et al. / An optimization framework for role mining 5 roles required to cover the whole existing permission assignments. The second one relaxes the first one by allowing certain amount of errors. The third one is given the number of roles to minimize the number of roles. In [37], Vaidya et al. introduce another criteria, which is to minimize the administrative cost of the resultant RBAC system. In [22], Molloy et al. propose a notion called weighted structural complexity that combines some of the above metrics and assigns them some weights. In [15], Lu et al. formulate some role mining variants through mixed integer program, which greatly simplifies the role mining work. Several works including [4], [37] and [15] allow errors in the role mining solutions. In those works, different error types, overassignments and under-assignments are treated equally. Some recent works such as [23] point out that it is safer to under assign permissions than over assign permissions. It is now well known that the role mining problem is NP-hard [34] and that the solutions to many other well-known NP-hard problems, such as graph vertex coloring, clique partition, binary matrix factorization, bi-clustering can be used to develop a good solution [1,3]. A good overview of different role mining solutions and their evaluation can be found in [23]. As stated before, role mining can be viewed as an instantiation of Boolean matrix decomposition. Traditional matrix factorization techniques such as the Singular Value Decomposition (SVD) [8] employ the ordinary matrix product with addition and multiplication operations. However, the ordinary matrix product is not able to model logical set operations such as set union. On the contrary, Boolean matrix product can perfectly formulate the set union operation. That is why BMD has received increasing attention recently. Miettinen et al. [19,21] employ BMD to formulate the discrete basis problem. The goal of the discrete basis problem is to find a minimum subcollection from a given collection of element subsets such that every element subset can be described as the union of some subsets in that subcollection. It has important implications in knowledge discovery in databases. The tiling database problems [7] can be formulated as a BMD problem variant as well. Vaidya et al. [34] prove the equivalence of the RMP, the discrete database problem, and the tiling database problem. One-dimensional BMD is studied by Lu et al. in [18]. In [16,17], an extended Boolean matrix decomposition model is presented, which enables the formulation of negative permission assignments and can summarize a Boolean dataset in a more succinct fashion. This paper is an extension of our prior work [15]. However, the majority of the article is new. First, we introduce and investigate a new RMP variant, the usage RMP problem. In addition to being an independent problem that is quite useful in real life situations, the usage RMP problem can be viewed as a subproblem of other RMP variants, which provides us with a new perspective to study other RMP variants. We also provide an integer linear programming formulation and fast heuristics for the usage RMP problem, which have not been developed before. Additionally, building upon the solutions to the usage RMP problem, we develop effective and efficient algorithms for the other RMP variants, like edge-RMP. 6 H. Lu et al. / An optimization framework for role mining Our approach has several benefits. As discussed before, the basic role mining problem aims to decompose a Boolean matrix into two Boolean matrices. If the size of the input Boolean matrix is m by n, then the solution space is of size 2m∗k+n∗k , where k denotes the number of roles. Conventional role mining approaches try to discover both decomposed matrices at the same time. They do this by picking an initial solution from the huge solution pool and using some heuristic to find a local optimum. This has two drawbacks. First, given the enormous size of the solution pool, it is difficult to find a good initial solution, which also makes it unlikely to reach a good local optimum. Second, the procedures are computationally very expensive, since many iterations may be necessary to reach the local optimum, and every step requires updating both matrices. Instead, our approach in this paper is to utilize the divide-and-conquer strategy to split the role mining problem into two sub-problems: (i) generate one decomposed Boolean matrix (role-permission) and (ii) then determine the other one, by formulating and solving the usage role mining problem. By doing so, the algorithm randomness is significantly reduced, as we only need to sample a significantly smaller Boolean matrix. Furthermore, practical knowledge of the likely structure of roles can be utilized to pick the initial roles. Both of these increase the chance of finding a good initial solution. After solving the corresponding usage role mining problem, probabilistically we may obtain a better decomposition solution. Also, since the whole procedure is less computationally expensive than the conventional role mining approaches, we can repeat the procedure with different initial solutions, which again increase the chancing of finding a better final solution. Effectively, this significantly reduces the computational complexity and also increases the chances of obtaining a better solution. All of the algorithmic solutions are completely new and comprehensive evaluations of the solutions are conducted on both synthetic and real data. Overall, we extend the work in [15] by introducing a new problem, new algorithmic solution techniques, and comprehensive experimental evaluation. Thus, the article provides significant technical content that is of additional value well beyond the work in [15]. 3. Boolean matrix decomposition and role mining We start out by defining Boolean matrix multiplication, and Boolean matrix decomposition, and then briefly overview Role Based Access Control nomenclature before actually discussing the role mining problem variants. Definition 1 (Boolean matrix multiplication [34]). A Boolean matrix multiplication between Boolean matrices A ∈ {0, 1}m×k and B ∈ {0, 1}k×n is A ⊗ B = C where C is in space {0, 1}m×n and Cij = k l=1 (Ail ∧ Blj ). H. Lu et al. / An optimization framework for role mining 7 Boolean matrix multiplication thus essentially uses the logical set union operation to combine separate elements. Boolean matrix decomposition looks for a matrix decomposition that satisfies Boolean multiplication. Definition 2 (Boolean Matrix Decomposition (BMD)). If A ⊗ B = C, where A, B and C are Boolean matrices, A ⊗ B is called a BMD solution of C. Before discussing the role mining problems, we quickly review the basic nomenclature in Role Based Access Control. We adopt the NIST standard of the role based access control model [6]. Note that in this paper we do not consider separate sessions, i.e., all accesses are assumed to belong to a single session. Definition 3 (RBAC0 ). • • • • U , R, P and S denote users, roles, permissions and sessions, respectively; PA ⊆ P × R, a many-to-many permission-to-role assignment relation; UA ⊆ U × R, a many-to-many user-to-role assignment relation; user : S → U , a function mapping each session si to the single user user(si ) (constant for the session’s lifetime); • roles : S → 2R , a function mapping each session si to a set of roles roles(si ) ⊆ {r|(user(si ), r) ∈ UA} (which can change with time) and session si has the permissions Ur ∈ roles(si ){p|(p, r) ∈ PA}. In the role mining problem, the input is essentially the user-to-permission assignments UPA, and the goal is to find a good set of roles PA and appropriate user-to-role assignments UA. Since all users should get the exact same permission sets after role assignment as they originally had, the relation UPA = UA ⊗ PA holds. Mathematically, the role mining problem can be viewed as looking for a BMD solution of UPA. Given UPA, there are many possible BMD solutions. However, only those optimizing a specific criterion are desired. One common criteria is the number of roles. In the BMD terminology, it corresponds to the number of rows in the decomposed Boolean matrix PA. As mentioned earlier, the main reason why RBAC is preferred to traditional access control mechanism is that the required roles in RBAC are usually much less than permissions. As a result, granting roles instead of permissions to users can significantly reduce administrative cost. Thus minimizing the number of roles can help reveal the full potential of adopting RBAC. The basic RMP problem is formulated for this purpose. It was first introduced by Vaidya et al. [34]. Its formal definition is as follows: Problem 1 (Basic RMP [34]). Given user-to-permission assignments UPA, find user-to-role assignments UA, and role-to-permission assignments PA, such that UA ⊗ PA = UPA and the number of roles is minimized. 8 H. Lu et al. / An optimization framework for role mining In some cases, it is not necessary to enforce the exact equivalence of UA ⊗ PA and UPA. One illustrative case is that the given UPA contains noise, because of the transfer of employees during the RBAC system implementation process. The exact equivalence will cause the resulting solution to overfit the noise. The other case is that some hybrid role mining approaches only employ bottom-up role mining approaches to find interesting or potentially valuable roles before finalizing the role set. Relaxing the constraint of UA ⊗ PA = UPA would certainly help find more interesting roles. Such relaxation leads to another RMP variant, called δ-approximate RMP. Its definition is as follows: Problem 2 (δ-approximate RMP [34]). Given user-to-permission assignments UPA, and a threshold δ, find user-to-role assignments UA, and role-to-permission assignments PA, such that UA ⊗ PA − UPA1 δ and the number of roles is minimized. The · 1 operator is defined as the following. Definition 4. X1 = ij |xij |. The number of employed roles is not the only way of evaluating the goodness of a set of roles. Strictly speaking, the administrative cost of a RBAC system can also be measured by the total number of elements in UA and PA with the value of 1 (thus representing the administrative burden). To record a role in a RBAC system, only its contained permissions need to be maintained. Similarly, only granted roles need to be maintained for a user assignment. This objective leads to another RMP variant, called the edge RMP. Problem 3 (Edge RMP [36]). Given user-to-permission assignments UPA, and a threshold δ, find user-to-role assignments UA, and role-to-permission assignments PA, such that UA ⊗ PA = UPA and UA1 + PA1 is minimized. In addition to three above role mining variants, we present a new variant, called the usage RMP. Consider the following scenario. A company desires to migrate from the traditional access control system to RBAC. A consultant is hired to do this – the consultant may have several past experiences in doing such work. The convenient way to deal with the role mining problem is to borrow successful experiences from a peer company, and to take the role set employed by the peer and check if the role set fits the companies’ purpose. In the BMD terminology, assuming you are also given PA, find the UA to reconstruct the existing UPA. This is formally defined as follows. Problem 4 (Usage RMP). Given user-to-permission assignments UPA and role-topermission assignments PA, find user-to-role assignments UA, such that UA ⊗ PA − UPA1 is minimized. H. Lu et al. / An optimization framework for role mining 9 The usage RMP has two implications. First, when the optimal objective value of min UPA − UA ⊗ PA1 is very small, a satisfactory role set can be obtained by modifying such obtained PA to a small extent. In fact, the problem of migrating to desired RBAC with minimal perturbation was studied in [35]. We can simply adopt their solution to accomplish our goal. In this sense, the usage RMP serves as a subproblem of the other role mining variants. For instance, consider the δ-approximate RMP problem. This can be broken down into two steps: finding a promising role set and then checking if the role set can reconstruct UPA within the acceptable error amount – now the second sub-problem corresponds to the usage RMP. 3.1. Complexity analysis We now discuss the computational complexity of the above discussed BMD/RMP problems. The basic RMP can be viewed as a special case of the δ-approx RMP with δ of 0. In [34], the decision version of the basic RMP is proven to be NP-complete. So is the decision version of the δ-approx RMP. The difference between the basic RMP problem and the edge RMP problems is only in terms of the objective. The decision version of the edge RMP is also proven to be NP-complete in [36]. As all of the RMP variants are optimization problems, we have the following statement. Statement 1. The basic RMP, the δ-approx RMP, and the edge RMP are all NPhard. Theorem 1. The role usage problem is NP-hard and it is even NP-hard to approx1−ε imate the role usage problem within a factor of Ω(2log P ), where P is the least number of permissions that a user has in UPA. Proof. First, we prove the hardness of the role usage problem. An instance of the decision role usage problem can be represented by {UPAm×n , PAk×n , δ }, where δ is a non-negative integer. Given a solution UAm×n , it takes polynomial time to check whether UPA − UA ⊗ PA1 δ . So the role usage problem belongs to NP. Furthermore, the basis usage problem [19], studied in the data mining field, can be mapped to the role usage problem as follows. The basis usage problem aims to find a binary matrix Xk×n such that Am×n − Cm×k ⊗ Xk×n 1 is minimized, where Am×n and Cm×k are given and · 1 denotes the L-1 norm. For each instance {Am×n , Cm×k , δ} of the decision basis usage problem, we can find a corresponding equivalent instance {UPA, PA, δ } of the decision role usage problem with the following mapping: UPA = A, PA = C, and δ = δ. Thus, there exists a solution Xk×n such that A − C ⊗ X1 δ if and only if there exists a solution UA such that UPA − UA ⊗ PA1 δ. Since the basis usage problem is known to be NP-hard, the role usage problem is also NP-hard. Second, we prove the hardness-of-approximation of the role usage problem (the study for the basis usage problem can be found in [19]). The role usage problem 10 H. Lu et al. / An optimization framework for role mining can be related to the ±PSC (positive-negative partial set cover) problem. The ±PSC problem is defined as follows: given a collection S of subsets of positive elements P and negative elements N , find a subcollection C minimizing |P \ (∪C)| + |N ∩ (∪C)|, where the |·| operator returns the count of elements in a set. Miettinen [20] shows that 1−ε it is NP-hard to approximate the ±PSC problem to within a factor of Ω(2log |P | ) for any ε > 0, where |P | is the number of positive elements. For each instance {P , N , S} of the ±PSC problem, we can create a corresponding equivalent instance {UPAm×n , PAk×n } of the role usage problem as follows: • let m (the number of users) be 1, n (the number of permissions) be |P | + |N |; • let the first |P | permissions correspond to the |P | positive elements and the remaining |N | permissions correspond to the |N | negative elements; • let UPA be a 1 × (|P | + |N |) Boolean vector with the first m elements of being 1 and the rest being 0 (i.e. UPA corresponds to all positive elements); • let k be |S| (the size of the subcollection of subsets); • create a row vector for PAk×n corresponding to each subset in S such that if the subset contains element i (positive or negative), make the vector component corresponding to the element as 1 and the rest vector components as 0. Given the mapping, a subcollection C of the solution to the ±PSC problem instance can be mapped to UA1×k , the solution to the role usage problem instance, such that its k vector components correspond to the subsets in S respectively, i.e. the ith component being 1 if its corresponding subset is in C; otherwise 0. Therefore, the constructed role usage problem instance is equivalent to the ±PSC problem instance and both instances share the hardness-of-approximation result. Approximating each row vector of UPA by PA can be viewed as a ±PSC problem instance. So it is NP-hard to 1−ε approximate the role usage problem within a factor of Ω(2log P ), where P is the least number of permissions that a user has in UPA (i.e number of 1’s elements in a row vector of UPA). 2 4. Unified framework using integer linear programming Role mining variants share much commonality. For example, the only difference between the basic RMP and the edge RMP is in their objective function. The difference between the δ-approx RMP and the basic RMP is that the δ-approx RMP allows inexactness. A naturally arising thought is that if we can build a unified framework for all RMP variants, we then do not need to deal with each problem individually. Additionally when new RMP variants appear, system engineers do not need to start from scratch and can take advantage of algorithms that have been developed for existing RMP variants. As all RMP variants are essentially optimization problems, we propose to formulate them through integer linear programming. There are many benefits by connecting RMP variants with integer linear programming. First, optimization has been studied for more than 50 years. There are several H. Lu et al. / An optimization framework for role mining 11 good exact optimization algorithms, even for integer linear programming, such as branch-and-bound [13]. In addition, successful optimization software packages are easily obtainable, such as Matlab and the Neos server.1 Even though those RMP variants are proven NP-hard, small or medium size problems can still be solved through traditional optimization techniques. In addition to exact algorithms, approximation algorithms may be developed through LP-based techniques, such as dual-fitting [14], randomized rounding [9], and primal-dual schema [12]. The linear programming framework we will propose in fact can not only incorporate RMP variants, but also problems in other application domains, such as tiling database problems [7] and discrete basis problems [21]. 4.1. Usage RMP We start from the usage RMP as it is essentially a subproblem of the other RMP variants. In the usage RMP, the input is UPA and PA, and the goal is to find UA minimizing the reconstruction error. For simplicity of notation, we denote PA by the Boolean matrix R and UA by the Boolean matrix variable X. Then the role usage problem can be roughly represented as follows: minimize UPA − X ⊗ R1 . To formulate it as an explicit integer linear programming problem, we first formulate UPA − X ⊗ R = 0 and then relax it by tolerating errors. To do so, we let Ri denote role i and UPAi denote permissions assigned to user i. UPA − X ⊗ R = 0 implies that every user’s permission set should be represented as a union of some candidate roles. This can be modeled as follows: UPAi = Rt , (1) t∈si where si denotes the role subset assigned to user i. {si } can convert to a Boolean matrix X such that Xit = 1 if role t belongs to si , otherwise Xit = 0. The constraint essentially says that if some user has a particular permission, at least one role having that permission has to be assigned to that user. In turn, if that user does not have some permission, none of the roles having that permission can be assigned to it. So X ⊗ R = UPA can be transformed to the following equation set ⎧ q ⎪ ⎪ ⎪ Xit Rtj 1, if UPAij = 1, ⎪ ⎨ t=1 q ⎪ ⎪ ⎪ ⎪ Xit Rtj = 0, if UPAij = 0. ⎩ t=1 1 http://www-neos.mcs.anl.gov/. (2) 12 H. Lu et al. / An optimization framework for role mining To formulate inexactness, we introduce a non-negative slack variable {Vij } to each constraint and have the following modified constraints. ⎧ q ⎪ ⎪ ⎪ Xit Rtj + Vij 1, if UPAij = 1, ⎪ ⎨ t=1 (3) q ⎪ ⎪ ⎪ ⎪ Xit Rtj − Vij = 0, if UPAij = 0. ⎩ t=1 q q For UPAij = t=1 Xit Rtj + Vij 1allows t=1 Xit Rtj to be 0. For q1, q UPAij = 0, t=1 Xit Rtj − Vij = 0 allows t=1 Xit Rtj to be greater than 1. In other words, if Vij > 0, the constraint enforcing whether UPAij = 1 or UPAij = 0 is not satisfied. The objective function UPA − X ⊗ R1 is then the count of unsatisfied constraints for UPA = X ⊗ R. To formulate the objective function, we need to count the positive variables {Vij }. To do so, we introduce another Boolean variable set {Uit } and enforce the following constraints M Uij Vij , Uij Vij , where the big M is a large constant greater than q. The above two inequalities ensure that if Vij 1, Uij = 1 and if Vij = 0, Uij = 0. Thus, the count of positive {Vij } is ij Uij . Therefore, the integer linear programming formulation for δ-approx RMP is as follows: Uij minimize ij ⎧ q ⎪ ⎪ ⎪ Xit Rtj + Vij 1, ⎪ ⎪ ⎪ ⎪ t= 1 ⎪ ⎪ q ⎪ ⎨ Xit Rtj − Vij = 0, ⎪ ⎪ t= 1 ⎪ ⎪ ⎪ M Uij − Vij 0, ⎪ ⎪ ⎪ ⎪ Vij , U ⎪ ⎩ ij Xit , Uij ∈ {0, 1}, Vij 0. if UPAij = 1, if UPAij = 0, (4) ∀i, j, ∀i, j, 4.2. Basic RMP With the ILP formulation for the usage RMP, it is easy to formulate others. Consider the basic RMP, where the input is UPA, and the goal is to find UA and PA. It can be succinctly modeled as an optimization problem as follows: minimize |Roles| s.t. UA ⊗ PA = UPA. H. Lu et al. / An optimization framework for role mining 13 For simplicity, we break up the basic RMP into two subproblems: (i) find a candidate role set {R1 , R2 , . . . , Rq }; (ii) locate a minimum candidate role subset to form PA and then determine UA. A role is nothing, but a permission subset. Given n permissions, there are 2n possible roles. But most of them can be easily eliminated. For example, if none of users had both permissions i and j, any permission subset containing both permissions i and j can never be a candidate role. We will explicitly discuss the generation of candidate roles in the next section. Here assume we have already produced a candidate role set {R1 , R2 , . . . , Rq }. Then consider the second subproblem, which is similar to the usage RMP. Every user’s permission set should be able to be represented as a union of some candidate roles. This can be phrased as the following: UPAi = Rt , (5) t∈si where si denotes the candidate role subset assigned to user i and UPAi denotes the permission subset assigned to user i. {si } can convert to a Boolean matrix X such that Xit = 1 if candidate role t belongs to si , otherwise Xit = 0. As both X and R are Boolean matrices, their relation can be represented by the following Boolean matrix multiplication, Xn×q ⊗ Rq×m = UPAn×n , (6) where R and UPA are given and X is unknown. As Xit indicates whether Rj is assigned to user i, so if ∀t Xit = 0, it means that Rt is never employed. So the basic RMP is essentially to minimize the number of non-zero columns in X. Same as the usage RMP, the constraint X ⊗ R = UPA can be enforced by ⎧ q ⎪ ⎪ ⎪ Xit Rtj 1, if UPAij = 1, ⎪ ⎨ t=1 q ⎪ ⎪ ⎪ ⎪ Xit Rtj = 0, if UPAij = 0. ⎩ (7) t=1 Now we need to count the number of non-zero columns in X. To do so, we introduce a set of Boolean indicator variables {d 1 , . . . , dq }, where dt = 1 indicates role q t is present, otherwise not. Then |Roles| = t=1 dt . Since dt indicates the presence of a role, dt should only be 1 when at least one user is assigned role t. Thus, dt = 1 when at least one of {X1t , . . . , Xmt } is 1. We can formulate this by adding in the constraints dt Xit , ∀i, t. (8) 14 H. Lu et al. / An optimization framework for role mining Finally, putting every thing together, the basic RMP is formulated as follows: minimize dt t ⎧ q ⎪ ⎪ ⎪ Xit Rtj 1, if UPAij = 1, ⎪ ⎪ ⎪ ⎪ ⎪ ⎨ t=q 1 Xit Rtj = 0, if UPAij = 0, ⎪ ⎪ ⎪ ⎪ t=1 ⎪ ⎪ ⎪ d Xij , ∀i, j, ⎪ ⎩ t dt , Xij ∈ {0, 1}. (9) 4.3. δ-approximate RMP The goal of the δ-approx RMP problem is to minimize the number of required roles amount is formulated within a tolerable amount of error. In Eq. (4), the error as ij Uij . In Eq. (9), the number of roles is formulated as t dt . By combining these two equation systems together, we easily obtain the formulation for the δapproximate RMP problem as Eq. (10): minimize dt t ⎧ q ⎪ ⎪ ⎪ Xit Rtj + Vij 1, ⎪ ⎪ ⎪ ⎪ t=1 ⎪ ⎪ ⎪ q ⎪ ⎪ ⎪ ⎪ Xit Rtj − Vij = 0, ⎪ ⎪ ⎪ ⎨ t=1 MU ij − Vij 0, ⎪ ⎪ ⎪ Uij Vij , ⎪ ⎪ ⎪ ⎪ Xit , d t ⎪ ⎪ ⎪ ⎪ ⎪ Uij δ, ⎪ ⎪ ⎪ ⎪ i j ⎪ ⎩ dj , Xit , Uij ∈ {0, 1}, Vij 0, if UPAij = 1, if UPAij = 0, ∀i, j, ∀i, j, ∀i, t, (10) m where n i=1 j= 1 Uij δ ensures the total deviating cells less than δ and the q objective function t=1 dt is the number of mined roles. 4.4. Edge RMP Unlike the basic RMP, the goal of the edge RMP is to minimize the administrative cost. As a RBAC system only needs to maintain explicit user-to-role assignments and H. Lu et al. / An optimization framework for role mining 15 explicit role-to-permission assignments, the administrative cost is hence evaluated by positive cells in UA and PA. By employing the same notations in the ILP formulation for the basic RMP as in Eq. (9), the administrative cost can be formulated as the following UA1 + PA1 = i t Xit + t dt Rtj , (11) j where i t Xit is the count of positive cells in UA and t (dt j Rtj ) counts positive cells in selected candidate roles, which is PA1 . Thus, the ILP formulation for the edge RMP can be obtained by replacing the objective function in Eq. (9) with the formula in Eq. (11). Finally, in cases where one evaluates the administrative cost as the number of assignments in UA, PA, and the number of roles, the objective is to minimize |UA| + |PA| + |UPA| [3,38]. This can also be easily obtained by combining the objective function in Eqs (9) and (11). 4.5. Discussion From the ILP formulating processes for those role mining variants, we observe that once one role mining variant is successfully formulated, it is easy to formulate other variants. This fact illustrates the benefit of building a unified ILP framework of the role mining problem. Such an ILP framework is broad and flexible to incorporate new variants, which may appear in practice. For instance, a system engineer may not want to see 0-becoming-1 errors in the δ-approx RMP problem. In the RBAC context, 0-becoming-1 errors mean that a user obtains unauthorized permissions, which may harm the system security severely. To reflect this concern the ILP formulation, we in q could simplyreplace the constraint for UPAij = 0 of t=1 Xit Rtj − Vij = 0 in q Eq. (10) by t=1 Xit Rtj = 0. Consider another instance that all roles are expected to be highly repetitively used. To realize this expectation, we could add a constraint that i Xit LB, where i Xit is the number of users granted role t and LB is a specified lower bound. This will ensure that every picked role is employed more than LB times. Thus, other variants of the role mining problems can be created simply by adding more constraints or modifying the objective function. This is why we consider the optimization framework to be a general framework capable of modeling various matrix decomposition problems. 5. Heuristics The ILP formulation for the usage RMP takes mn variables and for other RMP variants takes about mq variables, where m is the number of users, n is the number of permissions, and q is the number of candidate roles. The state-of-art ILP software 16 H. Lu et al. / An optimization framework for role mining packages can deal with up to millions variables. So for problems of sizes with m and q greater than 1000, we have to resort to efficient heuristics. In this section, we will propose efficient heuristics for RMP variants. They are easy to implement and run fast. Their effectiveness will be validated in the next section. Like the above ILP formulations that require candidate roles to be given, our heuristics need candidate roles generated beforehand as well. So before presenting our heuristics, we will first discuss ways of generating candidate roles. 5.1. Candidate role set generation We present three ways of generating candidate roles. All of these have been proposed in prior work either in the role mining context or in the discrete basis context. Note that any other technique could also be used, since it is merely the first step of our approach. Intersection. We call the first method Intersection, which is proposed by Vaidya et al. in [37]. This method includes every unique user’s permission set as a candidate role. In addition the intersections of every two user permission sets are included as candidate roles as well. This method is based on two observations. First in order to make a permission subset be a role, it must be employed and assigned to some user. In other words, a candidate role must be a subset of some user’s permission set. The other observation is that to make a RBAC system succinct and efficient, roles should be repetitively employed. So a role is desired to be assigned to multiple users. Hence, a candidate role is expected to be the intersection of two or multiple user’s permission sets. Given m users, candidate roles generated by the Intersection method is O(m2 ). Association. This method exploits the correlations between the columns of UPA by employing the association rule mining concept in [26]. It was presented as a part of the ASSO algorithm proposed for the discrete basis problem. The concrete generation process is as follows. Suppose the user-to-permission assignment UPAm×n is given. Let UPA(:, i) denote the ith column. Then n candidate roles will be generated, represented by a Boolean matrix Cn×n . In which, Cij = 1 if UPA(:, i), UPA(:, j) \ UPA(:, i), UPA(:, i) τ , (12) otherwise 0, where ·, · is the vector inner product operation, and τ is a predetermined threshold controlling the level of correlation. Itself. This method is to simply treat unique user’s permission sets from UPA as candidate roles. It was studied in [19], but in the scenario of the discrete basis problem. The problem is to find a discrete basis for an input Boolean matrix, such that the discrete basis is a subset of columns (or rows) of the input Boolean matrix. In the role mining context, if employing this method to generate candidate roles, it is assumed that for every role there must be one user who is granted this role and only this role. While this may not necessarily make sense in the role mining context, it can indeed be meaningful in other domains such as text mining. H. Lu et al. / An optimization framework for role mining 17 Algorithm 1. Greedy algorithm for usage RMP Input: UPA1×n and PAr×n Output: UA1×r 1: UA1×r ← (0, . . . , 0); 2: for i ← 1 : r do 3: if PAi ⊆ UPA then 4: UA[i] = 1; 5: end if 6: end for 7: loop 8: PAi = the best remaining role, j = the largest improvement; 9: if j > 0 then 10: UA[i] = 1; 11: else 12: break 13: end if 14: end loop 5.2. Usage RMP As we mentioned earlier, the usage RMP is equivalent to the basis usage problem. A heuristic called Loc & IterX was proposed for the basis usage problem in [19]. Here we propose two more effective heuristics, one greedy approach and one simulated annealing approach. Without loss of generality, we assume that there is only one user. Hence UPA is 1 × n. A role set PAr×n is given. The usage RMP becomes to find a subset of r roles, such that the union of permissions contained in those roles is closest to UPA. Greedy. A greedy algorithm is any algorithm that makes the locally optimal choice at each stage with the hope of finding the global optimum. It has many successful applications, such as the traveling salesman problem and the knapsack problem. Our greedy approach consists of a preliminary step. It is to include all roles in PA which are subordinate to UPA into UA. A role is considered subordinate to UPA, if all permissions in the role belong to UPA. After that, iteratively pick one remaining role which reduces the reconstruction error by the most, till the reconstruction error cannot be reduced any more. Simulated annealing. Simulated annealing is a generic probabilistic heuristic for the global optimization problem. It locates a good approximation to the global optimum of a given function in a large search space. Unlike greedy heuristics, simulated annealing is a generalization of a Markov chain Monte Carlo method, which has a solid theoretical foundation. We present a simulated annealing heuristic for the role usage problem as follows: First, we let (0, . . . , 0) be the starting state of UA1×r . 18 H. Lu et al. / An optimization framework for role mining Then at each stage, randomly select its neighboring value by randomly picking one element of UA and flipping its value from 0 to 1 or from 1 to 0. If the new UA better reconstructs UPA, the next state is the new UA. If not, with a certain probability less than 1, the next state is the new UA. In other words, with certain probability, it remains its original state. This property reduces the chance of being stuck at a local optimum. The procedure described above allows a solution state to move to another solution state and hence produces a Markov chain. Denote the nth state by x and the randomly selected neighboring value be y. If the next state is y with probability exp{λV (x)/N (x)} min 1, exp{λV (y)/N (y)} or it remains x, where λ is a constant, V (t) is the reconstruction error with the solution t, and N (t) is the number of neighboring values of t. Such a Markov chain has a limiting probability of 1 for arriving at optimal minimization solutions when λ → ∞ [29]. But it has been found to be more useful or efficient to allow the value of λ to change with time. Simulated annealing is a popular variation of the preceding procedure. Here, we adopt the formula proposed by Besag et al. [27] and let the transition probability be exp{λn V (x)/N (x)} , min 1, exp{λn V (y)/N (y)} where λn = log(1 + n). In our case, N (x) and N (y) are equivalent and are canceled out in the formula. As computing time is limited, we terminate the algorithm after a certain number of iterations regardless of whether or not the global optimum is reached. Our complete simulated annealing algorithm is as described in Algorithm 2. 5.3. Basic RMP The heuristic proposed for the basic RMP also runs in a greedy fashion. Consider the ILP formulation for the basic RMP given in Eq. (9). There are two main types of constraints, one for {UPAij = 1} and the other for {UPAij = 0}. In order to satisfy q the constraint set { t=1 Xit Rtj = 0, if UPAit = 0}, if Rjt = 1, Xit must be equal to 0. Therefore, many variables Xit can be determined directly in this way. Then the q constraint set { t= 1qXit Rtj = 0, if UPAit = 0} is all satisfied. Next consider { t=1 Xit Rtj 1, if UPAit = 1}. To satisfy it, if the particular cells Rt1 j , Rt2 ,j , . . . , Rtk ,j are 1, one of the associated cells Xit1 , Xit2 , . . . , Xitk has to be 1. q Now, consider the objective function min t=1 dt . As dt is determined by the constraint dt Xit . Hence once one of the cells {Xit , ∀t} is 1, dt is forced to be 1. In other words, no matter how many variables in {Xit , ∀t} take the value of 1, the objective function value is always increased by 1. Our greedy algorithm H. Lu et al. / An optimization framework for role mining 19 Algorithm 2. Simulated annealing algorithm for usage RMP Input: UPA1×n and PAr×n Output: UA1×r 1: UA, x ← (0, . . . , 0); n ← 1; 2: while n limit and V (UA) > 0 do 3: y = a random neighboring value of x; 4: if V (y) < V (UA) then 5: UA ← y; 6: end if 7: x = y with probability exp{log(1+n)V (x)} min{1, exp{ C log(1+n)V (y )} }; 8: n ← n + 1; 9: end while utilizes this property. At each step, choose a candidate role Rt such that by setting q {Xit , ∀t}, except those predetermined, be 1, the most remaining constraints { t=1 Xit Rtj = 1, if UPAij = 1} are satisfied. We call the count of such satisfied remaining constraints the basic-key. Definition 5 (Basic-key). For each column of the variable matrix {Xit }, the count q of the constraints { t=1 Xij Rtj = 1, if UPAij = 1} being satisfied by letting the cells of such a variable column be 1 except the cells predetermined, is called its basic-key. If there are multiple columns with the same greatest basic-key, we simply choose the column with the least index. Once a column is chosen, it means that the associated candidate role is chosen and the associated dt is set to be 1. Then delete the satisfied constraints and perform the same procedure repetitively till all the constraints are satisfied. The full procedure of this greedy algorithm is given in Algorithm 3. 5.4. δ-approximate RMP From the ILP formulation perspective, the δ-approximate RMP seems much more complicated than the basic RMP. However an efficient greedy heuristic for the δapproximate RMP can be easily developed by modifying the greedy heuristic for the basic RMP. As the δ-approximate RMP tolerates δ amount of errors, we can terminate Algorithm 3 early once the remaining uncovered 1’s cells are less than δ. 5.5. Edge RMP Remember that the goal of the edge RMP is to minimize the administrative cost UA1 + PA1 , while the basic RMP only aims to minimize the number of roles. 20 H. Lu et al. / An optimization framework for role mining Algorithm 3. Greedy algorithm for basic RMP Input: Unique user-to-permission assignments UPAm×n and candidate roles {R1 , . . . , Rq }. Output: UA and PA 1: PA = ∅; q 2: Investigate the constraints { t=1 Xit Rtj = 0} in Eq. (9) and determine some of Xit to be 0; 3: Select the candidate role Rt with the greatest basic-key value and include it in PA. 4: Let undetermined values in {Xit } be 1 and deleter hence satisfied constraints in { qt=1 Xit Rtj > 1}; q 5: Go back to step 3 till all constraints of { t=1 Xit Rtj = 0} and q { t=1 Xit Rtj > 0} are satisfied. 6: Set the remaining variables in X to be 0. 7: UA is the subset of rows in X corresponding to selected roles in PA. So a basic RMP optimal solution is not necessarily an edge RMP solution. The edge RMP was studied in [36], where a greedy heuristic was proposed. To distinguish it, we call it Edge-Key. Here we propose a new heuristic, called Two-Stage. The edge RMP objective consists of two parts UA1 and PA1 . Intuitively lower number of roles leads to a lower value of PA. In this sense, an optimal solution of basic RMP can be a reasonable solution for edge RMP. Our greedy heuristic for basic RMP is able to produce a good PA. However, UA produced by the greedy heuristic is just a byproduct and not in its optimal form. We can start a second phase and reduce user-role assignments by reassigning obtained roles of PA to each user. It is essentially another basic RMP problem that assigns the minimum roles from PA to exactly cover all permissions for a user. Therefore, we can again adopt our greedy heuristic. It is obvious that if a role contains permissions not originally possessed by a user, the role can never be assigned to the user. With this rationale, we can simplify the second-phase task by filtering out unlikely roles from PA in advance. The complete algorithm is as described in Algorithm 4. 6. Experimental evaluation In this section, we conduct extensive experiments on both synthetic data sets and real data sets to evaluate the performance of our heuristics. 6.1. Synthetic data The synthetic data sets are generated as follows. Firstly, generate a set of unique roles PAr×n and user-to-permission assignments UAm×r in a random fashion. UPA H. Lu et al. / An optimization framework for role mining 21 Algorithm 4. Two-stage algorithm for edge-RMP Input: User-permission assignments UPAm×n and candidate roles {R1 , . . . , Rq }. Output: UA1 + PA1 1: Run Algorithm 3 with {R1 , . . . , Rq } to obtain PA. 2: for each UPAi do 3: Remove roles that contain permissions not belonging to UPAi from PA. 4: Run Algorithm 3 with remaining roles in PA to obtain UAi . 5: end for Table 1 Notations m n r ρ1 ρ2 # of users # of permissions # of roles Density of PA Density of UA noise limit Noise percentage Maximum number of iterations is the Boolean product of PA and UA. Two parameters ρ1 and ρ2 are employed to control the density of 1’s cells in PA and UA respectively, which then determine the density of 1’s cells in UPA. Precisely, to create a role, generate a random number Poissrnd(ρ1 ) from a Poisson distribution with the mean of ρ1 × n. If the generated number is greater than n, perform it again. Then randomly generate a Boolean vector with Poissrnd(ρ1 ) 1’s elements. We generate a user’s role assignment in a similar way, except replacing ρ1 with ρ2 . To reflect real data sets, we also add in noise by flipping the values for a portion of cells. noise is the noise percentage parameter. For convenience of reference, we list all notations in Table 1. In which, the parameter limit is used in Algorithm 2 to control the maximum number of iterations. The first experiment is to study the usage RMP. We compare the Loc & IterX algorithm proposed in [19] with our greedy heuristic and our simulated annealing (SA) heuristic with limit of 500. Three algorithms are compared with respect to the reconstruction error ratio, which is defined as the following. Error Ratio = UPA − UPA 1 /UPA1 , where UPA denotes the reconstructed UPA. The concrete experimental design is as follows. (i) Generate a number of data sets {UA, PA, UPA} with m = 50, n = 50, ρ1 = 0.3, ρ2 = 0.3 and noise varying from 0 to 0.2; (ii) Run the Loc & IterX, the greedy algorithm, and the SA algorithm with 22 H. Lu et al. / An optimization framework for role mining Fig. 1. Experimental results on usage RMP. (a) ρ1 = 0.3, ρ2 = 0.3, m = 50, n = 50, r = 10, limit = 500. (b) ρ1 = 0.3, ρ2 = 0.3, m = 1, n = 50, r = 30, noise = 0.2. (c) ρ1 = 0.3, ρ2 = 0.3, m = 1, n = 50, r = 30, noise = 0.2. (Colors are visible in the online version of the article; http:// dx.doi.org/10.3233/JCS-130484.) limit of 500 respectively to obtain UA and UPA given PA and UPA. The experiment result is illustrated as shown in Fig. 1(a). It shows the greedy algorithm and the SA algorithm are significantly better than the Loc & IterX algorithm. The other observation is that the performances of the greedy heuristic and the SA heuristic are comparable. Unlike the greedy heuristic, which returns a local optimum in most cases, a SA heuristic can with probability one reach a global optimum given enough computing time; in other words unlimited number of iterations. However, the meaning of existence for a heuristic is that it can get a good solution in an acceptable time. So we conduct another experiment to investigate how our SA heuristic performs with respect to limit. We generate 100 sets of {UA, PA, UPA} with ρ1 = 0.3, ρ2 = 0.3, m = 1, n = 50, r = 30 and noise = 0.2. limit varies from 50 to 2500. Run both the greedy heuristic and the SA heuristic against them. Figure 1(b) shows that the SA heuristic has better performance than the greedy heuristic when limit is greater than 500. It also shows at the beginning small increase in limit can improve the performance by a lot for the SA heuristic. However, when limit reaches certain point, the improvement becomes much less significant. It somehow shows that the greedy heuristic has relatively satisfactory performance. Figure 1(c) plots computing times for both methods. The computing time for the SA heuristic increases linearly with limit. However, to achieve the same performance as the greedy heuristic, the SA heuristic takes more time. So we conclude that the SA heuristic is recommended for small-size problems while the greedy heuristic is suitable for large-size problems. Next we study our greedy heuristic for basic RMP. The heuristic is closely dependent on candidate roles. Three ways of generating candidate roles were introduced: Itself, Intersection and Association. As Association has a parameter of association threshold τ , for a fair comparison, we consider two cases, τ = 0.9 and τ = 0.7. We generate {UPA} with ρ1 = 0.3, ρ2 = 0.3, m = 50, n = 50, r = 10, noise = 0. Then run Algorithm 3 with each candidate role set respectively to find UA and PA. The results are illustrated as shown in Fig. 2(a). Both Intersection and Itself are able to reconstruct UPA completely, but Association. With respect to the error reducing H. Lu et al. / An optimization framework for role mining 23 Fig. 2. Experimental results on basic RMP and δ-approximate RMP. (a) ρ1 , ρ2 = 0.3, m, n = 50, r = 10, noise = 0. (b) ρ1 , ρ2 = 0.3, m = 1, n = 50, r = 30, noise = 0.2. (c) ρ1 , ρ2 = 0.3, m, n = 50, r = 10, noise = 0. (d) m, n = 50, r = 10, noise = 0. (e) ρ1 , ρ2 = 0.3, r = 10, noise = 0. (Colors are visible in the online version of the article; http://dx.doi.org/10.3233/JCS-130484.) speed, Association is inferior to two other approaches. Overall Intersection has the best performance, while the performance of Association is far from satisfactory. So ignoring Association, we then further study Itself and Intersection with respect to computing time. Figure 2(b) shows that Intersection takes much more time than Itself. The underlying reason is that Intersection produces O(m2 ) candidate roles as opposed to O(n) candidate roles produced by Itself. We conclude that for small size problems Intersection is preferred while Itself is recommended for large size problems. Our greedy heuristic for δ-approximate RMP is same as that for basic RMP, except that it terminates early. Therefore, the previous experimental results are valid for studying δ-approximate RMP. Figure 2(c) plots reconstruction error ratio values with respect to required number of roles. If a small amount of errors are allowed, UPA can be successfully reconstructed with much less number of roles. For example, Itself needs only 10 roles to cover more than 80 percent of existing permissions, while 20 extra roles are needed to cover the remaining permissions. Somehow it can be interpreted that roles returned by δ-approximate RMP are more “fundamental”. It suggests that if a bottom-up role engineering approach is only used to identify promising roles to assist a top-down engineering approach, δ-approximate RMP is more efficient. Figure 2(d) shows the relation between data density and the number of required roles for full coverage. As the data density is indirectly determined by ρ1 and ρ2 . The experiment is to let ρ1 and ρ2 have the same value and vary them together 24 H. Lu et al. / An optimization framework for role mining from 0.1 to 0.6. In Fig. 2(d), both lines suggest that for denser data sets more roles are required. In other words, our greedy heuristic performs better for sparse data sets, while real access control data sets are usually very sparse. We then study the relation between data size and our greedy heuristic performance. We let m and n be same and vary them from 40 to 100. Figure 2(e) illustrates that our heuristic performs good for medium and small sizes. Next we evaluate the Two-Stage algorithm proposed for edge RMP by comparing it with the Edge-Key algorithm proposed in [36]. For convenience, we let default parameters for generating UPA be {m = 50, n = 50, ρ1 = 0.3, ρ2 = 0.3, noise = 0}. We then vary one parameter each time, generating a set of UPAs, and run TwoStage and Edge-Key respectively on them. As both algorithms require input candidate roles, we consider both Itself and Intersection. Hence, we actually compare four approaches: (Two-Stage, Itself ), (Two-Stage, Intersection), (Edge-Key, Itself ) and (Edge-Key, Intersection). Experimental results are illustrated as shown in Fig. 3. All four graphs suggest that Two-Stage is better than Edge-Key. We also compare our heuristics with the neural network algorithm and the genetic algorithm proposed in [32]. These two algorithms can only apply to the δ-approximate RMP problem, as they do not guarantee returning an exact Boolean matrix decomposition solution. So the comparison is only with respect to the δ-approximate RMP problem. Before we present the experiment results, we briefly introduce the two algorithms to be compared with. The Neural Network algorithm is based on a Hopfield-like neural network of n fully connected neurons, where n is corresponding to the number of permissions in the role mining setting. The weights between neurons are learned from the training dataset UPA based on the Hebbian rule. To reveal factors (roles), Fig. 3. Reconstructed error ratio w.r.t. percentage of required roles for real data sets. (Colors are visible in the online version of the article; http://dx.doi.org/10.3233/JCS-130484.) H. Lu et al. / An optimization framework for role mining 25 a random pattern (a user’s assignment) is fed to the neural network and the resultant attractor is a factor. The resultant attractors might be spurious. So as suggested by [32], different random patterns with the number of 1’s elements ranging from 1 to n are fed into the neural network. For each random pattern, attractors with different number of 1’s elements are derived by adjusting the threshold function. At the end, k attractors among all found attractors with the minimum Lyapunov function values are picked as the final role set. The Genetic algorithm starts with an initial population of solutions and lets them evolve with crossover and mutation operators towards better solutions in terms of a fitness function. The initial population is randomly generated. Our experiment will use the same algorithmic parameter values as in [32]. The experimental design is as follows. We randomly generate a set of Boolean matrices UPA, UA and PA with the same procedure as described before. Then we separately apply our heuristics, the neural network algorithm and the genetic algorithm to recover the original role set PA from the given UPA. We compare all the discovered role sets in terms of the closeness to the real role set. The closeness between the original role set PAk×n and the derived role set PAk×n is defined as the sum of the hamming distances between each real role and its closest derived role. It is mathematically expressed as: k i=1 PAij − PAtj . min 1tk (13) j To evaluate those algorithms, we conduct experiments on two sets of data. The first set of {UA, PA, UPA} is generated by letting r = 10, ρ1 = 0.3, ρ2 = 0.3 and noise = 0 while varying m and n from 30 to 90. The results are reported in Table 2, where the best solutions are marked in bold. Among all algorithms, Intersection has the best performance and discovers all the original roles for the four out of seven cases. Both Intersection and Itself perform better than Neural Network and Genetic. The experimental result demonstrates the ability of our algorithms in revealing the underlying roles in the Boolean matrix of user-permission assignments. We did not observe clear association between the algorithm performance and the data size. Intuitively, with the fixed number of roles (patterns), it would be relatively easy to detect roles (patterns) for the large Boolean matrix (training data). However, as the data size grows, the number of potential patterns grows exponentially, which would Table 2 Comparison results w.r.t. closeness with varying data size and fixed number of roles m, n 30 40 50 60 70 80 Intersection 0 0 0 220 10 240 0 Itself Neural network Genetic 0 0 20 100 60 50 0 50 30 220 220 260 230 190 210 240 260 322 0 120 230 90 26 H. Lu et al. / An optimization framework for role mining Table 3 Comparison results w.r.t. closeness with fixed data size and varying number of roles r 5 10 15 20 25 30 Intersection 100 160 30 260 350 480 Itself Neural Network Genetic 100 120 100 100 170 225 195 60 195 300 420 315 575 590 480 660 525 580 counter the positive effect brought by the growth of training data. The second set of {UA, PA, UPA} is generated by letting m = 50, n = 50, ρ1 = 0.3, ρ2 = 0.3 and noise = 0 while varying r from 5 to 10. The result is reported in Table 3. Intersection has the best performance for the five out of six cases. When r being 10, Itself performs the best. Both Intersection and Itself perform better than both Neural Network and Genetic for all cases. This experimental result again demonstrates the advantage of our heuristics in discovering roles. In Table 3, there is an approximate trend showing the larger value of r leads the worse algorithm performance in terms of closeness. One possible explanation is that when there are many underlying patterns more training data are required to discover them. In this experiment, since the values of m and n are fixed, it becomes difficult for an algorithm to discover many embedded roles. The neural network approach for the Boolean matrix decomposition itself is very interesting and provides a fresh perspective to the role mining problem. One possible explanation why it does not perform well in our experiment is that the Hopfield neural network has limited ability in storing patterns. When there are many roles to store, the derived Hopfield neural network may contain many spurious patterns (attractors) and there is no effective way to distinguish real and spurious patterns. However, the neural network algorithm might be very helpful for detecting noise in a set of user-permission assignments with known roles. We plan to look into this problem in the future. The genetic algorithm is a local-search heuristic, which is fundamentally same as the greedy heuristic and the simulated annealing heuristic used in our approaches. One explanation why the genetic algorithm does not work well in our setting is that there are too many algorithmic parameters in the genetic heuristic. The genetic heuristic tries to discover the two decomposed Boolean matrices at the same time. So given m users, n permission and k roles, there are m ∗ n ∗ p variables to determine simultaneously. In that case, thousands iterations in a typical genetic algorithm search process are not enough to discover a good solution. Differently, our approach discovers a good set of roles first and then try to reconstruct role assignment for each user individually. It takes less computation and hence increases the likelihood of finding a good solution in a limited amount of time. 6.2. Real data We run our greedy heuristics on real data sets collected by Ene et al. [3]. They are emea, healthcare, domino, firewall 1 and firewall 2. H. Lu et al. / An optimization framework for role mining 27 Fig. 4. Reconstructed error ratio w.r.t. number of required roles for real data sets. (Colors are visible in the online version of the article; http://dx.doi.org/10.3233/JCS-130484.) The first experiment is to study the basic RMP. We run Algorithm 3 against each data set with candidate roles generated by Itself and Intersection respectively. Numbers of roles required for full coverage are recorded in Fig. 4. The number of required roles is much less than the number of permissions for each case. It successfully demonstrates the power of RBAC and also the effectiveness of our heuristic on discovering roles. Another observation is that the performance of Intersection is not always better than Itself when working with our greedy heuristic. In theory, with Intersection the optimal solution of a basic RMP is better than that with Itself, as the feasible solution space is expanded. However, if the algorithm runs in a greedy manner, it is not always true. The second experiment is to study the δ-approximate RMP. We run Algorithm 3 and record remaining errors when a new role is identified. Figure 4 plots reconstruction error ratios with respect to number of required roles. The first observation is that all lines drop fast at the beginning. It suggests that a few roles are usually able to reduce error ratio to a very low level. Roles early identified can hence be interpreted as “more valuable”. Another observations is that much more roles are required to cover the remaining few 1’s at the end. It suggests hat if a role mining approach is used as a tool to assist a top-down role engineering approach, δ-approximate RMP might be sufficient. The other important observation is that Intersection dose not always perform better than Itself in our greedy heuristic. In fact, if only considering early mined roles, their performances are comparable. The third experiment is to study the edge RMP. We run (Two-Stage, Itself ), (TwoStage, Intersection), (Edge-Key, Itself ) and (Edge-Key, Intersection) respectively against each data set. The results are reported in Table 4. There are three key ob- 28 Data set emea healthcare domino firewall 1 firewall 2 Users 35 46 79 365 325 Permissions 3,046 46 231 709 590 UPA1 7,220 1,486 730 31,951 36,428 Itself Intersection # of roles Edge value Edge-Key Two-Stage # of roles Edge value Edge-Key Two-Stage 34 16 20 71 10 7,281 624 810 5,475 1,796 7,246 478 727 4,937 1,456 43 14 21 65 10 9,078 553 814 4,508 1,796 9,014 412 827 4,034 1,456 H. Lu et al. / An optimization framework for role mining Table 4 Number of roles and edge value for full coverage for real data sets H. Lu et al. / An optimization framework for role mining 29 servations. The first observation is that Intersection performs better than or equally to Itself for four out of the five data sets in terms of the number of required roles for complete coverage. This is to be expected since Intersection has a larger pool of candidate roles and runs in a greedy fashion. Therefore, more 1’s elements would be covered at each step than what would be covered by Itself. Thus, intuitively Intersection returns fewer roles for full coverage. However, it is not always true, as shown by emea, which is an exceptional case. Here, Intersection returns 43 roles, which is larger than the 34 roles returned by Itself. One reason for this is that both Itself and Intersection are local-search algorithms and do not guarantee global optimum solutions. Therefore, better initial steps may lead to more iterations and more roles in the end. However, the experience teaches us that when computing time is ample, an ensemble approach consisting of several different heuristics should be considered. The second observation is that the fewer roles always are associated with the less edge value. The reason for this is that the edge value consists of |UA| and |PA|. When the roles are fewer, the value of |PA| would likely be less. Since the number of roles is less, the size of user-role assignments UA would likely be less as well. Note that this is just an intuition and might not always be true, although there is no counterexample here. The third observation is that Two-Stage returns less edge values than Edge-Key for all cases except for the domino dataset when Intersection is used. One explanation is that Two-Stage has two steps of reducing the edge value (i.e. the first step is to find PA with small size and the second step is to find UA with small size by calling the role usage problem), while Edge-Key finds UA and PA simultaneously. Therefore, intuitively Two-Stage would have a better chance to find a RBAC solution with smaller edge value. 7. Conclusions and future work This paper studies the Boolean matrix decomposition problem with more attention on its application on the role mining problem. Four BMD variants with pragmatic implications in real application domains are explicitly presented. The main technical contribution of this paper is two-fold: First, it builds a unified framework for a variety of BMD variants by formulating them through ILP; Second, it presents efficient and effective heuristics for BMD variants, which are validated by extensive experiments. There are still several research issues that need to be explored. First, the performance of our heuristics largely depends on candidate role sets. To obtain a better candidate role set, it would be beneficial to incorporate expert knowledge on the semantics of internal business processes and narrow down the pool of promising candidate roles. Second, for large-scale problems, computation time is always an issue, even for heuristics. A data structure specialized for such computation may help reduce the burden. Third, many different constraints, which are not considered in this paper, may exist in practice. How to incorporate them in the heuristic designing process is worth investigating further. We plan to look at these problems in the future. 30 H. Lu et al. / An optimization framework for role mining References [1] A. Colantonio, R. Di Pietro, A. Ocello and N.V. Verde, Taming role mining complexity in RBAC, Computers & Security 29 (2010), 548–564. (Special Issue: Challenges for Security, Privacy & Trust.) [2] E. Coyne, Role engineering, in: Proceedings of the First ACM Symposium on Access Control Models and Technologies, 1995. [3] A. Ene, W. Horne, N. Milosavljevic, P. Rao, R. Schreiber and R.E. Tarjan, Fast exact and heuristic methods for role minimization problems, in: SACMAT’08: Proceedings of the 13th ACM Symposium on Access Control Models and Technologies, ACM, New York, NY, USA, 2008, pp. 1–10. [4] M. Frank, D. Basin and J.M. Buhmann, A class of probabilistic models for role engineering, in: Proceedings of the 15th ACM Conference on Computer and Communications Security, 2008. [5] M. Frank, A.P. Streich, D.A. Basin and J.M. Buhmann, A probabilistic approach to hybrid role mining, in: ACM Conference on Computer and Communications Security, 2009, pp. 101–111. [6] M.P. Gallagher, A. O’Connor and B. Kropp, The economic impact of role-based access control, Planning Report 02-1, National Institute of Standards and Technology, 2002. [7] F. Geerts, B. Goethals and T. Mielikainen, Tiling databases, in: Discovery Science, Springer, 2004, pp. 278–289. [8] G. Golub and C.V. Loan, Matrix Computations, 3rd edn, The Johns Hopkins Univ. Press, Baltimore, MD, USA, 1996. [9] D. Hochbaum, Approximation algorithms for the set covering and vertex cover problems, SIAM Journal on Computing 11(3) (1982), 555–556. [10] M. Koyutürk and A. Grama, Proximus: a framework for analyzing very high dimensional discreteattributed datasets, in: KDD’03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, 2003, pp. 147–156. [11] M. Kuhlmann, D. Shohat and G. Schimpf, Role mining – revealing business roles for security administration using data mining technology, in: SACMAT ’03: Proceedings of the Eighth ACM Symposium on Access Control Models and Technologies, ACM, New York, NY, USA, 2003, pp. 179–186. [12] K. Kuhn, The Hungarian method for the assignment problem, Navel Research Logistics Quarterly 2 (1955), 83–97. [13] A.H. Land and A.G. Doig, An automatic method of solving discrete programming problems, Econometrica 28 (1960), 497–520. [14] L. Lovasz, On the ratio of optimal integral and fractional covers, Discrete Mathematics 13 (1975), 383–390. [15] H. Lu, J. Vaidya and V. Atluri, Optimal Boolean matrix decomposition: Application to role engineering, in: ICDE’08: Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, IEEE Computer Society, Washington, DC, USA, 2008, pp. 297–306. [16] H. Lu, J. Vaidya, V. Atluri and Y. Hong, Extended Boolean matrix decomposition, in: ICDM, 2009, pp. 317–326. [17] H. Lu, J. Vaidya, V. Atluri and Y. Hong, Constraint-aware role mining via extended Boolean matrix decomposition, IEEE Trans. Dependable Sec. Comput. 9(5) (2012), 655–669. [18] H. Lu, J. Vaidya, V. Atluri, H. Shin and L. Jiang, Weighted rank-one binary matrix factorization, in: SDM, 2011, pp. 283–294. [19] P. Miettinen, The Boolean column and column-row matrix decompositions, Data Min. Knowl. Discov. 17(1) (2008), 39–56. [20] P. Miettinen, On the positive–negative partial set cover problem, Inf. Process. Lett. 108(4) (2008), 219–221. [21] P. Miettinen, T. Mielikainen, A. Gionis, G. Das and H. Mannila, The discrete basis problem, in: Knowledge Discovery in Databases: PKDD 2006 – 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, Springer, 2006, pp. 335–346. H. Lu et al. / An optimization framework for role mining 31 [22] I. Molloy, H. Chen, T. Li, Q. Wang, N. Li, E. Bertino, S.B. Calo and J. Lobo, Mining roles with semantic meanings, in: SACMAT, 2008, pp. 21–30. [23] I. Molloy, N. Li, T. Li, Z. Mao, Q. Wang and J. Lobo, Evaluating role mining algorithms, in: SACMAT, 2009, pp. 95–104. [24] I. Molloy, N. Li, Y.A. Qi, J. Lobo and L. Dickens, Mining roles with noisy data, in: Proceedings of the 15th ACM Symposium on Access Control Models and Technologies, SACMAT’10, ACM, New York, NY, USA, 2010, pp. 45–54. [25] G. Neumann and M. Strembeck, A scenario-driven role engineering process for functional RBAC roles, in: SACMAT’02: Proceedings of the Seventh ACM Symposium on Access Control Models and Technologies, ACM, New York, NY, USA, 2002, pp. 33–42. [26] M. Pauli, M. Taneli, G. Aristides, D. Gautam and M. Heikki, The discrete basis problem, IEEE Transactions on Knowledge and Data Engineering 20(10) (2008), 1348–1362. [27] J.B. Peter, P. Green, D. Higdon and K. Mengersen, Bayesian computation and stochastic systems, Statistical Science 10(1) (1995), 3–41. [28] H. Roeckle, G. Schimpf and R. Weidinger, Process-oriented approach for role-finding to implement role-based security administration in a large industrial organization, in: RBAC’00: Proceedings of the Fifth ACM Workshop on Role-Based Access Control, ACM, New York, NY, USA, 2000, pp. 103– 110. [29] S.M. Ross, Simulation (Statistical Modeling and Decision Science), 3rd edn, Academic Press, 2002. [30] J. Schlegelmilch and U. Steffens, Role mining with ORCA, in: SACMAT’05: Proceedings of the Tenth ACM Symposium on Access Control Models and Technologies, ACM, New York, NY, USA, 2005, pp. 168–176. [31] B.-H. Shen, S. Ji and J. Ye, Mining discrete patterns via binary matrix factorization, in: KDD’09: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, 2009, pp. 757–766. [32] V. Snasel, J. Platos, P. Kromer, D. Husek, R. Neruda and A.A. Frolov, Investigating Boolean matrix factorization, in: Proceeding of the Data Mining Using Matrices and Tensors Workshop, DMMT’08, 2008. [33] A.P. Streich, M. Frank, D.A. Basin and J.M. Buhmann, Multi-assignment clustering for Boolean data, in: ICML, 2009, p. 122. [34] J. Vaidya, V. Atluri and Q. Guo, The role mining problem: finding a minimal descriptive set of roles, in: SACMAT’07: Proceedings of the 12th ACM Symposium on Access Control Models and Technologies, ACM, New York, NY, USA, 2007, pp. 175–184. [35] J. Vaidya, V. Atluri, Q. Guo and N. Adam, Migrating to optimal RBAC with minimal perturbation, in: SACMAT’08: Proceedings of the 13th ACM Symposium on Access Control Models and Technologies, ACM, New York, NY, USA, 2008, pp. 11–20. [36] J. Vaidya, V. Atluri, Q. Guo and H. Lu, Edge-RMP: Minimizing administrative assignments for role-based access control, J. Comput. Secur. 17(2) (2009), 211–235. [37] J. Vaidya, V. Atluri and J. Warner, Roleminer: mining roles using subset enumeration, in: CCS’06: Proceedings of the 13th ACM Conference on Computer and Communications Security, ACM, New York, NY, USA, 2006, pp. 144–153. [38] D. Zhang, K. Ramamohanarao and T. Ebringer, Role engineering using graph optimisation, in: Proceedings of the 12th ACM Symposium on Access Control Models and Technologies, SACMAT’07, ACM, New York, NY, USA, 2007, pp. 139–144.