MANAGEMENT SCIENCE
Vol. 52, No. 5, May 2006, pp. 658–670
ISSN 0025-1909 | EISSN 1526-5501 | DOI 10.1287/mnsc.1050.0503
© 2006 INFORMS
Data Shuffling—A New Masking Approach for Numerical Data
Krishnamurty Muralidhar
Gatton College of Business and Economics, University of Kentucky, Lexington, Kentucky 40506, krishm@uky.edu
Rathindra Sarathy
Spears School of Business, Oklahoma State University, Stillwater, Oklahoma 74078, sarathy@okstate.edu
This study discusses a new procedure for masking confidential numerical data—a procedure called data
shuffling—in which the values of the confidential variables are “shuffled” among observations. The shuffled
data provides a high level of data utility and minimizes the risk of disclosure. From a practical perspective, data
shuffling overcomes reservations about using perturbed or modified confidential data because it retains all the
desirable properties of perturbation methods and performs better than other masking techniques in both data
utility and disclosure risk. In addition, data shuffling can be implemented using only rank-order data, and thus
provides a nonparametric method for masking. We illustrate the applicability of data shuffling for small and
large data sets.
Key words: camouflage; confidentiality; data masking; data swapping; obfuscation; privacy; perturbation
History: Accepted by Ramayya Krishnan, information systems; received August 26, 2004. This paper was with
the authors 3 months for 1 revision.
1. Introduction
Organizations derive many benefits from gathering, analyzing, and disseminating data regarding customers, suppliers, and other entities. Simultaneously, these activities raise issues of privacy and confidentiality of sensitive information. Unrestricted analysis, dissemination, and sharing of sensitive data could lead to disclosure of confidential information, so organizations need analytically valid data that do not disclose confidential information. Until recently, this problem was important only for a few governmental agencies (such as the Census Bureau) that released specialized data sets for sophisticated users. Recently, however, the scope of the problem has expanded to cover practically all organizations.

Research in statistical disclosure limitation techniques has led to the development of several tools and techniques that enable disseminated data to be analyzed while protecting individual privacy and confidentiality (Willenborg and de Waal 2001). We shall refer to these broadly as data masking techniques. Most of these techniques were developed in the context of data dissemination by a governmental agency. One such technique that enables data to be analyzed while preserving a high level of confidentiality is data perturbation. Perturbation techniques rely on "perturbing" or changing the original values in a manner that preserves analytical usability without compromising confidentiality. Unfortunately, many users look unfavorably (or possibly suspiciously) on values that have been "modified" (Wall Street Journal 2001). This is likely to be particularly true of a typical user in commercial organizations who may not have the statistical sophistication of users of governmental data. Thus, techniques are needed that will foster greater acceptance of masked data among the common user.

One data masking approach with the potential to satisfy this requirement is data swapping. As the name implies, data swapping exchanges values of confidential variables between records (Dalenius and Reiss 1982) without modifying the original values of the confidential variables. Data swapping has the intuitive appeal and ease of explanation that is not available with other masking techniques, but compared with perturbation methods, existing data swapping methods for numerical variables have low data utility or high disclosure risk (or both). Fienberg (2002) observes that other techniques have surpassed data swapping because it has performed poorly. However, given its potential for greater user acceptance, we believe that data swapping techniques must be brought to perform on par with other data masking techniques. Hence, our objective in this paper is to develop a technique that provides data utility and disclosure risk similar to those of perturbed data, without modifying the original values of confidential variables. In this study, we develop such a new procedure and discuss its underlying theory. Referred to as data shuffling, this procedure will allow organizations to disseminate and share data for analysis with minimal disclosure risk.
2. Data Utility, Disclosure Risk, and the Conditional Distribution Approach
Important in the development of new masking techniques is an evaluation of the associated level of
data utility and disclosure risk. In this section, we
formally define measures of data utility and disclosure risk. Similar to Duncan and Pearson (1991), we
assume that users must be provided access to microdata (values of variables at the record level) without
access restrictions. Data masking techniques attempt
to provide users with access to microdata that provides analytical validity, but prevents disclosure of sensitive information.¹
Analytical validity or data utility represents the
extent to which analyzing the masked data provides
similar results to those obtained by analyzing the
original data. A masking technique provides the highest possible level of data utility if, for any type of analysis, the results using the original and masked data
are identical. In practice, achieving this ideal is difficult. Instead, masking techniques are evaluated based
on how closely they maintain the joint distribution
of the confidential variables and nonconfidential variables. From a statistical perspective, it is important that inferences reached using the masked data be the same as those reached using the original data.
In terms of disclosure risk, while all masking techniques prevent exact disclosure of values, they are
susceptible to inferential disclosure. Inferential disclosure risk (or simply disclosure risk) is measured by the
ability of an intruder or snooper to infer the identity of an entity to whom a particular record belongs
and/or the value of a confidential variable for a particular record. Inferential disclosure can occur even
without access to masked microdata (using nonconfidential microdata and aggregate information) and
cannot be prevented unless access to the entire data
set is prevented. Hence, Dalenius (1977) and Duncan
and Lambert (1986) define disclosure as the improvement in an intruder’s predictive ability when access is
provided to the masked microdata. This improvement
in predictive ability represents the disclosure risk that
can be attributed exclusively to the masking procedure. By this definition, disclosure risk occurs when
(1) the probability of correctly matching a particular deidentified record to an individual is higher with access to microdata, compared to the same probability without access to the masked microdata, and/or
(2) the error bounds for the estimated value of a confidential variable are smaller with access to microdata compared to the error bounds without access to the masked microdata.

¹ The term "disclosure" is used to refer to disclosure of sensitive information to authorized users who use their legitimate access to compromise confidentiality. It does not refer to unauthorized users (or hackers).
2.1. The Conditional Distribution Approach for Generating Masked Data
Muralidhar and Sarathy (2003a, b) have shown that in order to satisfy the above requirements, the masked variables must be generated in a specific manner. Consider a data set with i = 1, …, N observations, consisting of a set of j = 1, …, M confidential variables X and a set of k = 1, …, L nonconfidential variables S.² It is assumed that the variables X and S have a known pdf f(X, S). Let Y represent the masked variables. Using f(X, S), if the individual values in Y are generated from the conditional density of X given S, that is,

$$y_i \sim f_{X|S}(X \mid S = s_i), \qquad (1)$$

then Y will be an independent realization from the conditional distribution f_{X|S}(X | S). In other words, X and Y are conditionally independent given S. From Equation (1), we can show

$$f_{Y|S}(Y \mid S) = f_{X|S}(X \mid S), \qquad (2)$$

thereby maintaining ideal data utility, and

$$f_{X|S,Y}(X \mid S, Y) = f_{X|S}(X \mid S), \qquad (3)$$

thereby minimizing disclosure risk (Muralidhar and Sarathy 2003a); this ensures low disclosure risk for such procedures. In other words, S provides the same information about X as S and Y together. Hence, any procedure that generates masked values based on the conditional distribution in Equation (1) provides the lowest possible level of disclosure risk, since access to microdata with the masked variables Y provides the intruder with no additional information (Dalenius 1977, Duncan and Lambert 1986).
² For the remainder of this manuscript, we use the following notation. Let x_{i,j} and x_{(i),j} represent the ith (unordered) and ith-ordered observations of the jth confidential variable (i.e., rank(x_{(i),j}) = i). Similarly, let s_{i,k} and s_{(i),k} represent the ith (unordered) and ith-ordered observations of the kth nonconfidential variable. Let Y^P represent the perturbed set of j = 1, …, M confidential variables and Y represent the shuffled variables. As with X, y^P_{i,j} and y_{i,j} represent the ith observations of the jth perturbed and shuffled variables, respectively, and y^P_{(i),j} and y_{(i),j} the ordered observations. Let x_i, s_i, y^P_i, and y_i represent (1 × M), (1 × L), (1 × M), and (1 × M) single-observation vectors from X, S, Y^P, and Y, respectively. Let X_j, S_k, Y^P_j, and Y_j represent the (N × 1) vector of a single variable in X, S, Y^P, and Y, respectively, and X_{(j)}, S_{(k)}, Y^P_{(j)}, and Y_{(j)} the rank-ordered (N × 1) vectors. Let f and F represent the probability density and cumulative distribution functions, respectively.
We will refer to masking procedures that satisfy Equation (1) as being based on the conditional distribution approach. Using the conditional distribution f_{X|S}(X|S) for generating masked values has been suggested by Fienberg et al. (1998a), Fuller (1993), Little (1993), and Rubin (1993).
2.2. Issues in Implementing the Conditional Distribution Approach
There are several issues in implementing the conditional distribution approach in practice. The first issue is that the joint distribution f(X, S) is almost never known. Second, even if f(X, S) is known, deriving the conditional density f_{X|S}(X|S) may not be possible except in a few cases such as the multivariate normal. It is common to assume a joint density for the original data and generate the perturbed values (Fienberg et al. 1998a, Muralidhar et al. 1999, Sarathy et al. 2002, Raghunathan et al. 2003, Burridge 2003). Hence, the implementation of almost all masking approaches is heuristic, relying on estimates of the joint and/or conditional distribution of the variables. However, in these cases, we cannot truly claim that the resulting perturbed values strictly satisfy all the data utility requirements specified earlier. We can ensure only that the masked data maintain certain properties based on the underlying assumptions made in generating the masked values.
Note, however, that whether Y is generated from the true, estimated, or assumed conditional distribution, as long as it is of the form

$$y_i = g(S = s_i, \varepsilon_i), \qquad (4)$$

where g(·) is a function that represents the empirical conditional density f_{X|S}(X|S) estimated from the data and the noise term ε is independent of X and S, we can show that the resulting Y satisfies ideal disclosure risk requirements similar to Equation (3). In other words, Equation (4) ensures that X and Y are conditionally independent given S, since the values of Y are generated as a function of S and an independent noise term ε. The data shuffling procedure that we describe in the following sections satisfies this requirement.
Another important consideration is the added variability because of random sampling from the conditional distribution and its impact on statistical
inference. Even if the true underlying distribution is
known, the random generation of the perturbed values results in variability that must be accounted for.
This variability approaches 0 as n → ∞, and is relatively small for large n. Regardless of the size of the data
set, the impact of the added variability on statistical
inferences reached using the masked data (compared
to that using the original data) must be addressed.
Only a few techniques address this issue directly
(Rubin 1993, Fienberg et al. 1998a, Burridge 2003).
Our objective is to present a data swapping methodology based on the conditional distribution approach (albeit with certain assumptions). Prior to describing our new approach, we briefly describe existing methods for swapping numerical data.

3. Existing Methods for Swapping Numerical Data
Dalenius and Reiss (1982) originally proposed data
swapping for masking confidential variables. Fienberg
and McIntyre (2005) comprehensively discuss data
swapping and its relationship to other methods. In this
section, we focus on methods for swapping numerical variables. Reiss et al. (1982) made the first attempt,
swapping the data using an optimization approach to
maintain the first- and second-order moments. This
approach is not based on the conditional distribution
approach. It is also known to be computationally difficult and its disclosure risk remains to be evaluated.
Moore (1996) describes the best-known data swapping procedure, based on a data swapping algorithm proposed by Brian Greenberg in an unpublished manuscript. The rank-based proximity swap (hereafter referred to simply as data swapping) for numerical variables can be described as follows:
Sort the data by confidential variable j. Set the value of y_{(k),j} = x_{(i),j} and y_{(i),j} = x_{(k),j}. Repeat the process for every i to result in Y_j, and repeat the process for every j to result in Y.
Moore (1996, p. 6) uses a masking parameter called the "swapping distance" parameter, defined as follows:

Determine a value P(a), with 0 ≤ P(a) ≤ 100. The intent of the procedure is to swap the value of a_i with that of a_j, so that the percentage difference of the indices, i and j, is less than P(a) percent of N. That is, |i − j| < P(a) · N/100.
The larger the value of P(a), the larger the possible value of |i − j|, and the greater the distance between the swapped values, and vice versa. Data swapping's biggest advantage is that the marginal distributions of the individual confidential variables are identical to those of the original variables. Assuming a uniform distribution, Moore (1996) also shows an inverse relationship between swapping distance and data utility, and a direct relationship between swapping distance and disclosure risk, resulting in a trade-off.
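To make the swap mechanics concrete, the following is a minimal sketch of a rank-based proximity swap for a single numeric variable. It is illustrative only: Moore's (1996) algorithm specifies the pairing more carefully, whereas this sketch simply picks a random not-yet-swapped partner within the rank window, and the function name and interface are ours.

```python
import numpy as np

def proximity_swap(x, swap_pct, rng=None):
    """Rank-based proximity swap of one numeric variable (illustrative).

    Values are swapped only with partners whose ranks lie within
    swap_pct percent of N, mirroring the role of the P(a) parameter.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    n = len(x)
    window = max(1, int(swap_pct * n / 100))  # maximum allowed rank distance
    order = np.argsort(x)                     # record indices, sorted by value
    swapped = np.zeros(n, dtype=bool)
    y = x.copy()
    for pos in range(n):
        i = order[pos]
        if swapped[i]:
            continue
        # candidate partners: not-yet-swapped records within the rank window
        cand = [order[q] for q in range(pos + 1, min(n, pos + window + 1))
                if not swapped[order[q]]]
        if not cand:
            continue
        k = cand[rng.integers(len(cand))]
        y[i], y[k] = y[k], y[i]               # exchange the two original values
        swapped[i] = swapped[k] = True
    return y
```

Because the output is a permutation of the input, the marginal distribution is preserved exactly; smaller values of swap_pct keep swapped values closer in rank, improving utility at the cost of disclosure risk.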
Recently, Carlson and Salabasis (2002) (hereafter referred to as C&S) describe a new method for data swapping. For this procedure, first consider a database D consisting of a set of variables S and X. Assume that the database is randomly divided into two (or even more) data sets D1 and D2 containing (S1, X1) and (S2, X2), respectively. C&S (2002) suggest ranking both databases with respect to variable X_j.
Since the two databases are independent samples, the ordered values of X_{j1} should be approximately equal to those of X_{j2}, and the values of the masked variable Y_{j1} are created by replacing the ordered values of X_{j1} by those of X_{j2}. Similarly, another set of masked variables Y_{j2} may be created by replacing the ordered values of X_{j2} by those of X_{j1}. The process is then repeated for all j to result in Y_1 and Y_2. Two new databases, D_1^* and D_2^*, are then created by combining (S_1, Y_1) and (S_2, Y_2) (Fienberg 2002). The two databases can then be potentially recombined to form D^*. The authors also provide a proof for this procedure as well as simulation results demonstrating its asymptotic properties.
The C&S (2002) method, while innovative, focuses
almost exclusively on data utility and does not
directly address disclosure risk. The authors acknowledge that “the swap may not provide an acceptable
level of perturbation and effectively mask the data.”
Beyond this broad statement, no specific disclosure
risk measures were evaluated. As we show in §5, this
procedure provides practically no protection from either
identity or value disclosure.
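A minimal sketch of the split-and-swap idea for one variable follows, assuming an even number of observations, a random split into two equal halves, and recombination into a single masked column; the function name is ours, not from C&S (2002).

```python
import numpy as np

def cs_swap(x, rng=None):
    """Sketch of the Carlson and Salabasis (2002) idea for one variable.

    The data set is split at random into two halves; the rank-ordered
    values of each half replace the rank-ordered values of the other.
    Assumes an even number of observations for simplicity.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    n = len(x)
    perm = rng.permutation(n)
    half1, half2 = perm[: n // 2], perm[n // 2:]
    y = np.empty_like(x)
    # the ith-smallest record of one half receives the ith-smallest
    # value of the other half
    y[half1[np.argsort(x[half1])]] = np.sort(x[half2])
    y[half2[np.argsort(x[half2])]] = np.sort(x[half1])
    return y
```

As §5.2 shows, for large n the ith-ordered value of one half is nearly identical to the ith-ordered value of the other, which is precisely why this swap masks so little.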
Liew et al. (1985) proposed a method closely resembling the C&S (2002) method. They suggest that the
univariate marginal distribution of each confidential
variable be empirically identified. Using the respective distributions, n observations are generated for
each confidential variable. The rank ordered original
values are then replaced by the rank-ordered generated values for each variable. The Liew et al. (1985)
procedure also focuses on data utility without appropriate consideration of disclosure risk. Consequently,
it has been shown to have very high disclosure risk
(Adam and Wortmann 1989, Muralidhar et al. 1999).
The C&S (2002) method has two major advantages
compared with the Liew et al. (1985) procedure. First,
the Liew et al. (1985) procedure does not use the
original values in the database, but instead replaces
the values by randomly generated values. Second, the
Liew et al. (1985) procedure requires the identification of the underlying univariate distribution of each
confidential variable. The C&S (2002) method overcomes both these problems by dividing the database
and using the rank-ordered values within the subsets.
Dandekar et al. (2002) proposed a method called
Latin Hypercube sampling for data swapping. This
method attempts to recreate an independent data set
with the same marginal characteristics and the same
rank-order correlation as the original data set. Unfortunately, this procedure can be used only when all
the variables in the database are confidential and
must be swapped. When nonconfidential variables are
present, this procedure would require either swapping the nonconfidential variables also, or swapping
the confidential variables independently of the nonconfidential variables. Hence, this limits the method’s
applicability.
Thus the literature on data swapping reveals that
current procedures for swapping numerical variables
do not simultaneously address data utility and disclosure risk, because they are not based on the
conditional distribution approach. The conditional
distribution approach provides significant advantages
that outweigh the disadvantages. Fienberg et al.
(1998b) argue that this approach is preferable to using
ad hoc procedures, even with certain assumptions.
Our objective in this paper is to develop a data swapping procedure based on the conditional distribution
approach—a procedure that provides utility comparable with existing swapping methods without the
adverse disclosure risks.
4. The Data Shuffling Procedure
The data shuffling procedure is based on the conditional distribution approach and can be described as follows:

Step 1. Generate an observation y^P_i from the conditional distribution f_{X|S}(X | S = s_i) such that, given S = s_i, Y^P is independent of X.
Step 2. Repeat the process for every observation in the data set to obtain Y^P.
Step 3. Replace y^P_{(i),j} with x_{(i),j}, i = 1, …, N, j = 1, …, M, to obtain Y; that is, perform the reverse mapping Y_j = Y^P_j ← X_{(j)}.
Step 4. Release the reordered (or shuffled) data set.
Steps 1 and 2 of the shuffling procedure are similar to perturbation and "generate" a new set of values for the confidential variables using the conditional density f_{X|S}(X|S). However, the y^P_{i,j} values generated in this manner will, in general, be different from the observations in X. This "shortcoming" can be addressed by taking advantage of the fact that F_Y(Y^P_j) = F_X(X_j), j = 1, …, M. Hence, in Step 3, for each confidential variable, we replace the ordered values of Y^P_j with the ordered original values of X_j, ensuring that the shuffled data consist of the original values. Since both Y^P_j and X_j are independent realizations from the same distribution, both have the same characteristics in terms of both data utility and disclosure risk (C&S 2002, p. 38). In other words, the reverse mapping does not alter the disclosure risk characteristics of the perturbed data. Thus, theoretically, the data shuffling procedure satisfies the ideal data utility and disclosure risk requirements. However, as discussed earlier, the joint distribution of X and S is unknown in practice. Hence, heuristic approaches are needed to implement data shuffling.
4.1. Implementation of Data Shuffling
We propose a heuristic implementation of data
shuffling based on the perturbation method proposed
by Sarathy et al. (2002), using the multivariate copula to model the joint density of a data set where the variables have arbitrary marginal distributions and specified dependence characteristics (Nelsen 1999). Let R represent the rank-order correlation matrix of (X, S). Define variables X* and S* as follows:

$$x^*_{i,j} = \Phi^{-1}(F_{X_j}(x_{i,j})), \quad j = 1, \ldots, M,\; i = 1, \ldots, N, \quad\text{and}\quad s^*_{i,k} = \Phi^{-1}(F_{S_k}(s_{i,k})), \quad k = 1, \ldots, L,\; i = 1, \ldots, N, \qquad (5)$$

where F_{X_j} and F_{S_k} represent the cumulative distribution functions of variables X_j and S_k, respectively, and Φ^{-1} represents the inverse of the standard normal distribution function. The joint density of X* and S* is described by a multivariate standard normal distribution with correlation matrix Σ, and the relationship between R and Σ can be described by (Clemen and Reilly 1999)

$$\sigma_{ij} = 2\sin\left(\frac{\pi r_{ij}}{6}\right), \qquad (6)$$

where the r_{ij} are the elements of R. Since X* and S* have a joint multivariate normal distribution, it is now possible to generate the perturbed values Y* as

$$y^*_i = \Sigma_{X^*S^*}\,\Sigma_{S^*S^*}^{-1}\,s^*_i + e_i, \qquad (7)$$

where e_i ∼ MVN(0, Σ_{X^*X^*} − Σ_{X^*S^*}Σ_{S^*S^*}^{-1}Σ_{S^*X^*}). The values of Y* can then be retransformed back to the original marginal distribution of Y^P by

$$y^P_{i,j} = F_{X_j}^{-1}(\Phi(y^*_{i,j})), \quad j = 1, \ldots, M,\; i = 1, \ldots, N. \qquad (8)$$

The copula-based perturbation approach requires that, for a given data set (X, S) with rank-order correlation matrix R, the marginal distributions of X and S be known, so that a random observation can be generated from the specified marginal distribution. This step is needed to generate a new value for the perturbed variable using Equation (8). However, in the data shuffling approach, we do not need to generate a new value; we only need the rank of the perturbed value. Hence, this procedure can be simplified further by using the following transformation in place of the one used in Equation (5):

$$s^*_{(i),k} = \Phi^{-1}\left(\frac{i - 0.5}{N}\right), \quad k = 1, \ldots, L,\; i = 1, \ldots, N, \qquad (9)$$

where i represents the rank order of s_{i,k}. The values of y^*_i are generated from

$$y^*_i = \Sigma_{X^*S^*}\,\Sigma_{S^*S^*}^{-1}\,s^*_i + e_i, \qquad (10)$$

where e_i ∼ MVN(0, Σ_{X^*X^*} − Σ_{X^*S^*}Σ_{S^*S^*}^{-1}Σ_{S^*X^*}). Since the rank of y^*_{(i),j} equals the rank of y^P_{(i),j}, replacing y^*_{(i),j} with x_{(i),j}, j = 1, …, M, i = 1, …, N, results in Y. Note that Equation (10) maintains conditional independence, ensuring that, given S, X and Y are independent. Thus the reverse-mapped values retain the same data utility and disclosure risk as the original perturbed values (C&S 2002). If all the variables in the data set are confidential, an independent multivariate normal data set with correlation matrix Σ is generated, and reverse mapping is performed on this data set. The simulation experiments to assess the disclosure risk characteristics of the data shuffling procedure (§§5.1 and 5.2) provide empirical verification of this result.
The data shuffling approach does not require the
marginal distributions of X and S to be identified, nor
does it require any of the actual values in the data set
to be used in the process of masking. The only data
that is required is the rank-order correlation matrix of
the original variables and the ranks of the values of
the nonconfidential variables s_{(i),k}. In many cases, the
owners of the data set may not have the necessary
expertise to perform the shuffling. The nonparametric
data shuffling approach provides a methodology by
which the shuffling can be performed securely using a
third party that would never have access to any actual
confidential data values.
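The steps above can be collected into a short sketch. The code below is our illustrative reading of the nonparametric procedure (Equations (6), (9), and (10) followed by the Step 3 reverse mapping); it assumes the Spearman correlations are computed from the data itself, it ignores ties and the numerical safeguards a production implementation would need, and the name data_shuffle is ours.

```python
import numpy as np
from scipy.stats import norm, rankdata

def data_shuffle(X, S, rng=None):
    """Nonparametric data shuffling (illustrative sketch).

    X : (N, M) confidential variables; S : (N, L) nonconfidential variables.
    Returns Y, a shuffled copy of X (the original values, reordered).
    """
    rng = np.random.default_rng() if rng is None else rng
    X, S = np.asarray(X, float), np.asarray(S, float)
    N, M = X.shape

    # Eq. (6): convert the rank-order (Spearman) correlation matrix of
    # (X, S) to the product-moment correlation of the normal copula.
    Z = np.hstack([X, S])
    R = np.corrcoef(np.apply_along_axis(rankdata, 0, Z), rowvar=False)
    Sigma = 2 * np.sin(np.pi * R / 6)

    # Eq. (9): normal scores of the nonconfidential variables.
    S_star = norm.ppf((np.apply_along_axis(rankdata, 0, S) - 0.5) / N)

    # Eq. (10): draw Y* from the conditional multivariate normal given S*.
    Sxx, Sxs, Sss = Sigma[:M, :M], Sigma[:M, M:], Sigma[M:, M:]
    B = Sxs @ np.linalg.inv(Sss)               # regression coefficients
    cond_cov = Sxx - B @ Sxs.T                 # conditional covariance
    noise = rng.multivariate_normal(np.zeros(M), cond_cov, size=N)
    Y_star = S_star @ B.T + noise

    # Step 3: reverse mapping -- the ith-ranked Y* value of each variable
    # is replaced by the ith-ranked original value of that variable.
    Y = np.empty_like(X)
    for j in range(M):
        Y[np.argsort(Y_star[:, j]), j] = np.sort(X[:, j])
    return Y
```

Note that only the ranks of S and the rank-order correlation matrix enter the computation, which is what makes the secure third-party arrangement described above possible.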
4.2. Impact of Sample Size and Sampling Error on Data Shuffling
As with other techniques, data shuffling adds variability to parameter estimates, potentially affecting statistical inferences. This variability arises from the noise term used to generate perturbed values from the conditional distribution f_{X|S}(X|S) and from the reverse mapping process. The variability is likely very small for large data sets, but may be significant for small data sets. The extent of this variability is difficult to predict theoretically. We provide an empirical evaluation in the next section. Just as perturbation
procedures have been modified to account for such
variability (Rubin 1993, Fienberg et al. 1998a, Burridge
2003), modifications may be developed for the data
shuffling procedure as well.
In summary, we view the data shuffling procedure presented in this study to be part of the evolution of data swapping. Data perturbation approaches
have also evolved from simple noise addition to
current approaches. As Fienberg (2002) suggests, an
important step in this evolution is developing a data
swapping technique with underlying principles that
are the same as perturbation techniques. Data shuffling combines the most recent (rank-based) copula
perturbation approach with the most recent swapping
approach (C&S 2002 method).³

³ Although we originally proposed data shuffling independently of C&S (2002), we believe that the rank-ordered swapping of the marginal values represents the common link between the two methods.
5. An Empirical Comparison of Data Swapping Methods
In this section, we describe simulation experiments to evaluate the relative performance of data swapping, the C&S (2002) method, and data shuffling. Data swapping requires specifying a masking (or proximity) parameter. Three different values of the proximity parameter were used (P(a) = 10%, 50%, and 100%), hereafter referred to as DS10%, DS50%, and DS100%. Two sets of experiments were conducted to evaluate data utility and disclosure risk.
5.1. Impact of Data Set Size on Utility of Masked Data
The objective of the first experiment was to investigate the impact of the size of the data set on the product-moment correlation of the masked data compared to the original data (similar to the experiment in C&S 2002, §5.4, p. 48). In the experiment, a data set of size n = 30, 100, 300, or 1,000 was generated from a bivariate normal distribution with a specified population correlation coefficient ρ = 0.0, 0.2, 0.4, 0.6, 0.8, or 0.95. It was assumed that both variables in the data set were to be masked. Five different masked data sets (C&S 2002 method, data shuffling, and three different swaps) were generated. In implementing the C&S (2002) method, in addition to the generated data set, it was assumed that a second independent data set of the same size was available (and generated for this purpose). The rank-ordered values of both variables in the first set were swapped for the rank-ordered values of the respective variables in the second set. We chose this implementation because C&S (2002) show that this procedure results in lower bias than swapping from two different data sets. As in the C&S study, we measured both the bias (which results in attenuation in correlation) and the standard error (SE) of the bias, using the sample correlation coefficient as the benchmark (see C&S 2002, p. 49). We replicated each combination of data set size and correlation coefficient 10,000 times, and computed the average bias and SE of the sample correlation coefficient from the 10,000 replications.
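One cell of this design can be sketched as follows for the univariate masks (the proximity swap and the C&S swap); data shuffling with both variables confidential would instead generate an independent bivariate normal data set with correlation matrix Σ and reverse-map it, as described in §4.1. The helper below and its interface are illustrative.

```python
import numpy as np

def bias_and_se(mask_fn, rho=0.4, n=100, reps=10_000, seed=1):
    """One (rho, n) cell of the Section 5.1 design: average bias and SE
    of the masked-data correlation, benchmarked against the sample
    correlation of the unmasked data (as in C&S 2002)."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    diffs = np.empty(reps)
    for r in range(reps):
        data = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        masked = np.column_stack([mask_fn(data[:, 0], rng=rng),
                                  mask_fn(data[:, 1], rng=rng)])
        diffs[r] = (np.corrcoef(masked, rowvar=False)[0, 1]
                    - np.corrcoef(data, rowvar=False)[0, 1])
    return diffs.mean(), diffs.std(ddof=1)

# e.g., DS10% using the proximity_swap sketch from Section 3:
# bias, se = bias_and_se(lambda v, rng=None: proximity_swap(v, 10, rng))
```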
Table 1 shows the results of the simulation experiment, which are almost identical to the results provided by C&S (2002, Tables 2–5, pp. 50–51). In terms
of data utility, both the C&S (2002) method and data
shuffling show attenuation in sample correlation coefficient for small n that quickly approaches zero as
n increases. For all three swapping methods, the
attenuation is higher and does not approach zero as
n increases. The attenuation is highest (and almost
equal to the specified correlation) for DS100%, followed by DS50%, and DS10%. This is to be expected
since DS100% represents the case where the values are swapped randomly, which almost completely destroys
any relationships that might have existed.
The results are similar when we evaluate SE. For
small n, the SE is rather high even for the C&S (2002)
method and data shuffling. Fienberg (2002) observes
the same with the C&S (2002) method. This indicates
that even though both the C&S method and data shuffling maintain correlations between variables asymptotically, for small data sets the correlation attenuation
in the masked data could be significant. Hence, for
small data sets, these methods must be used with
caution. The SEs of DS10%, DS50%, and DS100% are
higher than those observed for the C&S (2002) method
and data shuffling. Thus, purely from the perspective
of data utility as measured by the attenuation in correlation and the SE, it is hard to distinguish between
data shuffling and the C&S (2002) method. In terms
of relative performance, data shuffling and the C&S
(2002) method perform the best, followed by DS10%,
DS50%, and DS100%.
The similarity in performance of the C&S (2002)
method and data shuffling for the multivariate normal case is not surprising because the shuffled values
are generated from the true conditional distribution.
Generally, the C&S (2002) method does not make
assumptions regarding the joint distribution of X
and S, so it may provide better data utility than the
data shuffling procedure. However, in data shuffling,
the values of the confidential variables are perturbed prior
to the rank-ordered replacement. This is not the case for
the C&S (2002) method. As the following discussion
shows, this difference has a major impact on the disclosure risk of the two methods.
5.2. Assessing Risk of Value Disclosure
In this experiment, as a surrogate measure of value disclosure, we computed the proportion of variability that is explained in the confidential variable by using the corresponding masked variable (R²_{X1,Y1} and R²_{X2,Y2}) for all five masking methods (Fuller 1993). To satisfy Equation (3), it is necessary that R²_{X·(S,Y)} = R²_{X·S}. In this case, since S is null, to minimize disclosure risk it is necessary that R²_{X1,Y1} = R²_{X2,Y2} = 0. Table 2 provides the average R²_{X1,Y1} and R²_{X2,Y2} resulting from each combination of correlation coefficient and data set size for all five masking methods.
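The surrogate measure itself is simply the squared product-moment correlation between a confidential variable and its masked counterpart; a minimal sketch, with an illustrative name:

```python
import numpy as np

def value_disclosure_r2(x, y):
    """R^2 between a confidential variable x and its masked version y:
    the proportion of variability in x an intruder can explain using y."""
    return np.corrcoef(x, y)[0, 1] ** 2
```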
From Table 2 it is evident that the C&S (2002)
method provides practically no security against risk
of value disclosure. Even for a data set of size 30, the
proportion of variability in the confidential variable
explained using the masked variable is more than
93% in all cases. When the data set size is 1,000, the
proportion of variability in the confidential variable
using the masked variable is more than 99.95% in
all cases. The poor disclosure risk performance of
the C&S (2002) method is not surprising.
Table 1    Bias and Standard Error Resulting from Simulation Experiment

| n | ρ | C&S (2002) Bias | C&S (2002) SE | Data shuffling Bias | Data shuffling SE | DS10% Bias | DS10% SE | DS50% Bias | DS50% SE | DS100% Bias | DS100% SE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 30 | 0.00 | −0.000641 | 0.066915 | −0.001352 | 0.068458 | 0.000551 | 0.108844 | −0.001045 | 0.233072 | −0.000950 | 0.263763 |
| 30 | 0.20 | −0.013962 | 0.066355 | −0.013758 | 0.067250 | −0.032443 | 0.107363 | −0.157984 | 0.225624 | −0.200020 | 0.259527 |
| 30 | 0.40 | −0.027262 | 0.063791 | −0.026777 | 0.063063 | −0.065604 | 0.101812 | −0.307165 | 0.217107 | −0.396881 | 0.245209 |
| 30 | 0.60 | −0.036869 | 0.057457 | −0.037706 | 0.057452 | −0.095700 | 0.093522 | −0.460840 | 0.202012 | −0.594337 | 0.224294 |
| 30 | 0.80 | −0.043728 | 0.046396 | −0.043730 | 0.046919 | −0.118208 | 0.079000 | −0.610875 | 0.186656 | −0.793003 | 0.201219 |
| 30 | 0.95 | −0.034909 | 0.030778 | −0.035300 | 0.031542 | −0.129913 | 0.062850 | −0.723215 | 0.177532 | −0.946191 | 0.189804 |
| 100 | 0.00 | −0.000107 | 0.022496 | −0.000260 | 0.022872 | −0.000989 | 0.050805 | −0.001064 | 0.124219 | −0.000050 | 0.142618 |
| 100 | 0.20 | −0.005065 | 0.022104 | −0.005090 | 0.022358 | −0.025164 | 0.050074 | −0.149044 | 0.121077 | −0.198753 | 0.140435 |
| 100 | 0.40 | −0.010030 | 0.021204 | −0.009880 | 0.021248 | −0.049637 | 0.047616 | −0.296934 | 0.116631 | −0.397648 | 0.133167 |
| 100 | 0.60 | −0.014155 | 0.019108 | −0.014178 | 0.019010 | −0.071887 | 0.042576 | −0.441791 | 0.106818 | −0.597483 | 0.120205 |
| 100 | 0.80 | −0.016616 | 0.015279 | −0.016775 | 0.015314 | −0.090536 | 0.036116 | −0.582097 | 0.096673 | −0.797881 | 0.106751 |
| 100 | 0.95 | −0.013024 | 0.008828 | −0.013062 | 0.009120 | −0.099643 | 0.028103 | −0.684039 | 0.089905 | −0.950754 | 0.102096 |
| 300 | 0.00 | −0.000018 | 0.007890 | 0.000340 | 0.007869 | 0.000105 | 0.027860 | 0.001819 | 0.069805 | 0.001969 | 0.082571 |
| 300 | 0.20 | −0.002102 | 0.007752 | −0.001658 | 0.007840 | −0.022744 | 0.027705 | −0.146747 | 0.067926 | −0.199422 | 0.080740 |
| 300 | 0.40 | −0.004086 | 0.007396 | −0.003557 | 0.007498 | −0.045264 | 0.026525 | −0.291108 | 0.066483 | −0.398856 | 0.075039 |
| 300 | 0.60 | −0.005704 | 0.006826 | −0.005220 | 0.006721 | −0.065764 | 0.023619 | −0.434570 | 0.059363 | −0.599187 | 0.070232 |
| 300 | 0.80 | −0.006736 | 0.005494 | −0.006326 | 0.005365 | −0.082758 | 0.019968 | −0.575702 | 0.054769 | −0.800379 | 0.061630 |
| 300 | 0.95 | −0.005026 | 0.002974 | −0.004868 | 0.002844 | −0.091490 | 0.014765 | −0.674581 | 0.051489 | −0.948211 | 0.058183 |
| 1,000 | 0.00 | −0.000077 | 0.002642 | 0.000031 | 0.002622 | 0.000070 | 0.015727 | −0.001482 | 0.038503 | 0.000286 | 0.042941 |
| 1,000 | 0.20 | −0.000823 | 0.002587 | −0.000745 | 0.002598 | −0.022539 | 0.014937 | −0.147750 | 0.038802 | −0.200895 | 0.042614 |
| 1,000 | 0.40 | −0.001332 | 0.000006 | −0.001518 | 0.000006 | −0.043130 | 0.000196 | −0.292616 | 0.001266 | −0.402256 | 0.001493 |
| 1,000 | 0.60 | −0.001884 | 0.002223 | −0.001976 | 0.002264 | −0.064120 | 0.012245 | −0.435326 | 0.032524 | −0.600270 | 0.035648 |
| 1,000 | 0.80 | −0.002268 | 0.001793 | −0.002264 | 0.001676 | −0.079229 | 0.010176 | −0.572165 | 0.030401 | −0.799787 | 0.032682 |
| 1,000 | 0.95 | −0.001740 | 0.000916 | −0.001675 | 0.000884 | −0.087542 | 0.007910 | −0.670493 | 0.027450 | −0.949409 | 0.031424 |
C&S (2002, p. 38) observe, "Furthermore, since E[X_{(r),1}] = E[X_{(r),2}], one would expect X_{(r),1} to be approximately equal to X_{(r),2} for large n." Conversely, this statement also means that the masked value would be approximately equal to the original value. As the results in Table 2 show, for large n, the level of masking would be negligible, resulting in a high risk of value disclosure. By contrast, Table 2 shows that data shuffling
provides good security against the risk of value disclosure. In almost all cases, the proportion of variability explained in the confidential variable using the masked variable is close to 0. The performance of DS100% compares with that of data shuffling, with R² close to 0 in all cases. Thus, data shuffling and DS100% almost completely eliminate the risk of value disclosure. DS50% and DS10% provide some information to the intruder (the proportion of variability is greater than 0), but perform better than the C&S (2002) method.
Table 2    Value Disclosure Results from Simulation Experiment (Proportion of Variability Explained)

| n | ρ | C&S R²(X1,Y1) | C&S R²(X2,Y2) | Shuffling R²(X1,Y1) | Shuffling R²(X2,Y2) | DS10% R²(X1,Y1) | DS10% R²(X2,Y2) | DS50% R²(X1,Y1) | DS50% R²(X2,Y2) | DS100% R²(X1,Y1) | DS100% R²(X2,Y2) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 30 | 0.00 | 0.934529 | 0.934854 | 0.000000 | 0.000001 | 0.828411 | 0.829045 | 0.215126 | 0.218040 | 0.001025 | 0.001145 |
| 30 | 0.20 | 0.934529 | 0.934612 | 0.000000 | 0.000005 | 0.828411 | 0.829571 | 0.215138 | 0.217683 | 0.001025 | 0.001127 |
| 30 | 0.40 | 0.934529 | 0.934200 | 0.000000 | 0.000000 | 0.828411 | 0.829046 | 0.215148 | 0.217049 | 0.001025 | 0.001121 |
| 30 | 0.60 | 0.935262 | 0.935353 | 0.000003 | 0.000024 | 0.827240 | 0.828568 | 0.215452 | 0.216955 | 0.001467 | 0.001129 |
| 30 | 0.80 | 0.935456 | 0.935104 | 0.000001 | 0.000001 | 0.829357 | 0.829117 | 0.215700 | 0.217574 | 0.001096 | 0.000835 |
| 30 | 0.95 | 0.934534 | 0.934669 | 0.000003 | 0.000000 | 0.828431 | 0.828306 | 0.214481 | 0.217094 | 0.000955 | 0.001187 |
| 100 | 0.00 | 0.975144 | 0.974946 | 0.000000 | 0.000017 | 0.871473 | 0.871500 | 0.256386 | 0.257803 | 0.000108 | 0.000097 |
| 100 | 0.20 | 0.975167 | 0.974961 | 0.000002 | 0.000001 | 0.871176 | 0.871048 | 0.256094 | 0.257500 | 0.000107 | 0.000121 |
| 100 | 0.40 | 0.975144 | 0.974972 | 0.000001 | 0.000000 | 0.871473 | 0.871071 | 0.256386 | 0.257524 | 0.000108 | 0.000102 |
| 100 | 0.60 | 0.975167 | 0.974993 | 0.000000 | 0.000005 | 0.871176 | 0.871274 | 0.256094 | 0.257529 | 0.000107 | 0.000126 |
| 100 | 0.80 | 0.975139 | 0.974975 | 0.000000 | 0.000001 | 0.871086 | 0.871613 | 0.255945 | 0.257399 | 0.000113 | 0.000101 |
| 100 | 0.95 | 0.975131 | 0.975073 | 0.000000 | 0.000000 | 0.871382 | 0.871811 | 0.256096 | 0.257585 | 0.000112 | 0.000100 |
| 300 | 0.00 | 0.990210 | 0.990387 | 0.000001 | 0.000000 | 0.882647 | 0.883192 | 0.266160 | 0.267111 | 0.000012 | 0.000006 |
| 300 | 0.20 | 0.990230 | 0.990459 | 0.000004 | 0.000003 | 0.882726 | 0.883334 | 0.266230 | 0.267310 | 0.000012 | 0.000006 |
| 300 | 0.40 | 0.990233 | 0.990371 | 0.000004 | 0.000001 | 0.882849 | 0.883464 | 0.266252 | 0.267737 | 0.000014 | 0.000006 |
| 300 | 0.60 | 0.990213 | 0.990316 | 0.000006 | 0.000005 | 0.882670 | 0.883645 | 0.266144 | 0.267684 | 0.000013 | 0.000006 |
| 300 | 0.80 | 0.990236 | 0.990177 | 0.000000 | 0.000002 | 0.882766 | 0.883278 | 0.266151 | 0.267329 | 0.000012 | 0.000006 |
| 300 | 0.95 | 0.990222 | 0.990117 | 0.000000 | 0.000001 | 0.882731 | 0.882752 | 0.266178 | 0.267108 | 0.000013 | 0.000005 |
| 1,000 | 0.00 | 0.996733 | 0.996673 | 0.000000 | 0.000000 | 0.887913 | 0.886925 | 0.272723 | 0.271392 | 0.000000 | 0.000033 |
| 1,000 | 0.20 | 0.996668 | 0.996644 | 0.000001 | 0.000009 | 0.887913 | 0.886911 | 0.272723 | 0.271262 | 0.000000 | 0.000032 |
| 1,000 | 0.40 | 0.996650 | 0.996619 | 0.000000 | 0.000005 | 0.887913 | 0.887015 | 0.272723 | 0.271304 | 0.000000 | 0.000031 |
| 1,000 | 0.60 | 0.996651 | 0.996584 | 0.000000 | 0.000000 | 0.887935 | 0.887288 | 0.272724 | 0.271375 | 0.000000 | 0.000032 |
| 1,000 | 0.80 | 0.996690 | 0.996558 | 0.000000 | 0.000001 | 0.887943 | 0.887658 | 0.272744 | 0.271563 | 0.000000 | 0.000030 |
| 1,000 | 0.95 | 0.996739 | 0.996637 | 0.000001 | 0.000000 | 0.887913 | 0.887864 | 0.272723 | 0.271674 | 0.000000 | 0.000030 |
5.3. Assessing Risk of Identity Disclosure
We conducted an additional simulation experiment
to assess the risk of identity disclosure. In this simulation, a multivariate normal data set with k variables
and n observations was generated with a mean vector of 0 and identity covariance matrix. As before, the
data was masked using the five masking methods. For
each observation, we assumed that the intruder has
the vector of the original (unmasked) values. Using
this information, the intruder attempts to match the
true value vector to the masked data that is released
(see Fuller 1993, p. 388, Equation 11). The procedure
is then repeated for each observation. The percentage
of successful matches for each masking technique was
recorded. The entire procedure was replicated 1,000
times. The simulation experiment was conducted for
n = 30, 100, 300, and 1,000 and for k = 2 to 6. This
assessment of disclosure risk is similar to those used
by Fuller (1993) and Winkler (1998).
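A simple nearest-neighbor stand-in for this matching attack is sketched below. The published experiment uses the Fuller (1993, Equation 11) matching procedure, which is more elaborate, so treat this as illustrative only; the function name is ours.

```python
import numpy as np

def reidentification_rate(X, Y):
    """Fraction of records an intruder reidentifies by linking each
    original record to its nearest masked record (Euclidean distance).
    X, Y : (n, k) arrays of original and masked records, row-aligned,
    so a correct match is the record with the same row index."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return float(np.mean(d2.argmin(axis=1) == np.arange(len(X))))
```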
The results of the simulation experiment, provided
in Table 3, clearly show that the C&S (2002) method
performs poorly in preventing identity disclosure. Even
with just two variables, an intruder could reidentify
more than 60% of all observations in the data set. With
five or more variables, the probability of reidentification exceeds 99% for all data sets. In these cases, the
C&S (2002) method provides practically no security.
By contrast, data shuffling provides excellent security against risk of identity disclosure. In almost all
cases, the percentage of observations reidentified is
close to 1/n, the probability of reidentification by
chance alone. We performed a hypothesis test (with
the null hypothesis that the proportion of reidentified observations = 1/n versus the alternative hypothesis that the proportion of reidentified observations
is > 1/n) for the results of data shuffling. In all cases,
we were unable to reject the null hypothesis, indicating that the proportion of reidentified observations is
not significantly different from 1/n.
The reidentification results for the three data swapping techniques follow the expected pattern. DS100%
provides the best results, almost comparable to those
of data shuffling, followed by DS50%, and finally
DS10%. However, all three data swapping procedures
provide higher security than the C&S (2002) method.
An intruder could use more sophisticated techniques
for reidentification such as those described in Winkler
(1995a, b) and Fienberg et al. (1997). However, given
that data shuffling and DS100% outperform the other
methods by a significant margin, our overall conclusion is unlikely to change.
Table 3    Identity Disclosure Results (Percentage of Observations Reidentified)

| Data set size | Number of variables | C&S (2002) method (%) | Data shuffle (%) | Data swapping 10% (%) | Data swapping 50% (%) | Data swapping 100% (%) |
|---|---|---|---|---|---|---|
| 30 | 2 | 72.42 | 3.55 | 60.04 | 26.61 | 4.29 |
| 30 | 3 | 93.66 | 3.98 | 82.43 | 45.75 | 3.98 |
| 30 | 4 | 98.57 | 4.20 | 91.05 | 56.06 | 4.03 |
| 30 | 5 | 99.73 | 4.25 | 94.83 | 61.02 | 4.10 |
| 30 | 6 | 99.97 | 3.89 | 96.66 | 61.82 | 4.10 |
| 100 | 2 | 66.46 | 1.03 | 55.32 | 10.37 | 1.14 |
| 100 | 3 | 93.85 | 1.03 | 87.15 | 26.95 | 1.14 |
| 100 | 4 | 99.01 | 1.06 | 96.56 | 50.32 | 1.21 |
| 100 | 5 | 99.86 | 1.11 | 99.11 | 70.98 | 1.24 |
| 100 | 6 | 99.96 | 1.01 | 99.61 | 84.31 | 1.21 |
| 300 | 2 | 62.62 | 0.32 | 48.68 | 3.90 | 0.38 |
| 300 | 3 | 95.13 | 0.34 | 89.51 | 11.55 | 0.38 |
| 300 | 4 | 99.44 | 0.35 | 98.33 | 27.26 | 0.47 |
| 300 | 5 | 99.93 | 0.35 | 99.74 | 50.55 | 0.40 |
| 300 | 6 | 99.99 | 0.32 | 99.94 | 72.40 | 0.42 |
| 1,000 | 2 | 61.18 | 0.11 | 37.01 | 1.33 | 0.11 |
| 1,000 | 3 | 96.93 | 0.09 | 88.97 | 4.43 | 0.12 |
| 1,000 | 4 | 99.66 | 0.10 | 98.63 | 11.72 | 0.12 |
| 1,000 | 5 | 99.97 | 0.10 | 99.88 | 25.50 | 0.12 |
| 1,000 | 6 | 100.00 | 0.12 | 99.99 | 46.07 | 0.13 |
The results in Tables 2 and 3 also confirm another
important aspect of data shuffling. A key aspect
of perturbation based on the conditional distribution approach is the conditional independence of
X and Y given S. The results, shown in Tables 2
and 3, confirm empirically that the conditional independence assumption holds well even for small data
sets. Hence, data shuffling provides very high security against risk of both value and identity disclosure.
We also constructed the R-U confidentiality map (Duncan et al. 2001) (see Figure 1) to illustrate the relative performance of the different methods for n = 100 and specified population correlation coefficient ρ = 0.4. As a surrogate measure of data utility, we used the percentage absolute bias (|Bias|/0.4). We used value disclosure as the measure of disclosure risk. The R-U confidentiality map clearly shows that data shuffling is superior to the other approaches. For the data swapping methods, a trade-off appears between data utility and disclosure risk. The C&S (2002) method provides high data utility, but also high disclosure risk. Data shuffling is the only method that simultaneously provides the highest possible data utility and lowest disclosure risk. Although we present only one graph, it can easily be verified that other n and ρ combinations would provide similar results. In conclusion, data shuffling is preferable to the other techniques used in this study.
Figure 1    R-U Confidentiality Map for n = 100 and ρ = 0.40
[Figure omitted: value disclosure risk (0%–100%) plotted against percentage absolute bias (0%–40%) for the C&S (2002) method, data swap 10%, data swap 50%, data swap 100%, and data shuffling.]

6. Sensitivity of Data Shuffling to Data Set Characteristics

We conducted extensive simulation experiments to investigate the sensitivity of data shuffling to the number of nonconfidential variables in the data set and the underlying distribution of the confidential variables. Note that data shuffling was implemented using only the values of the nonconfidential variables and the rank-order correlation among the variables. No prior information regarding the confidential variables was assumed, and no attempt was made to estimate the marginal distribution or other characteristics of any of the variables.

In the simulation experiments, the number of nonconfidential variables was varied between 2, 4, and 6. All nonconfidential variables were assumed to be binary key variables and to be independent of one another. The number of confidential variables was specified as two. Several distributions were used to generate the confidential variables (normal, gamma, lognormal, and Weibull). Four levels of correlation (0.0, 0.10, 0.20, and 0.30) were specified for the relationships between the confidential variables and between the nonconfidential and confidential variables. The rank-order correlations between nonconfidential variables were specified as 0.0. Four different sample sizes were used (30, 100, 300, and 1,000). The data were generated using the procedure suggested by Clemen and Reilly (1999). Note that it is not possible to generate data sets for all correlation values for the 4 and 6 categorical variables. For each generated data set, the rank-order correlation between each nonconfidential and confidential variable was computed for the original and masked data. The difference in the rank-order correlations was computed and recorded. The entire experiment was repeated 10,000 times. Because the objective of the procedure was to investigate the sensitivity of data shuffling, only this procedure was implemented.
Table 4 provides select results of the experiment for three correlation levels, two different sample sizes (30 and 100), and one specification for the marginal distribution of the confidential variables (gamma and lognormal). The results in Table 4 show no specific attenuation in the rank-order correlation. The difference between the correlations of the original and masked data is negative in some cases and positive in others. The results also indicate that both the bias and the standard error approach zero as the sample size increases. Table 4 shows that, for a given sample size and correlation, and regardless of the number of nonconfidential variables, the bias and SE are the same. The results of the other experiments also verified this conclusion. A comparison of the results across experiments using different specifications for the marginal distribution of the confidential variables also revealed no differences. This is not surprising because we do not need information regarding the marginal distribution of the variables to perform data shuffling; we need only the ranks of the nonconfidential variables. As with the nonconfidential variables, the bias and SE were inversely related to sample size and independent of the marginal distribution of the confidential variables. Hence, we can conclude that, in terms of maintaining relationships among variables, data shuffling is not sensitive to the number of nonconfidential variables or the underlying distribution of the confidential variables.
Table 4    Sensitivity of Data Shuffling in Maintaining Rank-Order Correlation

Entries are bias with SE in parentheses; "n/a" marks combinations for which data sets could not be generated (see text).

| k | Rank-order correlation between | ρ = 0.0, n = 30 | ρ = 0.0, n = 100 | ρ = 0.2, n = 30 | ρ = 0.2, n = 100 | ρ = 0.3, n = 30 | ρ = 0.3, n = 100 |
|---|---|---|---|---|---|---|---|
| 2 | S1 & X1 | −0.0003 (0.0452) | −0.0001 (0.0222) | 0.0017 (0.0440) | 0.0038 (0.0215) | 0.0041 (0.0420) | 0.0064 (0.0205) |
| 2 | S1 & X2 | −0.0001 (0.0452) | 0.0001 (0.0222) | 0.0019 (0.0439) | 0.0035 (0.0217) | 0.0045 (0.0421) | 0.0065 (0.0204) |
| 2 | S2 & X1 | 0.0001 (0.0454) | −0.0001 (0.0226) | 0.0024 (0.0449) | 0.0031 (0.0215) | 0.0039 (0.0424) | 0.0066 (0.0206) |
| 2 | S2 & X2 | 0.0005 (0.0453) | 0.0003 (0.0225) | 0.0028 (0.0442) | 0.0034 (0.0212) | 0.0039 (0.0418) | 0.0060 (0.0202) |
| 2 | X1 & X2 | −0.0002 (0.0635) | −0.0001 (0.0312) | −0.0041 (0.0635) | −0.0007 (0.0316) | −0.0044 (0.0629) | −0.0003 (0.0303) |
| 4 | S1 & X1 | 0.0001 (0.0458) | −0.0003 (0.0228) | 0.0035 (0.0440) | 0.0036 (0.0216) | n/a | n/a |
| 4 | S1 & X2 | −0.0002 (0.0454) | −0.0002 (0.0224) | 0.0026 (0.0438) | 0.0037 (0.0218) | n/a | n/a |
| 4 | S2 & X1 | 0.0000 (0.0464) | 0.0002 (0.0229) | 0.0017 (0.0438) | 0.0038 (0.0218) | n/a | n/a |
| 4 | S2 & X2 | 0.0003 (0.0458) | 0.0003 (0.0227) | 0.0015 (0.0439) | 0.0038 (0.0217) | n/a | n/a |
| 4 | S3 & X1 | −0.0001 (0.0455) | 0.0000 (0.0223) | 0.0031 (0.0433) | 0.0034 (0.0215) | n/a | n/a |
| 4 | S3 & X2 | 0.0000 (0.0454) | −0.0004 (0.0226) | 0.0032 (0.0435) | 0.0036 (0.0217) | n/a | n/a |
| 4 | S4 & X1 | 0.0003 (0.0460) | 0.0001 (0.0226) | 0.0022 (0.0432) | 0.0039 (0.0217) | n/a | n/a |
| 4 | S4 & X2 | 0.0007 (0.0456) | −0.0001 (0.0226) | 0.0018 (0.0442) | 0.0038 (0.0219) | n/a | n/a |
| 4 | X1 & X2 | −0.0005 (0.0633) | 0.0000 (0.0312) | −0.0022 (0.0629) | −0.0001 (0.0307) | n/a | n/a |
| 6 | S1 & X1 | 0.0007 (0.0455) | −0.0001 (0.0226) | n/a | n/a | n/a | n/a |
| 6 | S1 & X2 | −0.0004 (0.0456) | 0.0000 (0.0223) | n/a | n/a | n/a | n/a |
| 6 | S2 & X1 | 0.0002 (0.0459) | −0.0001 (0.0224) | n/a | n/a | n/a | n/a |
| 6 | S2 & X2 | 0.0002 (0.0458) | −0.0001 (0.0222) | n/a | n/a | n/a | n/a |
| 6 | S3 & X1 | −0.0001 (0.0454) | 0.0000 (0.0224) | n/a | n/a | n/a | n/a |
| 6 | S3 & X2 | −0.0006 (0.0456) | −0.0003 (0.0222) | n/a | n/a | n/a | n/a |
| 6 | S4 & X1 | 0.0003 (0.0453) | 0.0000 (0.0220) | n/a | n/a | n/a | n/a |
| 6 | S4 & X2 | 0.0001 (0.0454) | −0.0001 (0.0227) | n/a | n/a | n/a | n/a |
| 6 | S5 & X1 | −0.0006 (0.0452) | 0.0000 (0.0227) | n/a | n/a | n/a | n/a |
| 6 | S5 & X2 | 0.0003 (0.0453) | −0.0001 (0.0224) | n/a | n/a | n/a | n/a |
| 6 | S6 & X1 | −0.0004 (0.0452) | 0.0001 (0.0226) | n/a | n/a | n/a | n/a |
| 6 | S6 & X2 | 0.0003 (0.0454) | 0.0003 (0.0224) | n/a | n/a | n/a | n/a |
| 6 | X1 & X2 | 0.0007 (0.0627) | 0.0000 (0.0314) | n/a | n/a | n/a | n/a |
7. Application in a Database Context
In this section, we illustrate the data shuffling procedure for a large data set such as, perhaps,
data stored in data warehouses. The example consists of six variables (three nonconfidential and
three confidential variables). The three nonconfidential variables represent Gender (Male = 0, Female = 1),
Marital Status (Married = 0, Other = 1), and an ordinal discrete variable Age with integer values between
1 and 6 (in increasing order of Age). The confidential
variables consist of three continuous variables representing Home Value (Lognormal), Mortgage Balance (Gamma), and Total Net Asset Value (Normal).
A data set with 50,000 observations was created using the procedure suggested by Clemen and Reilly (1999) for generating a multivariate data set with nonnormal marginal distributions.⁴ Data shuffling was performed using only the values of the nonconfidential variables and the rank-order correlation (and without any knowledge of the underlying characteristics of the population from which the data was generated). No attempt was made to estimate the characteristics of any of the variables.

⁴ The entire data set can be found at http://gatton.uky.edu/faculty/muralidhar/MSShuffle.xls.
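For scale, a data set of this shape can be generated and shuffled with the data_shuffle sketch from §4.1. The generating distributions below are independent placeholders rather than the Clemen and Reilly (1999) procedure, so this sketch will not reproduce the correlations in Table 5.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50_000
S = np.column_stack([
    rng.integers(0, 2, N),        # Gender (Male = 0, Female = 1)
    rng.integers(0, 2, N),        # Marital status (Married = 0, Other = 1)
    rng.integers(1, 7, N),        # Age category, 1-6
]).astype(float)
X = np.column_stack([
    rng.lognormal(12.0, 0.5, N),  # Home value (lognormal)
    rng.gamma(2.0, 5e4, N),       # Mortgage balance (gamma)
    rng.normal(2e5, 7e4, N),      # Total net asset value (normal)
])
Y = data_shuffle(X, S, rng=rng)   # shuffled confidential variables
```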
Table 5 provides the rank-order correlations between the variables pre- and postmasking.

Table 5    Rank-Order Correlations for Original and Masked Data in Example Database

| Correlation between | Original | Shuffled |
|---|---|---|
| Home value and Gender | −0.00373 | −0.00889 |
| Home value and Marital status | −0.00187 | −0.00025 |
| Home value and Age | 0.57146 | 0.58779 |
| Home value and Mortgage balance | 0.58229 | 0.58008 |
| Home value and Total net value of assets | 0.68129 | 0.67958 |
| Mortgage balance and Gender | −0.00409 | −0.00881 |
| Mortgage balance and Marital status | −0.00093 | 0.00210 |
| Mortgage balance and Age | 0.28334 | 0.28591 |
| Mortgage balance and Total net value of assets | 0.78156 | 0.78105 |
| Total net value of assets and Gender | −0.00510 | −0.00554 |
| Total net value of assets and Marital status | 0.07426 | 0.08377 |
| Total net value of assets and Age | 0.37367 | 0.37829 |

In these cases, the results clearly show that the shuffling procedure maintains rank-order correlations that are very close to
the original values. We also know from the results in
the previous section that the bias is likely to be close
to zero and the standard error is likely to be very
small. Considering that all rank-order correlations are
maintained, analyses performed on this data should
yield results that are very similar to those using the
original data. Users often issue ad hoc queries to organizational databases. To envision every type of query
that might be issued would be impossible. However,
the marginal characteristics of the variables remain
unmodified, and the results of all experiments consistently show that data shuffling maintains relationships among variables that are the same as before
masking. Hence we believe that responses to ad hoc
queries using the masked data would yield results
that are similar to the results that would be obtained
using the original data.
8. Limitations and Future Research
The data shuffling approach is limited mainly by two
factors that were discussed in detail in §2; namely,
the assumption regarding the joint distribution of the
variables and the performance of the procedure for
small data sets. The results in this study show that
when we consider data utility and disclosure risk
simultaneously, the data shuffling procedure outperforms existing swapping procedures. In practice, the
variables may have complex nonmonotonic relationships that data shuffling would not preserve. In addition, given the nature of data shuffling (or, for that matter, all masking approaches), it will be difficult to implement the procedure on a constantly changing database. Hence we envision implementing the
procedure on historical data sets such as those stored
in data warehouses rather than on “transactional”
databases that involve frequent changes. In such situations, alternative approaches such as the confidentiality via camouflage procedure (Gopal et al. 2002)
may be considered.
Another important issue worthy of further investigation is the added variability that results from the
data shuffling procedure and its impact on statistical inference using the masked data. Three studies have directly addressed this issue (Rubin 1993,
Fienberg et al. 1998a, Burridge 2003). The data shuffling approach proposed in this study can possibly be
modified according to the procedures used in those
studies.
An interesting related line of research is to formulate the entire data shuffling procedure as a mathematical programming problem. This problem is rather complex, involving a large number of (0, 1) variables and nonlinear constraints. More important, this problem is also NP-complete (Reiss et al. 1982). However, for small data sets, the optimization approach may preserve sufficient statistics and yield better solutions than the data shuffling procedure. For larger
problems, the shuffled data may provide a good initial solution to the optimization problem. The problem also has some unique structural aspects that
may benefit from a closer evaluation by optimization
experts.
9. Conclusions
In this study, we have developed a new shuffling procedure for masking
confidential numerical data, a procedure based on the conditional
distribution approach. The advantages of this approach can be summarized as
follows:
(1) The released data consist of the original values
of the confidential variables (i.e., the marginal distribution is maintained exactly).
(2) All pairwise monotonic relationships among the
variables in the released data are the same as those in
the original data.
(3) Providing access to the masked microdata does
not increase the risk of disclosure.
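To make these properties concrete, the following single-variable sketch
(ours; rank-based normal scores stand in for the paper's copula machinery,
and the conditional model is deliberately simple) implements the basic
reverse-mapping idea: generate values from an approximate conditional
distribution of the confidential variable given a nonconfidential one, then
release the original values in the rank order of the generated values.

import numpy as np
from scipy.stats import norm, rankdata

def shuffle_one(x, s, rng):
    # Rank-based normal scores for the confidential and nonconfidential variables.
    n = len(x)
    xs = norm.ppf(rankdata(x) / (n + 1))
    ss = norm.ppf(rankdata(s) / (n + 1))
    # Dependence parameter: correlation of the normal scores.
    rho = np.corrcoef(xs, ss)[0, 1]
    # Generate from the conditional normal distribution of x-scores given s-scores.
    y = rho * ss + rng.normal(0.0, np.sqrt(1.0 - rho ** 2), n)
    # Reverse map: the record with the r-th smallest y receives the r-th
    # smallest original value of x, so only original values are released.
    order = rankdata(y, method="ordinal").astype(int) - 1
    return np.sort(x)[order]

rng = np.random.default_rng(1)
s = rng.normal(size=5_000)                  # nonconfidential variable
x = np.exp(s + rng.normal(size=5_000))      # skewed confidential variable
x_shuffled = shuffle_one(x, s, rng)

assert np.array_equal(np.sort(x), np.sort(x_shuffled))  # advantage (1), exactly

In this sketch, advantage (1) holds by construction, and the Spearman
correlation between x_shuffled and s closely tracks that between x and s,
which is the mechanism behind advantage (2); advantage (3) rests on the fact
that the generated values, not the original ones, determine each record's
position in the release.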
Existing methods of data swapping do not provide
these benefits. In addition, by using the conditional
distribution in generating the masked values, data
shuffling allows data swapping techniques to perform on par with perturbation techniques. The issue
of the added variability and its impact on inferences
obtained from the masked data remains unanswered
and is worthy of further investigation.
Selecting the appropriate method of masking depends on the users' needs. If
the user requires that the original values of the confidential variables
remain unmodified (i.e., that the univariate characteristics of the masked
and original data be identical), existing methods fail to provide the data
utility and disclosure risk characteristics available with perturbation
methods. By combining the strengths of the most recent methods of data
perturbation (copula-based perturbation) and data swapping (the Carlson and
Salabasis 2002 method), the data shuffling procedure described in this study
fills this gap, providing data utility and disclosure risk comparable to
perturbation methods.
Finally, as calls for preserving privacy and confidentiality increase, the
need for masking techniques grows rapidly. Data swapping techniques appeal
intuitively because they do not modify original values. Data swapping is
also easier to explain to the
common user than other techniques. Potentially, these
characteristics can make data swapping the masking
technique of choice. Among data swapping methods, only data shuffling provides the same data utility and disclosure risks as other advanced masking
techniques, particularly for large data sets where the
asymptotic requirements are likely to be satisfied.
Using data shuffling in these situations would not
only provide optimal results, but would also foster
greater user acceptance of masked data for analysis.
Acknowledgments
The authors thank the reviewers and editors for their
valuable comments and suggestions. They also thank
Dr. William Winkler of the U.S. Census Bureau for providing them with the record linkage software.
References
Adam, N. R., J. C. Wortmann. 1989. Security-control methods for
statistical databases: A comparative study. ACM Comput. Surveys 21 515–556.
Burridge, J. 2003. Information preserving statistical obfuscation.
Statist. Comput. 13 321–327.
Carlson, M., M. Salabasis. 2002. A data swapping technique for
generating synthetic samples: A method for disclosure control.
Res. Official Statist. 6 35–64.
Clemen, R. T., T. Reilly. 1999. Correlations and copulas for decision
and risk analysis. Management Sci. 45 208–224.
Dalenius, T. 1977. Towards a methodology for statistical disclosure
control. Statistisk Tidskrift 5 429–444.
Dalenius, T., S. P. Reiss. 1982. Data-swapping: A technique for disclosure control. J. Statist. Planning Inference 6 73–85.
Dandekar, R. A., M. Cohen, N. Kirkendall. 2002. Sensitive microdata protection using Latin hypercube sampling technique.
J. Domingo-Ferrer, ed. Inference Control in Statistical Databases.
Springer-Verlag, New York.
Duncan, G. T., D. Lambert. 1986. Disclosure-limited data dissemination. J. Amer. Statist. Assoc. 81 10–18.
Duncan, G. T., R. W. Pearson. 1991. Enhancing access to microdata while protecting confidentiality: Prospects for the future.
Statist. Sci. 6 219–239.
Duncan, G. T., S. A. Keller-McNulty, S. L. Stokes. 2001. Disclosure risk vs. data utility: The R-U confidentiality map. Technical report LA-UR-01-6428, Los Alamos National Laboratory,
Los Alamos, NM.
Fienberg, S. E. 2002. Comment on paper by Carlson and Salabasis:
“A data swapping technique for generating synthetic samples:
A method for disclosure control.” Res. Official Statist. 6 65–67.
Fienberg, S. E., J. McIntyre. 2005. Data swapping: Variations on a
theme by Dalenius and Reiss. J. Official Statist. 21 309–323.
Fienberg, S. E., U. E. Makov, A. P. Sanil. 1997. A Bayesian approach
to data disclosure: Optimal intruder behavior for continuous
data. J. Official Statist. 13 75–89.
Fienberg, S. E., U. E. Makov, A. P. Steele. 1998a. Disclosure limitation using perturbation and related methods for categorical
data. J. Official Statist. 14 485–502.
Fienberg, S. E., U. E. Makov, A. P. Steele. 1998b. Rejoinder. J. Official
Statist. 14 509–511.
Fuller, W. A. 1993. Masking procedures for microdata disclosure
limitation. J. Official Statist. 9 383–406.
Gopal, R., R. Garfinkel, P. Goes. 2002. Confidentiality via camouflage: The CVC approach to disclosure limitation when answering queries to databases. Oper. Res. 50 501–516.
Kooiman, P. 1998. Comment. J. Official Statist. 14 503–508.
Liew, C. K., U. J. Choi, C. J. Liew. 1985. A data distortion by probability distribution. ACM Trans. Database Systems 10 395–411.
Little, R. J. A. 1993. Statistical analysis of masked data. J. Official
Statist. 9 407–426.
Moore, R. A. 1996. Controlled data swapping for masking public
use microdata sets. Research report series no. RR96/04, U.S.
Census Bureau, Statistical Research Division, Washington, D.C.
Muralidhar, K., R. Sarathy. 2003a. A theoretical basis for perturbation methods. Statist. Comput. 13 329–335.
Muralidhar, K., R. Sarathy. 2003b. A rejoinder to the comments by
Polettini and Stander. Statist. Comput. 13 339–342.
Muralidhar, K., R. Parsa, R. Sarathy. 1999. A general additive data
perturbation method for database security. Management Sci. 45
1399–1415.
Muralidhar, K., R. Sarathy, R. Parsa. 2001. An improved security requirement for data perturbation with implications for
e-commerce. Decision Sci. 32 683–698.
Nelsen, R. B. 1999. An Introduction to Copulas. Springer, New York.
Raghunathan, T. E., J. P. Reiter, D. B. Rubin. 2003. Multiple imputation for statistical disclosure limitation. J. Official Statist. 19 1–16.
Reiss, S. P., M. J. Post, T. Dalenius. 1982. Non-reversible privacy
transformations. Proc. ACM Sympos. Principles Database Systems,
Los Angeles, CA, 139–146.
Rubin, D. B. 1993. Discussion of statistical disclosure limitation.
J. Official Statist. 9 461–468.
Sarathy, R., K. Muralidhar, R. Parsa. 2002. Perturbing non-normal
confidential variables: The copula approach. Management Sci.
48 1613–1627.
Wall Street Journal. 2001. Bureau blurs data to keep names confidential. (February 14) B1–B2.
Willenborg, L., T. de Waal. 2001. Elements of Statistical Disclosure
Control. Springer, New York.
Winkler, W. E. 1995a. Advanced methods for record linkage. Proc.
Survey Res. Methods Section, American Statistical Association,
Alexandria, VA.
Winkler, W. E. 1995b. Matching and record linkage. B. G. Cox,
ed. Business Survey Methods. John Wiley and Sons, New York,
355–384.
Winkler, W. E. 1998. Producing public-user microdata that are analytically valid and confidential. Statistical research report series
no. RR98/02, U.S. Census Bureau, Statistical Research Division,
Washington, D.C.