Presentation - Gatton College of Business and Economics

A PRIMER ON DATA MASKING
TECHNIQUES FOR NUMERICAL
DATA
Krish Muralidhar
Gatton College of Business & Economics
My Co-author

I would first like to acknowledge that most of my
work in this area is with my co-author Dr. Rathindra
Sarathy at Oklahoma State University
Introduction

Data masking deals with techniques that can be
used in situations where data sets consisting of
sensitive (confidential) information are “masked”.
The masked data retains its usefulness without
compromising privacy and/or confidentiality. The
masked data can be analyzed, shared, or
disseminated without risk of disclosure.
A Simple Example
Original Data
Masked Data
Objectives of Data Masking


Minimize risk of disclosure resulting from providing
access to the data
Maximize the analytical usefulness of the data
What this talk is not about …



We are talking about protecting data that is made
available to users, shared with others, or disseminated
to the general public
We are not dealing with unauthorized access to the
data
Encryption is not a solution

We cannot perform analysis on encrypted data

There are a few exceptions
To perform analysis on the data, it must be decrypted
 Decrypted data offers no protection

What this talk is not about …

Since we have the data set, we know the characteristics of the
data set. We are trying to create a new data set that
essentially contains the same characteristics as the original
data set. We are not trying to discern the characteristics of the
original data set using the information in the masked data.



In other words, I will not be talking about Agrawal and Srikant
Or about the “distributed data” situation
In addition, since most of you are probably familiar with the
CS literature on this area, I will focus on the literature in the
“statistical disclosure limitation” area
Purpose of Dissemination

It is assumed that the data will be used primarily
for analysis at the aggregate level using statistical
or other analytical techniques
The data will not be accurate at the individual record level
Aggregate versus Micro Data




The organization that owns the data could potentially
release aggregate information about the
characteristics of the data set
The users can still perform some types of analyses
using the aggregate data, but this limits their ability
to perform ad hoc analysis
Releasing the microdata provides the users with the
flexibility to perform any type of analysis
In this talk, we assume that the intent is to release
microdata
Other Protection Measures



Restricted access
Query restrictions
Other methods
The Data

Typically, the data is historical and consists of
Categorical variables (or attributes)
 Numerical variables




Discrete variables
Continuous variables
In cases where identity is not to be revealed, key
identification variables will be removed from the
data set (de-identified)
De-identification does not necessarily prevent
Re-identification

The common misconception is that, in order to prevent
disclosure, all that is required is to remove “key
identifiers”. However, even if the “key identifiers” are
removed, in many cases it would be easy to identify
an individual using external data sources
Latanya Sweeney’s work on k-anonymity
The availability of numerical data makes it easy to re-identify records through record linkage

Data Masking

Since de-identification alone does not prevent
disclosure, it is necessary to “mask” the original
data so that an intruder, even using external sources
of data, cannot
Identify a particular released record as belonging to a
particular individual
 Estimate the value of a confidential variable for a
particular record accurately
Who is an intruder?



Every user is potentially an
intruder
Since microdata is
released, we cannot
prevent the user from
performing any type of
analysis on the released
data
Must account for disclosure
risk from any and all types
of analyses

Worst case scenario
The focus of our research


Data masking techniques are used to mask all types
of data (categorical, discrete numerical, and
continuous numerical)
The focus of our research, and of this talk, is data
masking for continuous numerical data
The Data Release Process


Identify the data set to be released and the sensitive variables
in the data
Release all aggregate information regarding the data




Release non-sensitive data


Characteristics of individual variables
Relationship measures
Any other relevant information
Since my focus is on numerical microdata, I will assume that all
categorical and discrete data are either released unmasked or are
masked prior to release
Release masked numerical microdata
Characteristics of a good masking
technique



Minimize disclosure risk (or maximize security)
Minimize information loss (or maximize data utility)
Other characteristics
Must be easy to use
The user must be able to analyze the masked data exactly
as he/she would the original data
Must be easy to implement
Disclosure Risk

Dalenius defines disclosure as having occurred if,
using the released data, an intruder is able to
identify an individual or estimate the value of a
confidential variable with a greater level of
accuracy than was possible prior to such data
release
Minimum Disclosure Risk

A data masking technique minimizes disclosure risk, IFF, the
release of the masked microdata does not allow an intruder to
gain additional information about an individual record over
and above what was already available (from the release of
aggregate information, the non-confidential variables, and the
masked categorical variables)


Does not mean that the disclosure risk from the entire data release
process is minimum; only that the disclosure risk from releasing microdata
is minimized
Can be achieved in practice
Practical Measure of Disclosure Risk

Identity Disclosure
Re-identification rate
Value disclosure
Variability in the confidential attribute explained by
the masked data
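Both measures can be computed directly once original and masked values are available. The sketch below is one possible reading, not necessarily the measure used in the talk: identity disclosure is approximated by nearest-value record linkage on a single variable, and value disclosure by the squared product moment correlation between original and masked values. The function names and the synthetic data are assumptions made for illustration.

```python
import random

def reidentification_rate(original, masked):
    """Share of masked records whose closest original value (by absolute
    distance) is the record they were actually derived from."""
    hits = 0
    for j, y in enumerate(masked):
        nearest = min(range(len(original)), key=lambda i: abs(original[i] - y))
        hits += nearest == j
    return hits / len(masked)

def explained_variance(original, masked):
    """Squared product moment correlation between original and masked values."""
    n = len(original)
    mx, my = sum(original) / n, sum(masked) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(original, masked))
    sxx = sum((a - mx) ** 2 for a in original)
    syy = sum((b - my) ** 2 for b in masked)
    return sxy * sxy / (sxx * syy)

rng = random.Random(1)
# Hypothetical confidential variable and two masked copies of it.
x = [rng.gauss(100.0, 15.0) for _ in range(400)]
light = [v + rng.gauss(0.0, 0.01) for v in x]   # almost no perturbation
heavy = [v + rng.gauss(0.0, 30.0) for v in x]   # heavy perturbation

for name, y in (("light", light), ("heavy", heavy)):
    print(name, reidentification_rate(x, y), round(explained_variance(x, y), 3))
```

A masked file that scores high on either measure is offering little protection, whatever its analytical quality.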
Minimum Information Loss


Information loss is minimized IFF, for any arbitrary
analysis (or query), the response from the masked
data is exactly the same as that from the original
data
Impossible to achieve in practice
Since an arbitrary analysis may involve a single record,
the only way to achieve this objective is to release
unmasked data
Information Loss … continued


In practice, we attempt to minimize information loss by
maintaining the characteristics of the masked data to
be the same as that of the original data
From a statistical perspective, we attempt to maintain
the masked data to be “similar to” the original data
so that responses to analyses using the masked data
will be approximately the same as that using the
original data

Maintain the distribution of the masked data to be the
same as the original data
Some Practical Measures of Information
Loss

Ability to maintain
The marginal distribution
 Relationships between variables
 Linear
 Monotonic
 Non-monotonic
Simple Masking Approaches




Noise addition
Micro-aggregation
Data swapping
Other similar approaches
Any approach in which the masked value yij (the
masked value for the jth variable of the ith record) is
generated as a function of xi.
An Illustrative Example


A data set consisting
of 2 categorical
variables, 1 discrete,
and 3 (confidential)
numerical variables
50000 records
Marginal Distribution

Home value and
Mortgage balance
have heavily skewed
distributions
Relationships

Relationships
are not
necessarily
linear
Measured by both product moment and rank order
correlation
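The reason for reporting both measures shows up on any monotonic but non-linear relationship. A minimal sketch on synthetic data (the variables below are illustrative, not the data set from the talk): rank order correlation sees a perfect relationship, while product moment correlation understates it.

```python
import math
import random

def pearson(u, v):
    """Product moment correlation."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def ranks(v):
    """Rank of each element (1 = smallest); assumes no ties."""
    order = sorted(range(len(v)), key=v.__getitem__)
    r = [0] * len(v)
    for position, index in enumerate(order, start=1):
        r[index] = position
    return r

def spearman(u, v):
    """Rank order correlation: product moment correlation of the ranks."""
    return pearson(ranks(u), ranks(v))

rng = random.Random(2)
s = [rng.gauss(0.0, 1.0) for _ in range(5000)]
x = [math.exp(a) for a in s]          # monotonic but strongly non-linear in s

print(f"product moment: {pearson(s, x):.3f}")
print(f"rank order:     {spearman(s, x):.3f}")
```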
Relationship Measures
Simple Noise Addition

The most rudimentary method of data masking. Add
random noise to every confidential value of the
form
yi = xi + ei
 Typically e ~ Normal(0, d*Var(Xi))
 The selection of d specifies the level of noise. Large d
indicates higher level of masking
 The variance is changed resulting in biased estimates
 Many variations exist
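A minimal sketch of this scheme, assuming a single confidential variable and using a synthetic, strictly positive, skewed sample in place of real data (the function name `add_noise` is hypothetical). It shows the two side effects discussed on the following slides: the variance is inflated by a factor of roughly 1 + d, and negative values appear that did not exist in the original data.

```python
import random
import statistics

def add_noise(values, d, rng):
    """Return y_i = x_i + e_i with e ~ Normal(0, d * Var(X))."""
    noise_sd = (d * statistics.pvariance(values)) ** 0.5
    return [x + rng.gauss(0.0, noise_sd) for x in values]

rng = random.Random(0)
# Hypothetical skewed, strictly positive data standing in for a confidential variable.
x = [rng.lognormvariate(0.0, 1.0) for _ in range(20000)]
y = add_noise(x, d=0.5, rng=rng)

var_ratio = statistics.pvariance(y) / statistics.pvariance(x)
print(f"Var(Y) / Var(X) = {var_ratio:.3f}")   # theory: about 1 + d
print(f"negative masked values: {sum(v < 0 for v in y)}")
```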
Problems with Noise Addition


The addition of noise results in an increase in
variance
This can be addressed easily, but there are other
issues that cannot be, such as
The marginal distribution is modified
 All Relationships are attenuated
Results for Noise Addition
(Noise level = 10%)
Mortgage Balance versus Asset balance (Noise Added)
Relationship –
Product Moment
Looks good …

Everything looks good
Bias is small
 Relationships seem to be maintained

So what is the problem?
The problem is security
 Since very little noise is added, there is very little
protection afforded to the records
High Disclosure Risk
The correlation between the original and masked values is of the order
of 0.99. The masked values themselves are excellent predictors of the
original value. Little or no “masking” is involved.
Improved Predictive Ability
Disclosure Risk versus Information Loss


Adding very little noise (10% of the variance of the
individual variable) results in low information loss,
but also results in high disclosure risk
In order to decrease disclosure risk, it would be
necessary to increase the noise (say 50% of the
variance), but that would result in higher information
loss
Results for Noise Addition
(Noise level = 50%)

Mortgage Balance versus Asset balance (Noise Added)
At first glance, it does
not seem too bad, but
on closer observation,
we notice that there
are lots of negative
values that did not
exist in the original
data
Negative values can be addressed
Correlation
There is a considerable difference between the original and masked data.
The correlations are considerably lower.
Marginal Distribution of
Home Value


The marginal
distribution is
completely modified
This is an unavoidable
consequence of any
noise “addition”
procedure
Summary


In summary, noise addition is a rudimentary
procedure that is easy to implement and easy to
explain. There is always a trade-off between
disclosure risk and information loss. If the disclosure
risk is low (high) then the corresponding information
loss is high (low).
Unfortunately, this is an inherent characteristic of all
noise based methods of the form Y = f(X,e) whether
the noise is additive or multiplicative or some other
form
Sufficiency based Noise Addition

Recently, we have developed a new technique that is
similar to noise addition, but maintains the mean
vector and covariance matrix of the masked data to
be the same as the original data

Offers the same characteristics as noise addition, but assures
that results for traditional statistical analyses using the
masked data will be the same as the original data
Sufficiency Based Noise Addition

Model:
yi = γ + αxi + βsi + εi

The only parameter that must be selected is the “proximity
parameter” α. All other parameters are dictated by
the selection of this parameter
The Proximity Parameter

The parameter α (0 ≤ α ≤ 1) dictates the strength
of the relationship between X and Y.
When α = 1, Y = X.
 When α = 0, the perturbed variable is generated
independent of X (the GADP model to be discussed
later)
 We provide the ability to specify α to achieve any
degree of proximity between these two extremes
Other Model Parameters

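γ = (1 – α) X̄ – β S̄   (X̄ and S̄ are the means of X and S)

β = (1 – α)(σXS / σSS)

ε ~ Normal(0, (1 – α²)(σXX – (σXS)² / σSS))
where σXX and σSS are the variances of X and S, and σXS is their covariance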


Can be generated from other distributions
ε orthogonal to X and S
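A short check of why parameters of this form leave the mean of Y, the variance of Y, and the covariance of Y with S equal to those of X. It assumes a single confidential variable X, a single non-confidential variable S, and noise ε with mean zero that is uncorrelated with both. Here σ_XX and σ_SS denote the variances of X and S, and σ_XS their covariance:

```latex
\begin{aligned}
E[Y] &= \gamma + \alpha\bar{X} + \beta\bar{S}
      = (1-\alpha)\bar{X} - \beta\bar{S} + \alpha\bar{X} + \beta\bar{S} = \bar{X},\\
\operatorname{Cov}(Y,S) &= \alpha\sigma_{XS} + \beta\sigma_{SS}
      = \alpha\sigma_{XS} + (1-\alpha)\sigma_{XS} = \sigma_{XS},\\
\operatorname{Var}(Y) &= \alpha^{2}\sigma_{XX} + 2\alpha\beta\sigma_{XS}
      + \beta^{2}\sigma_{SS} + \sigma_{\varepsilon\varepsilon}
      = \alpha^{2}\sigma_{XX} + (1-\alpha^{2})\frac{\sigma_{XS}^{2}}{\sigma_{SS}}
      + \sigma_{\varepsilon\varepsilon}.
\end{aligned}
```

Requiring Var(Y) = σ_XX then gives σ_εε = (1 − α²)(σ_XX − σ_XS²/σ_SS). This is zero at α = 1 (so Y = X) and equals the residual variance of the regression of X on S at α = 0.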
Note that …

In order to maintain sufficient statistics, it is
NECESSARY that the model for generating the
perturbed values MUST be specified in this manner
Disclosure Risk


There is a direct correspondence between the
proximity parameter α and the level of noise added
in the simple noise addition approach. This
procedure will result in incremental disclosure risk
except when α = 0
The level of noise added is approximately equal to
(1 – α²)
Information Loss

Information loss characteristics of the sufficiency
based approach are exactly the same as those of the
simple noise addition approach with one major
difference. Results of statistical analyses for which
the mean vector and covariance matrix are
sufficient statistics will be exactly the same using the
masked data as they are using the original data.
Results of Regression to predict Net Assets
using all other variables
Simple versus Sufficiency Based
Noise Addition

If noise addition will be used to mask the data, we
should always use sufficiency based noise addition
(and never simple noise addition). It provides all the
same characteristics of simple noise addition with
one major advantage that, for many traditional
statistical analyses, it provides the guarantee that
the masked data will yield the same results as the
original data.
Microaggregation

Replace the values of the variables for a set of k records in
close proximity with the average value of those k records

Many different methods of determining close proximity



Univariate microaggregation where each variable is aggregated individually
Multivariate microaggregation where the values of all the confidential variables
for a given set of records are aggregated
Results in variance reduction and attenuation of covariance



All relationships are modified … some correlations are higher, others are
lower
Poor security even for relatively large k
Consistent with the idea of “k anonymity” since at least k records in the
data set will have the same values
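A minimal sketch of the univariate case, assuming fixed-size groups formed from the sorted values (one of the many possible ways of defining proximity; the function name `microaggregate` and the synthetic data are hypothetical). It illustrates the variance reduction and the fact that every released value is shared by at least k records.

```python
import random
import statistics
from collections import Counter

def microaggregate(values, k):
    """Univariate microaggregation: sort, cut into groups of k consecutive
    records, and replace each value by its group mean. A short final group
    is merged into the previous one so every group has at least k records."""
    n = len(values)
    order = sorted(range(n), key=values.__getitem__)
    masked = [0.0] * n
    n_groups = max(1, n // k)
    for g in range(n_groups):
        lo = g * k
        hi = n if g == n_groups - 1 else lo + k
        group = order[lo:hi]
        group_mean = sum(values[i] for i in group) / len(group)
        for i in group:
            masked[i] = group_mean
    return masked

rng = random.Random(3)
x = [rng.lognormvariate(0.0, 1.0) for _ in range(1003)]
for k in (5, 100):
    y = microaggregate(x, k)
    ratio = statistics.pvariance(y) / statistics.pvariance(x)
    print(f"k={k}: Var(Y)/Var(X) = {ratio:.3f}, "
          f"smallest group = {min(Counter(y).values())}")
```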
Univariate MA (k = 5) Example
Good information loss characteristics but poor disclosure risk characteristics
Univariate MA (k = 100) Example
Worse information loss characteristics but better disclosure risk characteristics

Bill Winkler at the Census
Bureau has shown that the
risk of identity disclosure is
very high even with large k
Rank Based Data Swapping

Swap values of
variables within a
specified proximity


When the swapped values
are in close proximity, it
results in low information loss
but high disclosure risk and
vice versa
The proximity is usually
specified by the rank of the
record

The advantage of data
swapping is that it does not
change (or perturb) the
values; the original values
are used


The marginal distribution of
the masked data is exactly
the same as the original
Unfortunately, it results in
high information loss and
offers poor disclosure risk
characteristics
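A simplified sketch of rank based swapping for one variable, assuming each value is exchanged with a randomly chosen partner at most `window` ranks above it. This is an illustration of the idea only, not the specific algorithm in the swapping literature; `rank_swap` and the synthetic data are hypothetical. The set of released values is exactly the original set, which is why the marginal distribution is unchanged.

```python
import random

def rank_swap(values, window, rng):
    """Swap each value with a randomly chosen, not yet swapped value whose
    rank is at most `window` positions higher. Records left without an
    available partner keep their original value."""
    n = len(values)
    order = sorted(range(n), key=values.__getitem__)   # order[r] = index of rank r
    masked = list(values)
    used = [False] * n
    for r in range(n):
        if used[r]:
            continue
        upper = min(r + window, n - 1)
        partners = [q for q in range(r + 1, upper + 1) if not used[q]]
        if not partners:
            continue
        q = rng.choice(partners)
        i, j = order[r], order[q]
        masked[i], masked[j] = values[j], values[i]
        used[r] = used[q] = True
    return masked

rng = random.Random(4)
x = [rng.lognormvariate(0.0, 1.0) for _ in range(2000)]
y = rank_swap(x, window=20, rng=rng)      # 20 ranks = 1% of 2000 records
moved = sum(a != b for a, b in zip(x, y))
print(f"{moved} of {len(x)} values moved; same values released: {sorted(x) == sorted(y)}")
```

Widening `window` moves values further from their original records, which lowers disclosure risk but weakens the relationships with the other variables.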
Data Swapping
(Rank Proximity = 0.2% or the closest 100 records)


Information loss is low
Unfortunately disclosure risk is
very high

The correlation between original
and masked net asset value is
0.999
Data Swapping
(Rank Proximity = 10% or the closest 5000 records)

Now information loss is
very high, but disclosure
risk is better
The problem with these approaches

There is an inherent problem with all approaches that
generate the perturbed value as a function of the
original value …. Y ~ f(X,e)



These include all noise addition approaches, data swapping,
microaggregation, and any variation of these approaches
Using Dalenius’ definition of disclosure risk, all these
techniques result in disclosure
If we attempt to improve disclosure risk, it will
adversely affect information loss (and vice versa)
What we need …


Is a method that will ensure that the release of the
masked data does not result in any additional
disclosure, but provides characteristics for the
masked data that closely resemble the original data
From a statistical perspective, at least theoretically,
there is a relatively easy solution
Conditional Distribution Approach




Data set consisting of a set of non-confidential variables S and
confidential variables X
Identify the joint distribution f(S,X)
Compute the conditional distribution f(X|S)
Generate the masked values yi using f(X|S = si)


When S is null, simply generate a new data set with the same
characteristics as f(X)
Then the joint distribution of (S and Y) is the same as that of (S
and X)


f(S,Y) = f(S,X)
Little or no information loss since the joint distribution of the original and
masked data are the same
Disclosure Risk of CDA

When the masked data is generated using CDA, it
can be verified that f(X|Y,S,A) = f(X|S,A)
Releasing the masked microdata Y does not provide
any new information to the intruder over and above the
non-confidential variables S and A (the aggregate
information regarding the joint distribution of S and X)
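The argument behind this property can be written in one line. It assumes only that Y is drawn from f(X | S = s) without reference to the record's own X, so that X and Y are conditionally independent given S (the aggregate information A is a further conditioning term on both sides and is omitted here for brevity):

```latex
f(X \mid Y, S) = \frac{f(X, Y \mid S)}{f(Y \mid S)}
               = \frac{f(X \mid S)\, f(Y \mid S)}{f(Y \mid S)}
               = f(X \mid S)
```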
CDA is the answer … but


The CDA approach results in very low information loss
and minimizes disclosure risk and represents a
complete solution to the data masking problem
Unfortunately, in practice
Identifying f(S,X) may be very difficult
 Deriving f(X|S) may be very difficult
 Generating yi using f(X|S) may be very difficult


In practice, it is unlikely that we can use the
conditional distribution approach
Model Based Approaches

Model based approaches for data masking
essentially attempt to model the data set by using an
assumed f*(S,X) for the joint distribution of (S and X),
derive f*(X|S), and generate the masked values from
this distribution
The masked data f(S,Y) will have the joint distribution f*(S,X)
rather than the true joint distribution f(S,X)
 If the data is generated using f*(X|S) then the masking
procedure minimizes disclosure risk since f(X|Y,S,A) =
f(X|S,A)

Disclosure risk example


Assume that we have one non-confidential variable S and one
confidential variable X
Y = (a × S) + e (where e is the noise term)
We will always get better prediction if we attempt to predict X
using S rather than Y (since Y is noisier than S)
Since we have access to both S and Y, and since S would always
provide more information about X than Y, an intelligent intruder will
always prefer to use S to predict X than Y
More importantly, since Y is a function of S and random noise, once
S is used to predict X, including Y will not improve your predictive
ability
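This claim can be checked numerically. The sketch below uses synthetic data with assumed coefficients (nothing here comes from the talk's data set): X depends on S, and Y is generated from S plus fresh noise. The R-squared for predicting X from S and Y together is obtained from the standard identity R²(S,Y) = R²(S) + (1 − R²(S)) · r², where r is the partial correlation of X and Y given S.

```python
import random

def center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def r_squared(target, predictor):
    """R-squared of a simple linear regression of target on predictor."""
    t, p = center(target), center(predictor)
    stp = sum(a * b for a, b in zip(t, p))
    return stp * stp / (sum(a * a for a in t) * sum(b * b for b in p))

def residuals(target, predictor):
    """Residuals of a simple linear regression of target on predictor."""
    t, p = center(target), center(predictor)
    slope = sum(a * b for a, b in zip(t, p)) / sum(b * b for b in p)
    return [a - slope * b for a, b in zip(t, p)]

rng = random.Random(5)
n = 5000
s = [rng.gauss(0.0, 1.0) for _ in range(n)]
x = [2.0 * a + rng.gauss(0.0, 1.0) for a in s]   # confidential, related to S
y = [2.0 * a + rng.gauss(0.0, 1.0) for a in s]   # masked: a*S + independent noise

r2_s = r_squared(x, s)
r2_y = r_squared(x, y)
# Squared partial correlation of X and Y given S.
partial = r_squared(residuals(x, s), residuals(y, s))
r2_sy = r2_s + (1.0 - r2_s) * partial
print(f"S alone: {r2_s:.4f}   Y alone: {r2_y:.4f}   S and Y: {r2_sy:.4f}")
```

In a finite sample the gain from adding Y is not exactly zero, only negligible; it becomes exactly zero when the noise is constructed to be orthogonal to X and S.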
Model Based Masking Methods

Methods that we have developed and I will be
talking about
General additive data perturbation
 Copula based perturbation
 Data shuffling


Other Methods
PRAM
 Multiple imputation
 Skew t perturbation

General Additive Data Perturbation
(GADP)

A linear model based approach. Can maintain the
mean vector and covariance matrix of the masked
data to be exactly the same as the original data



The same as sufficiency based noise addition with
proximity parameter = 0
Ensures that the results of all traditional, parametric
statistical analyses using the masked data are exactly
the same as that using the original data
Ensure that the release of the masked microdata
results in no incremental disclosure
Procedure




From original data estimate the linear regression model X = β0 +
β1S + ε. Let b0 and b1 represent the estimates of β0 and β1 and let
Σee represent the estimate of the covariance of the noise term ε.
Generate a set of noise terms e with mean vector 0 and covariance
matrix (exactly equal to) Σee and also orthogonal to both X and S.
Distribution of e is immaterial although typically MV normal.
Generate yi = b0 + b1Si + ei (i = 1 , 2, …, N)
The mean vector and covariance matrix of (S,Y) is exactly the same
as (S,X)

In the original GADP, these measures were maintained only
asymptotically. Burridge (2003) suggested the methodology for
maintaining these exactly. We modified this further to ensure minimum
disclosure risk (Muralidhar and Sarathy 2005).
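The steps above can be sketched for the simplest case of one confidential variable X and one non-confidential variable S. This is an illustrative reading of the procedure, not the authors' implementation; the helper names and the synthetic data are assumptions. The noise is drawn from a normal distribution and then adjusted so that it has mean zero, is orthogonal to X and S, and has exactly the residual variance of the regression.

```python
import random

def center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cov(u, v):
    return dot(center(u), center(v)) / len(u)

def gadp_mask(x, s, rng):
    """Y = b0 + b1*S + e, where e has mean 0, the residual variance of the
    regression of X on S, and zero sample covariance with both X and S."""
    n = len(x)
    xc, sc = center(x), center(s)
    b1 = dot(xc, sc) / dot(sc, sc)
    b0 = sum(x) / n - b1 * sum(s) / n
    resid_ss = dot(xc, xc) - b1 * dot(xc, sc)      # residual sum of squares
    # Part of S orthogonal to X, so the two directions can be removed in turn.
    k = dot(sc, xc) / dot(xc, xc)
    s_perp = [a - k * b for a, b in zip(sc, xc)]
    e = center([rng.gauss(0.0, 1.0) for _ in range(n)])
    for basis in (xc, s_perp):
        k = dot(e, basis) / dot(basis, basis)
        e = [a - k * b for a, b in zip(e, basis)]
    scale = (resid_ss / dot(e, e)) ** 0.5          # match residual variance exactly
    return [b0 + b1 * si + scale * ei for si, ei in zip(s, e)]

rng = random.Random(6)
s = [rng.gauss(50.0, 10.0) for _ in range(3000)]
x = [0.8 * a + rng.gauss(0.0, 6.0) for a in s]
y = gadp_mask(x, s, rng)

print(f"mean        X={sum(x) / len(x):.6f}  Y={sum(y) / len(y):.6f}")
print(f"variance    X={cov(x, x):.6f}  Y={cov(y, y):.6f}")
print(f"cov with S  X={cov(x, s):.6f}  Y={cov(y, s):.6f}")
```

Because the noise is orthogonal to both X and S, the part of Y not explained by S is uncorrelated with the part of X not explained by S, so Y adds nothing to a prediction of X that already uses S.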
Minimum Disclosure Risk

GADP results in minimizing disclosure risk. We can
show that an intruder would get the “best estimate”
of the confidential values using just the non-confidential variables. The masked variables
provide no additional information.
Disclosure Risk
Predict original Home value using the masked data
Even if you …


Had say 90% of the entire data set, you would not be able to
predict the value of the confidential variables for the remaining
10% with any greater accuracy than you would using only the non-confidential data
Had 100% of all confidential variables except one AND 90% of
the values for the last confidential variable, you would not be able
to predict the confidential value of remaining records with any
greater accuracy than you would using only the non-confidential
variables.
(Lack of) Information Loss

By maintaining the mean vector and covariance
matrix of the two data sets to be exactly the same,
for any statistical analysis for which the mean vector
and covariance matrix are sufficient statistics, we
ensure that the parameter estimates using the
masked data will be exactly the same as the
original data
Application to the Example
(Regression Analysis to predict Net Assets using all other variables)
Further Results
(Principal Components – Eigen values)
Further Results
(Principal Components – Eigen vectors)
But …

Unfortunately, the marginal distribution of the
original data set is altered significantly. In most
situations, the marginal distribution of the masked
variable bears little or no relationship to the
original variable
The data could also have negative values when the
original variable had only positive values
Marginal Distribution of Home Value
Negative values that
did not exist in the
original data
• The change in the marginal distribution means that other
analyses pertaining to the distribution of the confidential
variables are not maintained
– Residual analysis from regression would be very different
Non-Linear Relationships

Since a linear model is used, any non-linear
relationships that may have been present in the
data are modified (linearized)
GADP … Useful … But …


GADP is useful in a limited context. If the
confidential variables do not exhibit significant
deviations from normality, then GADP would
represent a good solution to the problem
In other cases, GADP represents a limited solution to
the specific users who will use the data mainly for
traditional statistical analysis
Improving GADP

We would like the masking procedure to provide
some additional benefits (while still minimizing
disclosure risk)
Maintain the marginal distribution
 Maintain non-linear relationships


To do this, we need to move beyond linear models

Multiplicative models are not very useful since, in essence,
they are just variations of the linear model
Copula Based GADP


In statistics, copulas have traditionally been used to
model the joint distribution of a set of variables with
arbitrary marginal distributions and specified
dependence characteristics
Copulas offer the ability to maintain the marginal, non-normal
distribution of the original attributes to be the same
after masking and to preserve certain types of
dependence between the attributes
Data Masking using the Multivariate
Normal Copula
Characteristics of the C-GADP


C-GADP minimizes disclosure risk
C-GADP provides the following information loss
characteristics



The marginal distribution of the confidential variables is
maintained
All monotonic relationships are preserved
 Rank order correlation
 Product moment correlation
Non-monotonic relationships will be modified
An Important Extension

Consider a situation where we have a confidential
variable X and a set of non-confidential variables S. If we
assume that the MV Copula is appropriate for modeling
the data, then the perturbed data Y can be viewed as an
independent realization from f(X|S). The marginal of Y is
simply a different realization from the same marginal as
X. This being the case, reverse map the original values of
X in place of the masked values Y. Now the “values” of Y
are the same as that of X, but they have been “shuffled”.
Data Shuffling
(US Patent 7200757)

In the above, we use the multivariate normal copula
to generate YP.
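A simplified sketch of the shuffling idea for one confidential variable X and one non-confidential variable S. It is an illustration under stated assumptions, not the patented procedure: the copula correlation is estimated here as the product moment correlation of the normal scores, and all function names and data are hypothetical. Perturbed normal scores are generated from S alone, and the original X values are then reverse mapped onto them by rank.

```python
import math
import random
from statistics import NormalDist

def ranks(v):
    """Rank of each element (1 = smallest); ties are not expected here."""
    order = sorted(range(len(v)), key=v.__getitem__)
    r = [0] * len(v)
    for position, index in enumerate(order, start=1):
        r[index] = position
    return r

def pearson(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def normal_scores(v):
    """Map ranks to standard normal quantiles (the normal copula transform)."""
    inv_cdf = NormalDist().inv_cdf
    n = len(v)
    return [inv_cdf(r / (n + 1)) for r in ranks(v)]

def shuffle_values(x, s, rng):
    zx, zs = normal_scores(x), normal_scores(s)
    rho = pearson(zx, zs)
    # Perturbed scores: a draw from the conditional normal given S only.
    y_star = [rho * z + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0) for z in zs]
    # Reverse mapping: the record with the j-th smallest score gets the j-th smallest x.
    x_sorted = sorted(x)
    return [x_sorted[r - 1] for r in ranks(y_star)]

rng = random.Random(7)
s = [rng.gauss(0.0, 1.0) for _ in range(5000)]
x = [math.exp(0.8 * a + 0.6 * rng.gauss(0.0, 1.0)) for a in s]   # skewed, related to S
y = shuffle_values(x, s, rng)

before = pearson(ranks(x), ranks(s))
after = pearson(ranks(y), ranks(s))
print(f"rank order correlation with S: original {before:.3f}, shuffled {after:.3f}")
```

Only ranks and normal scores enter the computation, so the released values are the original values in a different order.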
Characteristics of Data Shuffling

Offers all the benefits of C-GADP
Minimum disclosure risk
 Information loss




Maintains the marginal distribution
Maintains all monotonic relationships
Additional benefits



There is no “modification” of the values. The original values are used
The marginal distribution of the masked data is exactly the same as
the original data
Implementation can be performed using only the ranks
A small example

Some shuffled values are far
apart, others are closer



Impossible to predict original
position after the fact which
assures low disclosure risk
Rank order correlation pre and
post masking are very close.
Improves with the size of the
data set
X is less correlated with Y and
more correlated with S
Data Shuffling on the Running Example
Maintaining Relationships
Maintaining Relationships
Advantages of Data Shuffling

Data shuffling is a hybrid (perturbation and
swapping), non-parametric (can be implemented only
with rank information) technique for data masking
that minimizes disclosure risk and offers the lowest
level of information loss among existing methods of
data masking
Will not maintain non-monotonic relationships
 Does not preserve tail dependence


Can be overcome by using t-copula instead of normal copula
Practically Viable

Data shuffling can be implemented easily even for
relatively large data sets. We are in the process of
developing two versions of software based on Data
shuffling
Java based for large applications
Excel based for smaller applications
Future Research

Investigate other methods for modeling the joint
distribution of the variables to reduce information
loss further.
Other copula functions?
 Some other approach?


Investigate non-statistical approaches for producing
a masked data set that closely resembles the
original data (while minimizing disclosure risk)
Masking methods for discrete numerical data
Some Important References






Dalenius, T., “Towards a methodology for statistical disclosure control,”
Statistisk Tidskrift, 5, 429–444, 1977.
Fuller, W. A., “Masking procedures for microdata disclosure limitation,” Journal of
Official Statistics, 9, 383–406, 1993.
Rubin, D. B., “Discussion of statistical disclosure limitation,” Journal of Official
Statistics, 9, 461–468, 1993.
Moore, R. A., “Controlled data swapping for masking public use microdata sets,”
Research report series no. RR96/04, U.S. Census Bureau, Statistical Research
Division, Washington, D.C., 1996.
Burridge, J., “Information preserving statistical obfuscation,” Statistics and
Computing, 13, 321–327, 2003.
Domingo-Ferrer, J. and J.M. Mateo-Sanz, “Practical data-oriented
microaggregation for statistical disclosure control,” IEEE Transactions on Knowledge
and Data Engineering, 14, 189-201, 2002.
Our Publications Relating to Data Masking







Muralidhar, K. and R. Sarathy, " Generating Sufficiency-based Non-Synthetic
Perturbed Data," Transactions on Data Privacy, 1(1), 17-33, 2008.
Muralidhar, K. and R. Sarathy, "Data Shuffling- A New Masking Approach for
Numerical Data," Management Science, 52(5), 658-670, 2006.
Muralidhar, K. and R. Sarathy, “A Comparison of Multiple Imputation and Data
Perturbation for Masking Numerical Variables,” Journal of Official Statistics, 22(3),
507-524, 2006.
Muralidhar, K. and R. Sarathy, " A Theoretical Basis for Perturbation Methods,"
Statistics and Computing, 13(4), 329-335, 2003.
Sarathy, R., K. Muralidhar, and R. Parsa, "Perturbing Non-Normal Confidential
Attributes: The Copula Approach," Management Science, 48(12), 1613-1627, 2002.
Muralidhar, K., R. Parsa, and R. Sarathy, "A General Additive Data Perturbation
Method for Database Security," Management Science, 45(10), 1399-1415, 1999.
Muralidhar, K., D. Batra, and P. Kirs, “Accessibility, Security, and Accuracy in
Statistical Databases: The Case for the Multiplicative Fixed Data Perturbation
Approach,” Management Science, 41(9), 1549-1564,1995.
Other Related Research

Assessing disclosure risk




Muralidhar, K. and R. Sarathy, "Security of Random Data Perturbation
Methods," ACM Transactions on Database Systems, 24(4), 487-493, 1999.
Sarathy, R. and K. Muralidhar, "The Security of Confidential Numerical Data in
Databases," Information Systems Research, 13(4), 389-403, 2002.
Li, H., K. Muralidhar, and R. Sarathy, “Assessment of Disclosure Risk when using
Confidentiality via Camouflage,” Operations Research, 55(6), 1178-1182,
2007.
Framework for evaluating masking techniques

Muralidhar, K. and R. Sarathy, “A Theoretical Comparison of Data Masking
Techniques for Numerical Microdata,” to be presented at the 3rd IAB
Workshop on Confidentiality and Disclosure - SDC for Microdata, Nuremberg,
Germany, 2008
Web Site URL

You can find many of our papers and presentations at our
web site:
http://gatton.uky.edu/faculty/muralidhar/maskingpapers/

I will be happy to share any papers or presentations that
are not available on the web site.
Conclusion

There are a host of techniques that are available for masking
numerical data. These techniques have a long history in the
statistical disclosure limitation literature. There is considerable
overlap between the data masking research in the statistical
disclosure limitation research community and the privacy
preserving data mining research in the CS community.
Unfortunately, there seems to be only a limited cooperation
between the researchers in the two fields. I believe that each
field can make a significant contribution to the other. I hope
that this presentation contributes to enhancing the discussion
between CS and SDL researchers … at least at UK.
Questions,
Suggestions or
Comments?
Thank you