L. Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10 (5), 2002; 571-588.
ACHIEVING k-ANONYMITY PRIVACY PROTECTION
USING GENERALIZATION AND SUPPRESSION1
LATANYA SWEENEY
School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
E-mail: latanya@cs.cmu.edu
Received May 2002
Often a data holder, such as a hospital or bank, needs to share person-specific records in
such a way that the identities of the individuals who are the subjects of the data cannot be
determined. One way to achieve this is to have the released records adhere to
k-anonymity, which means each released record has at least (k-1) other records in the
release whose values are indistinct over those fields that appear in external data. So,
k-anonymity provides privacy protection by guaranteeing that each released record will
relate to at least k individuals even if the records are directly linked to external
information. This paper provides a formal presentation of combining generalization and
suppression to achieve k-anonymity. Generalization involves replacing (or recoding) a
value with a less specific but semantically consistent value. Suppression involves not
releasing a value at all. The Preferred Minimal Generalization Algorithm (MinGen),
which is a theoretical algorithm presented herein, combines these techniques to provide
k-anonymity protection with minimal distortion. The real-world algorithms Datafly and
µ-Argus are compared to MinGen. Both Datafly and µ-Argus use heuristics to make
approximations, and so, they do not always yield optimal results. It is shown that Datafly
can over distort data and µ-Argus can additionally fail to provide adequate protection.
Keywords: data anonymity, data privacy, re-identification, data fusion, privacy.
1. Introduction
Today’s globally networked society places great demand on the collection and
sharing of person-specific data for many new uses [1]. This happens at a time
when more and more historically public information is also electronically
available. When these data are linked together, they provide an electronic image
of a person that is as identifying and personal as a fingerprint even when the
information contains no explicit identifiers, such as name and phone number.
Other distinctive data, such as birth date and postal code, often combine uniquely
[2] and can be linked to publicly available information to re-identify individuals.
1. This paper significantly amends and expands the earlier paper "Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression" (with Samarati) submitted to IEEE Security and Privacy 1998, and extends parts of my Ph.D. thesis [10].
So in today’s technically empowered, data-rich environment, how does a data
holder, such as a medical institution, public health agency, or financial
organization, share person-specific records in such a way that the released
information remains practically useful but the identity of the individuals who are
the subjects of the data cannot be determined? One way to achieve this is to have
the released information adhere to k-anonymity [3]. A release of data is said to
adhere to k-anonymity if each released record has at least (k-1) other records also
visible in the release whose values are indistinct over a special set of fields called
the quasi-identifier [4]. The quasi-identifier contains those fields that are likely to
appear in other known data sets. Therefore, k-anonymity provides privacy
protection by guaranteeing that each record relates to at least k individuals even if
the released records are directly linked (or matched) to external information.
This paper provides a formal presentation of achieving k-anonymity using
generalization and suppression. Generalization involves replacing (or recoding) a
value with a less specific but semantically consistent value. Suppression involves
not releasing a value at all. While there are numerous techniques available2,
combining these two offers several advantages.
First, a recipient of the data can be told what was done to the data. This
allows results drawn from released data to be properly interpreted. Second,
information reported on each person is “truthful” which makes resulting data
useful for fraud detection, counter-terrorism surveillance, healthcare outcome
assessments and other uses involving traceable person-specific patterns3. Third,
these techniques can provide results with guarantees of anonymity that are
minimally distorted. Any attempt to provide anonymity protection, no matter how
minor, involves modifying the data and thereby distorting its contents, so the goal
is to distort minimally. Fourth, these techniques can be used with preferences a
recipient of the released data may have, thereby providing the most useful data
possible. In this way, algorithmic decisions about how to distort the data can have
minimal impact on the data’s fitness for a particular task.
Finally, the real-world systems Datafly [5] and µ-Argus [6], which are
discussed in subsequent sections, use these techniques to achieve k-anonymity.
Therefore, this work provides a formal basis for comparing them.
2. Background
The ideas of k-anonymity and of a quasi-identifier are straightforward.
Nevertheless, care must be taken to precisely state what is meant. [3] provides a
detailed discussion of k-anonymity. A brief summary is provided in this section as
background for the upcoming presentations on generalization and suppression.
2. See Willenborg and De Waal [2] for a list of traditional statistical techniques.
3. In contrast, other techniques (e.g., additive noise) can destroy the "truthfulness" of the information reported on a person even though they can maintain aggregate statistical properties.
Unless otherwise stated, the term data refers to person-specific information
that is conceptually organized as a table of rows (or records) and columns (or
fields). Each row is termed a tuple. Tuples within a table are not necessarily
unique. Each column is called an attribute and denotes a semantic category of
information that is a set of possible values; therefore, an attribute is also a domain.
Attributes within a table are unique. So by observing a table, each row is an
ordered n-tuple of values <d1, d2, …, dn> such that each value dj is in the domain
of the j-th column, for j=1, 2, …, n where n is the number of columns. This
corresponds to relational database concepts [7].
Let B(A1,…,An) be a table with a finite number of tuples. The finite set of
attributes of B are {A1,…,An}. Given a table B(A1,…,An), {Ai,…,Aj} ⊆ {A1,…,An},
and a tuple t∈B, I use t[Ai,…,Aj] to denote the sequence of the values, vi,…,vj, of
Ai,…,Aj in t. I use B[Ai,…,Aj] to denote the projection, maintaining duplicate
tuples, of attributes Ai,…Aj in B.
Throughout this work each tuple is assumed to be specific to a person and no
two tuples pertain to the same person. This assumption simplifies discussion
without loss of applicability. Also, this discussion focuses on protecting identity
in person-specific data, but is just as applicable to protecting other kinds of
information about other kinds of entities (e.g., companies or governments).
Limiting the ability to link (or match) released data to other external
information offers privacy protection. Attributes in the private information that
could be used for linking with external information are termed the quasi-identifier.
Such attributes not only include explicit identifiers such as name, address, and
phone number, but also include attributes that in combination can uniquely
identify individuals such as birth date, ZIP4, and gender [8]. A goal of this work is
to release person-specific data such that the ability to link to other information
using the quasi-identifier is limited.
Definition 1. Quasi-identifier
Given a population U, a person-specific table T(A1,…,An), fc: U → T and fg: T
→ U', where U ⊆ U'. A quasi-identifier of T, written QT, is a set of attributes
{Ai,…,Aj} ⊆ {A1,…,An} where: ∃pi∈U such that fg(fc(pi)[QT]) = pi.
Definition 2. k-anonymity
Let RT(A1,...,An) be a table and QIRT be the quasi-identifier associated with it.
RT is said to satisfy k-anonymity if and only if each sequence of values in
RT[QIRT] appears with at least k occurrences in RT[QIRT].
4. In the United States, a ZIP code refers to the postal code. Typically 5-digit ZIP codes are used, though 9-digit ZIP codes have been assigned. A 5-digit code is the first 5 digits of the 9-digit code.
Example 1. Table adhering to k-anonymity
Figure 1 contains table T, which adheres to k-anonymity. The quasi-identifier
is QIT= {Race, Birth, Gender, ZIP} and k=2. Therefore, for each of the tuples
contained in T, the values that comprise the quasi-identifier appear at least
twice in T. In particular, t1[QIT] = t2[QIT], t3[QIT] = t4[QIT], and t5[QIT] =
t6[QIT] = t7[QIT].
      Race   Birth  Gender  ZIP    Problem
t1    Black  1965   m       02141  short breath
t2    Black  1965   m       02141  chest pain
t3    Black  1964   f       02138  obesity
t4    Black  1964   f       02138  chest pain
t5    White  1964   m       02138  chest pain
t6    White  1964   m       02138  obesity
t7    White  1964   m       02138  short breath

Figure 1 Example of k-anonymity, where k=2 and QI={Race, Birth, Gender, ZIP}
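To make Definition 2 concrete, the following Python sketch checks the k-anonymity condition by counting quasi-identifier value sequences; the attribute names and the dict-per-tuple layout are illustrative choices, not part of the paper.

from collections import Counter

def satisfies_k_anonymity(table, qi, k):
    # Definition 2: every sequence of values over the quasi-identifier
    # must occur at least k times in the projection table[QI].
    counts = Counter(tuple(row[a] for a in qi) for row in table)
    return all(c >= k for c in counts.values())

# Table T from Figure 1; Problem lies outside the quasi-identifier.
T = [
    {"Race": "Black", "Birth": 1965, "Gender": "m", "ZIP": "02141"},
    {"Race": "Black", "Birth": 1965, "Gender": "m", "ZIP": "02141"},
    {"Race": "Black", "Birth": 1964, "Gender": "f", "ZIP": "02138"},
    {"Race": "Black", "Birth": 1964, "Gender": "f", "ZIP": "02138"},
    {"Race": "White", "Birth": 1964, "Gender": "m", "ZIP": "02138"},
    {"Race": "White", "Birth": 1964, "Gender": "m", "ZIP": "02138"},
    {"Race": "White", "Birth": 1964, "Gender": "m", "ZIP": "02138"},
]
QI = ["Race", "Birth", "Gender", "ZIP"]
assert satisfies_k_anonymity(T, QI, k=2)      # holds, as in Example 1
assert not satisfies_k_anonymity(T, QI, k=3)  # two groups have only 2 tuples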
Theorem 1
Let RT(A1,...,An) be a table, QIRT =(Ai,…, Aj) be the quasi-identifier associated
with RT, Ai,…,Aj ⊆ A1,…,An, and RT satisfy k-anonymity. Then, each
sequence of values in RT[Ax] appears with at least k occurrences in RT[QIRT]
for x=i,…,j.
Example 2. k occurrences of each value under k-anonymity
Table T in Figure 1 adheres to k-anonymity. Therefore, each value associated
with an attribute of QI in T appears at least k times. |T[Race ="black"]| = 4.
|T[Race ="white"]| = 3. |T[Birth ="1964"]| = 5. |T[Birth ="1965"]| = 2.
|T[Gender ="m"]| = 5. |T[Gender ="f"]| = 2. |T[ZIP ="02138"]| = 5. And,
|T[ZIP ="02141"]| = 2.
It can be trivially proven that if the released data RT satisfies k-anonymity with
respect to the quasi-identifier QIPT, then the combination of the released data RT
and the external sources on which QIPT was based, cannot link on QIPT or a subset
of its attributes to match fewer than k individuals.
3. Methods
In this section I present formal notions of: (1) generalization incorporating
suppression; (2) minimal generalization; and, (3) minimal distortion. The
Preferred Minimal Generalization Algorithm (MinGen), which ends this section,
combines these notions into a theoretical algorithm that uses generalization and
suppression to produce tables that adhere to k-anonymity with minimal distortion.
3.1. Generalization including suppression
The idea of generalizing an attribute is a simple concept. A value is replaced by a
less specific, more general value that is faithful to the original. In Figure 2 the
original ZIP codes {02138, 02139} can be generalized to 0213*, thereby stripping
the rightmost digit and semantically indicating a larger geographical area.
In a classical relational database system, domains are used to describe the set
of values that attributes assume. For example, there might be a ZIP domain, a
number domain and a string domain. I extend this notion of a domain to make it
easier to describe how to generalize the values of an attribute. In the original
database, where every value is as specific as possible, every attribute is considered
to be in a ground domain. For example, 02139 is in the ground ZIP domain, Z0.
In order to achieve k-anonymity I can make ZIP codes less informative. I do this
by saying that there is a more general, less specific domain that can be used to
describe ZIPs, say Z1, in which the last digit has been replaced by 0 (or removed
altogether). There is also a mapping from Z0 to Z1, such as 02139 → 0213*.
Given an attribute A, I say a generalization for an attribute is a function on A.
That is, each f: A → B is a generalization. I also say that:
$$A_0 \xrightarrow{\;f_0\;} A_1 \xrightarrow{\;f_1\;} \cdots \xrightarrow{\;f_{n-1}\;} A_n$$

is a generalization sequence or a functional generalization sequence.
Given an attribute A of a private table PT, I define a domain generalization
hierarchy DGHA for A as a set of functions fh : h=0,…,n-1 such that:
$$A = A_0 \xrightarrow{\;f_0\;} A_1 \xrightarrow{\;f_1\;} \cdots \xrightarrow{\;f_{n-1}\;} A_n$$

with $|A_n| = 1$. $DGH_A$ is over $\bigcup_{h=0}^{n} A_h$.
Clearly, the fh’s impose a linear ordering on the Ah’s where the minimal element is
the ground domain A0 and the maximal element is An. The singleton requirement
on An ensures that all values associated with an attribute can eventually be
generalized to a single value. In this presentation I assume Ah, h=0,…,n, are
disjoint; if an implementation is to the contrary and there are elements in common,
then DGHA is over the disjoint sum of Ah’s and definitions change accordingly.
Given a domain generalization hierarchy DGHA for an attribute A, if vi∈Ai
and vj∈Aj then I say vi ≤ vj if and only if i ≤ j and:
$$f_{j-1}(\cdots f_i(v_i) \cdots) = v_j$$

This defines a partial ordering ≤ on $\bigcup_{h=0}^{n} A_h$.
Such a relationship implies the existence of a value generalization hierarchy
VGHA for attribute A.
I expand my representation of generalization to include suppression by
imposing on each value generalization hierarchy a new maximal element, atop the
old maximal element. The new maximal element is the attribute's suppressed
value. The height of each value generalization hierarchy is thereby incremented
by one. No other changes are necessary to incorporate suppression. Figure 2 and
Figure 3 provide examples of domain and value generalization hierarchies
expanded to include the suppressed maximal element (*****). In this example,
domain Z0 represents ZIP codes for Cambridge, MA, and E0 represents race.
From now on, all references to generalization include the new maximal element;
and, hierarchy refers to domain generalization hierarchies unless otherwise noted.
DGH_Z0:  Z0 = {02138, 02139, 02141, 02142}  →  Z1 = {0213*, 0214*}  →  Z2 = {021**}  →  Z3 = {*****}

VGH_Z0:  02138, 02139 → 0213*;  02141, 02142 → 0214*;  0213*, 0214* → 021**;  021** → *****

Figure 2 ZIP domain and value generalization hierarchies including suppression
DGH_E0:  E0 = {Asian, Black, White}  →  E1 = {Person}  →  E2 = {******}

VGH_E0:  Asian, Black, White → Person;  Person → ******

Figure 3 Race domain and value generalization hierarchies including suppression
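One way to encode these hierarchies in Python is as a list of functions [f0, …, fn-1] per attribute, so that composing the first h functions walks a ground value up h levels of the corresponding value generalization hierarchy, with the suppressed value at the top; the function names and this encoding are my own illustrative assumptions.

# DGH_Z0 of Figure 2: Z0 -> Z1 -> Z2 -> Z3, where Z3 is the suppressed value.
def zip_f0(z): return z[:4] + "*"     # 02139 -> 0213*
def zip_f1(z): return z[:3] + "**"    # 0213* -> 021**
def zip_f2(z): return "*****"         # 021** -> *****  (suppression)

# DGH_E0 of Figure 3: E0 -> E1 -> E2, where E2 is the suppressed value.
def race_f0(r): return "Person"       # Asian/Black/White -> Person
def race_f1(r): return "******"       # Person -> ******  (suppression)

DGH = {"ZIP": [zip_f0, zip_f1, zip_f2], "Race": [race_f0, race_f1]}

def generalize(value, attribute, h):
    # Walk a ground value up h levels of its value generalization hierarchy.
    for f in DGH[attribute][:h]:
        value = f(value)
    return value

assert generalize("02139", "ZIP", 1) == "0213*"
assert generalize("02139", "ZIP", 3) == "*****"
assert generalize("Black", "Race", 1) == "Person"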
3.2. Minimal generalization of a table
Given table PT, generalization can be effective in producing a table RT based on
PT that adheres to k-anonymity because values in RT are substituted with their
generalized replacements. The number of distinct values associated with each
attribute is non-increasing, so the substitution tends to map values to the same
result, thereby possibly decreasing the number of distinct tuples in RT.
A generalization function on tuple t with respect to A1,…, An is a function ft on
A1×…×An such that: $f_t(A_1, \ldots, A_n) = (f_{t1}(A_1), \ldots, f_{tn}(A_n))$
where for each i: 1,…,n, fti is a generalization of the value t[Ai]. The function ft is
a set function. I say ft is generated by the fti’s.
Given f, A1,…,An, a table T(A1,…,An) and a tuple t∈T, i.e., t(a1,…,an):

$$g(T) = \{\, k \cdot f(t) \;:\; t \in T \text{ and } |f^{-1}(f(t))| = k \,\}$$
The function g is a multi-set function and f-1 is the inverse function of f. I say that
g is the multi-set function generated by f and by the fi’s. Further, I say that g(T) is
a generalization of table T. This does not mean, however, that the generalization
respects the value generalization hierarchy for each attribute in T. To determine
whether one table is a generalization with respect to the value generalization
hierarchy of each attribute requires analyzing the values themselves.
Let DGHi be the domain generalization hierarchies for attributes Ali where
i=1,…,n. Let Tl[Al1,…,Aln] and Tm[Am1,…,Amn] be two tables such that for each
i:1,…,n, Ali,Ami∈DGHi. Then, I say table Tm is a generalization of table Tl,
written Tl ≤ Tm, if and only if there exists a generalization function g such that
g[Tl] = Tm and is generated by fi’s where: ∀tl∈Tl, ali ≤ fi(ali) = ami and fi : Ali → Ami
and each fi is in the DGHi of attribute Ali. From this point forward, I will use the
term generalization (as a noun) to denote a generalization of a table.
PT: E0, Z0          GT[1,0]: E1, Z0       GT[1,1]: E1, Z1
Race   ZIP          Race    ZIP           Race    ZIP
Black  02138        Person  02138         Person  0213*
Black  02139        Person  02139         Person  0213*
Black  02141        Person  02141         Person  0214*
Black  02142        Person  02142         Person  0214*
White  02138        Person  02138         Person  0213*
White  02139        Person  02139         Person  0213*
White  02141        Person  02141         Person  0214*
White  02142        Person  02142         Person  0214*

GT[0,2]: E0, Z2     GT[0,1]: E0, Z1
Race   ZIP          Race   ZIP
Black  021**        Black  0213*
Black  021**        Black  0213*
Black  021**        Black  0214*
Black  021**        Black  0214*
White  021**        White  0213*
White  021**        White  0213*
White  021**        White  0214*
White  021**        White  0214*

Figure 4 Examples of generalized tables for PT
Definition 3. k-anonymity requirement of a generalized table
Let PT(A1,…,An) be a table, QIPT={Ai,…,Aj}, be the quasi-identifier
associated with PT where {Ai,…,Aj} ⊆ {A1,…,An}, RT(Ai,…,Aj) be a
generalization of PT with respect to QIPT, t∈RT[QIPT] and kt be the integer
denoted in g for f(t). RT is said to satisfy k-anonymity for k with respect to
QIPT if ∀t∈RT[QIPT], kt ≥ k and k ≥ 2.
The k-anonymity requirement guarantees each tuple in the generalized table
RT[QIPT] is indistinguishable from at least k-1 other tuples in table RT[QIPT].
Example 3. k-anonymity requirement of a generalized table
Consider PT and its generalized tables in Figure 4 and the hierarchies in
Figure 2 and Figure 3. GT[0,1] and GT[1,0] satisfy k-anonymity for k = 2; and,
GT[0,2] and GT[1,1] satisfy k-anonymity for k = 2, 3, 4.
The number of different generalizations of a table T, when generalization is
enforced at the attribute level, is equal to the number of different combinations of
domains that the attributes in the table can assume. Given domain generalization
hierarchies DGHi for attributes Ai, i:1,…,n; the number of generalizations,
enforced at the attribute level, for table T(A1,…,An) is:
$$\prod_{i=1}^{n} \left( |DGH_{A_i}| + 1 \right) \qquad \text{(Equation 1)}$$
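For example, with the hierarchies of Figure 2 and Figure 3, |DGH_Race| = 2 and |DGH_ZIP| = 3, so a table over {Race, ZIP} admits (2+1)·(3+1) = 12 attribute-level generalizations; the five tables of Figure 4 are among them.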
Clearly, not all such generalizations are equally satisfactory. A generalization
whose values result from generalizing each attribute to the highest possible level
collapses all tuples in the table to the same list of values. This provides k-anonymity using more generalization than needed if a less generalized table exists
which satisfies k-anonymity. This concept is captured in the following definition.
Definition 4. k-minimal generalization
Let Tl(A1,…,An) and Tm(A1,…,An) be two tables such that Tl[QIT] ≤ Tm[QIT],
where QIT={Ai,…,Aj} is the quasi-identifier associated with the tables and
{Ai,…,Aj} ⊆ {A1,…,An}. Tm is said to be a minimal generalization of a table Tl
with respect to a k anonymity requirement over QIT if and only if:
1. Tm satisfies the k-anonymity requirement with respect to QIT
2. ∀Tz: Tl ≤ Tz, Tz ≤ Tm, Tz satisfies the k-anonymity requirement
with respect to QIT ⇒ Tz[Ai,…,Aj] = Tm[Ai,…,Aj].
Example 4. k-minimal generalization
Figure 4 shows generalizations, enforced at the attribute level, of PT over
{Race, ZIP}. Each generalization satisfies k-anonymity for k=2. GT[0,1]
generalized ZIP one level. GT[1,0] generalized Race one level. So, GT[1,1] and
GT[0,2] did more generalization than necessary. GT[0,2] is a generalization of
GT[0,1]. GT[1,1] is a generalization of both GT[1,0] and GT[0,1].
So, table Tm, generalization of Tl, is k-minimal if it satisfies k-anonymity and there
does not exist a generalization of Tl satisfying k-anonymity of which Tm is a
generalization.
3.3. Minimal distortion of a table
When different k-minimal generalizations exist, preference criteria can be applied
to choose a solution among them. A way of preferring one to another is to select
one whose information is most useful. Generalizing a table T results in a table T'
that typically has less information than T; and so, T' is considered less useful. In
order to capture the information loss, I define an information theoretic metric that
reports the amount of distortion in a generalized table.5 In a cell of a generalized
table, the ratio of the level of the domain of the value found in the cell to the height of the
attribute’s hierarchy reports the amount of generalization and thereby measures
the cell’s distortion. Precision of a generalized table is then one minus the sum of
all cell distortions (normalized by the total number of cells), as defined below.
Definition 5. precision metric Prec
Let PT(A1,…,AN_A) be a table, RT(A1,…,AN_A) be a generalization of PT,
tPj∈PT and tRj∈RT be corresponding tuples, each DGHAi be the domain
generalization hierarchy for attribute Ai, and the f's be generalizations on Ai.
The precision of RT, written Prec(RT), based on generalization and suppression is:

$$\mathrm{Prec}(RT) = 1 - \frac{\sum_{i=1}^{N_A} \sum_{j=1}^{|PT|} \frac{h}{|DGH_{A_i}|}}{|PT| \cdot N_A}, \qquad \text{where } f_1(\cdots f_h(t_{Pj}[A_i]) \cdots) = t_{Rj}[A_i]$$
Example 5. precision metric
In the case where PT = RT, each value is in the ground domain so each h = 0;
therefore, Prec(RT) = 1. Conversely, in the case where each value in RT is
the maximal element of its hierarchy, each h =|DGHAi|; and so, Prec(RT) = 0.
Example 6. precision metric
Using the hierarchies in Figure 2 and Figure 3, the precision of the
generalizations of PT shown in Figure 4 are: Prec(GT[1,0]) = 0.75;
Prec(GT[1,1]) = 0.58; Prec(GT[0,2]) = 0.67; and, Prec(GT[0,1]) = 0.83. Each of
these satisfies k-anonymity for k=2, but GT[0,1] does so with the least distortion.
Notice GT[1,0] and GT[0,1] each generalize values up one level, but because
|DGHRace| = 2 and |DGHZIP| = 3, Prec(GT[0,1]) > Prec(GT[1,0]).
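The values in Example 6 can be reproduced with a few lines of Python. The sketch below assumes attribute-level generalization, so every cell in a column shares one level h; the function and variable names are mine, not the paper's.

def prec(levels, heights, n_tuples):
    # Definition 5 for attribute-level generalization: each of the
    # n_tuples cells in column a contributes levels[a] / heights[a],
    # normalized by the total number of cells |PT| * N_A.
    n_attrs = len(heights)
    distortion = sum(n_tuples * levels[a] / heights[a] for a in heights)
    return 1 - distortion / (n_tuples * n_attrs)

heights = {"Race": 2, "ZIP": 3}   # |DGH_Race| = 2, |DGH_ZIP| = 3
n = 8                             # PT in Figure 4 has 8 tuples
print(prec({"Race": 1, "ZIP": 0}, heights, n))  # GT[1,0]: 0.75
print(prec({"Race": 0, "ZIP": 1}, heights, n))  # GT[0,1]: 0.833...
print(prec({"Race": 1, "ZIP": 1}, heights, n))  # GT[1,1]: 0.583...
print(prec({"Race": 0, "ZIP": 2}, heights, n))  # GT[0,2]: 0.666...

Note that with one shared level per column, n_tuples cancels out of the ratio; it is kept only to mirror Definition 5, which sums over all cells.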
Generalizations based on attributes with taller generalization hierarchies typically
maintain precision better than generalizations based on attributes with shorter
hierarchies. Further, hierarchies with different heights can provide different Prec
measures for the same table. So, the construction of generalization hierarchies is
part of the preference criteria. Prec best measures the quality of the data when the
set of hierarchies used contain only values semantically useful. There is no need
to arbitrarily increase the heights of hierarchies solely to prefer one attribute to
another. Instead, weights can be assigned to attributes in Prec to make the
preference explicit [9].
5. While entropy is the classical measure used in information theory to characterize the purity of data [5], a metric based on the semantics of generalization can be more discriminating than the direct comparison of the encoding lengths of the values stored in the table.
As stated earlier, not all k-minimal generalizations are equally distorted and
preference can be based on the k-minimal generalization having the most
precision. This concept is captured by the following definition.
Definition 6. k-minimal distortion
Let Tl(A1,…,An) and Tm(A1,…,An) be two tables such that Tl[QIT] ≤ Tm[QIT],
where QIT={Ai,…,Aj} is the quasi-identifier associated with the tables and
{Ai,…,Aj} ⊆ {A1,…,An} and ∀x=i,…,j, DGHAx are domain generalization
hierarchies for QIT. Tm is said to be a minimal distortion of a table Tl with
respect to a k anonymity requirement over QIT if and only if:
1. Tm satisfies the k-anonymity requirement with respect to QIT
2. ∀Tz: Prec(Tl) ≥ Prec(Tz) ≥ Prec(Tm), Tz satisfies the k-anonymity
requirement with respect to QIT ⇒ Tz[Ai,…,Aj] = Tm[Ai,…,Aj].
Consider the values reported in Example 6 about the tables in Figure 4. Only
GT[0,1] is a k-minimal distortion of PT. A k-minimal distortion is specific to a
table, a quasi-identifier, a set of domain generalization hierarchies for the
attributes of the quasi-identifier, and Prec (or a weighted Prec).
It is trivial to see that a table that satisfies k-anonymity has a unique k-minimal distortion, which is itself. It is also easy to see that a generalized table
RT that is a k-minimal distortion of table PT is also a k-minimal generalization of
PT, as stated in the following theorem.
Theorem 2
Given tables Tl and Tm such that Tl ≤ Tm and Tm satisfies k-anonymity. Tm is a
k-minimal distortion of Tl ⇒ Tm is a k-minimal generalization of Tl.
3.4. Algorithm for finding a minimal generalization with minimal distortion
The algorithm presented in this section combines these formal definitions into a
theoretical model against which real-world systems will be compared.
Figure 5 presents an algorithm, called MinGen, which, given a table
PT(Ax,…,Ay), a quasi-identifier QI={A1,…,An}, where {A1,…,An} ⊆ {Ax,…,Ay}, a
k-anonymity constraint, and domain generalization hierarchies DGHAi, produces a
table MGT which is a k-minimal distortion of PT[QI]. It assumes that k < |PT|,
which is a necessary condition for the existence of a k-minimal generalization.
The steps of the MinGen algorithm are straightforward. [step 1] Determine if
the original table, PT, itself satisfies the k-anonymity requirement; and if so, it is
the k-minimal distortion. In all other cases execute step 2. [step 2.1] Store the set
of all possible generalizations of PT over QI into allgens. [step 2.2] Store those
generalizations from allgens that satisfy the k-anonymity requirement into
protected. [step 2.3] Store the k-minimal distortions (based on Prec) from
protected into MGT. It is guaranteed that |MGT| ≥ 1. [step 2.4] Finally, the
function preferred() returns a single k-minimal distortion from MGT based on
user-defined specifications6.
Input: Private Table PT; quasi-identifier QI = (A1, …, An),
k constraint; domain generalization hierarchies
DGHAi, where i=1,…,n, and preferred() specifications.
Output: MGT, a minimal distortion of PT[QI] with respect to k
chosen according to the preference specifications
Assumes: |PT |≥ k
Method:
1. if PT[QI] satisfies k-anonymity requirement with respect to k then do
1.1. MGT ← { PT } // PT is the solution
2. else do
2.1. allgens ← {Ti : Ti is a generalization of PT over QI}
2.2. protected ← {Ti : Ti ∈ allgens ∧ Ti satisfies k-anonymity of k}
2.3. MGT ← {Ti : Ti ∈ protected ∧ there does not exist Tz ∈ protected
such that Prec(Tz) > Prec(Ti) }
2.4. MGT ← preferred(MGT) // select the preferred solution
3. return MGT
Figure 5 Preferred Minimal Generalization (MinGen) Algorithm
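The following Python sketch mirrors steps 2.1-2.3 of Figure 5, restricted to attribute-level generalization so the search space stays at the size given by Equation 1; MinGen proper also enumerates cell-level generalizations. It reuses the list-of-functions hierarchy encoding sketched in Section 3.1, and the names are mine.

from collections import Counter
from itertools import product

def min_gen(table, qi, k, dgh):
    # dgh[a] is the list of generalization functions for attribute a.
    def apply_levels(levels):
        out = []
        for row in table:
            new = dict(row)
            for a, h in levels.items():
                for f in dgh[a][:h]:
                    new[a] = f(new[a])
            out.append(new)
        return out

    def k_anonymous(rows):
        counts = Counter(tuple(r[a] for a in qi) for r in rows)
        return all(c >= k for c in counts.values())

    def precision(levels):
        # Attribute-level Prec: one shared level per column.
        return 1 - sum(levels[a] / len(dgh[a]) for a in qi) / len(qi)

    # Steps 2.1-2.2: enumerate every combination of hierarchy levels
    # (the all-zero combination covers step 1) and keep the protected ones.
    protected = []
    for combo in product(*(range(len(dgh[a]) + 1) for a in qi)):
        levels = dict(zip(qi, combo))
        if k_anonymous(apply_levels(levels)):
            protected.append((precision(levels), levels))

    # Step 2.3: keep the k-minimal distortions; protected is non-empty
    # whenever k <= |PT|, since full generalization collapses all tuples.
    best = max(p for p, _ in protected)
    return [levels for p, levels in protected if p == best]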
Example 7. MinGen produces k-minimal distortions
Let the hierarchies in Figure 2, Figure 3 and Figure 6 for the quasi-identifier
QI={Race, BirthDate, Gender, ZIP}, the table PT in Figure 7, and a k-anonymity constraint of k=2 be provided to MinGen. After step 2.3 of
MinGen, MGT= {GT}, where GT is shown in Figure 7. GT is a k-minimal
distortion of PT over QI and a k-minimal generalization of PT over QI.
It can be proved that a generalization of a table T over a quasi-identifier QI, that
satisfies a given k-anonymity requirement, and has the least amount of distortion
of all possible generalizations of T over QI, is a k-minimal distortion of T over QI
with respect to Prec. From Theorem 2, the solution is also a k-minimal
generalization of T over QI.
With respect to complexity, MinGen makes no claim to be efficient. If
generalization is enforced at the attribute level, the number of possible
6. The preferred() function returns only one table as a solution. The single solution requirement is a necessary condition because the chosen solution becomes part of the join of external information against which subsequent linking must be protected.
generalizations, |allgens|, is expressed in Equation 1. If generalization is enforced
at the cell level:

$$|allgens| = \prod_{i=1}^{n} \left( |DGH_{A_i}| + 1 \right)^{|PT|} \qquad \text{(Equation 2)}$$
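For example, even for the 8-tuple table PT of Figure 4 over {Race, ZIP}, Equation 2 gives (2+1)^8 · (3+1)^8 = 12^8 ≈ 4.3 × 10^8 candidate generalizations.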
Clearly, an exhaustive search of all possible generalizations is impractical even on
modest sized tables. So, how do real-world systems find solutions in real-time?
BirthDate: full date → month/year → 1 year range (1960, 1961, …, 1969) → 5 year ranges (1960-64, 1965-69) → 10 year range (1960-69) → suppressed value (********)

Figure 6 Value generalization hierarchies for {Gender, BirthDate} with suppression
PT (Prec = 1.00)
Race   BirthDate   Gender  ZIP    Problem
black  9/20/1965   male    02141  short of breath
black  2/14/1965   male    02141  chest pain
black  10/23/1965  female  02138  painful eye
black  8/24/1965   female  02138  wheezing
black  11/7/1964   female  02138  obesity
black  12/1/1964   female  02138  chest pain
white  10/23/1964  male    02138  short of breath
white  3/15/1965   female  02139  hypertension
white  8/13/1964   male    02139  obesity
white  5/5/1964    male    02139  fever
white  2/13/1967   male    02138  vomiting
white  3/21/1967   male    02138  back pain

GT (Prec = 0.90)
Race    BirthDate  Gender  ZIP    Problem
black   1965       male    02141  short of breath
black   1965       male    02141  chest pain
person  1965       female  0213*  painful eye
person  1965       female  0213*  wheezing
black   1964       female  02138  obesity
black   1964       female  02138  chest pain
white   1960-69    male    02138  short of breath
person  1965       female  0213*  hypertension
white   1964       male    02139  obesity
white   1964       male    02139  fever
white   1960-69    male    02138  vomiting
white   1960-69    male    02138  back pain

Figure 7 k-minimal distortion for PT where k=2
4. Real-world results
Here are two real-world systems that seek to provide k-anonymity protection using
generalization and suppression. They are: (1) my Datafly7 System [5]; and, (2)
Statistics Netherlands' µ-Argus System [6]. This section shows that Datafly can
over distort the data and that µ-Argus can fail to provide adequate protection.
7. The systems Datafly and Datafly II refer to two kindred algorithms. The differences between them do not substantially alter the findings reported herein, so in this writing, the term Datafly refers to a simplified abstraction of these algorithms known as the core Datafly algorithm [9].
4.1. The Datafly System
Here is a summary of the setting in which the core Datafly algorithm operates.
The data holder (1) declares specific attributes and tuples in the original private
table (PT) as being eligible for release. The data holder also (2) groups a subset
of attributes of PT into one or more quasi-identifiers (QIi) and assigns (3) a weight
from 0 to 1 to each attribute within each QIi that specifies the likelihood the
attribute will be used for linking; a 0 value means not likely and a value of 1
means highly probable. The data holder (4) specifies a minimum anonymity level
that computes to a value for k. Finally, (5) a weight from 0 to 1 is assigned to
each attribute within each QIi to state a preference of which attributes to distort; a
0 value means the recipient of the data would prefer the values not to be changed
and a value of 1 means maximum distortion could be tolerated. For convenience
in this discussion, I remove many of the finer features. I consider a single quasi-identifier, where all attributes of the quasi-identifier have equal preference and an
equal likelihood for linking, so the weights can be considered as not being present.
Figure 8 presents the core Datafly algorithm, which, given a table
PT(Ax,…,Ay), a quasi-identifier QI={A1,…,An}, where {A1,…,An} ⊆ {Ax,…,Ay}, a
k-anonymity constraint, and domain generalization hierarchies DGHAi, produces a
table MGT which is a generalization of PT[QI] that satisfies k-anonymity. It
assumes that k < |PT|, which is a necessary condition for satisfying k-anonymity.
Input: Private Table PT; quasi-identifier QI = (A1, …, An),
k constraint; hierarchies DGHAi, where i=1,…,n.
Output: MGT, a generalization of PT[QI] with respect to k
Assumes: |PT |≥ k
Method:
1. freq ← a frequency list containing distinct sequences of values of PT[QI],
along with the number of occurrences of each sequence.
2. while there exist sequences in freq occurring less than k times
that account for more than k tuples do
2.1. let Aj be attribute in freq having the most number of distinct values
2.2. freq ← generalize the values of Aj in freq
3. freq ← suppress sequences in freq occurring less than k times.
4. freq ← enforce k requirement on suppressed tuples in freq.
5. Return MGT ← construct table from freq
Figure 8 Core Datafly Algorithm
The core Datafly algorithm has few steps. Step 1 constructs freq, which is a
frequency list containing distinct sequences of values from PT[QI], along with the
number of occurrences of each sequence. Each sequence in freq represents one or
more tuples in a table. Step 2.1 uses a heuristic to guide generalization. The
attribute having the most number of distinct values in freq is generalized.
Generalization continues until there remains k or fewer tuples having distinct
sequences in freq. Step 3 suppresses any sequences of freq occurring less than k
times. Complementary suppression is performed in step 4 so that the number of
suppressed tuples satisfies the k requirement. Finally, step 5 produces a table
MGT, based on freq such that the values stored as a sequence in freq appear as
tuple(s) in MGT replicated in accordance with the stored frequency.
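A Python sketch of the core algorithm follows, reusing the list-of-functions hierarchy encoding from Section 3.1; the names are mine. For brevity it drops the undersized equivalence classes outright in steps 3-4, whereas Datafly releases them with suppressed values and applies complementary suppression so that the suppressed tuples themselves satisfy the k requirement.

from collections import Counter

def core_datafly(table, qi, k, dgh):
    # Assumes |table| >= k, as in Figure 8.
    rows = [dict(r) for r in table]
    level = {a: 0 for a in qi}

    def freq():
        return Counter(tuple(r[a] for a in qi) for r in rows)

    # Step 2: while sequences occurring < k times account for > k tuples,
    # generalize the attribute with the most distinct values (step 2.1).
    while sum(c for c in freq().values() if c < k) > k:
        distinct = {a: len({r[a] for r in rows}) for a in qi}
        aj = max(distinct, key=distinct.get)
        f = dgh[aj][level[aj]]         # next level up for attribute aj
        for r in rows:
            r[aj] = f(r[aj])
        level[aj] += 1

    # Steps 3-5: suppress (here: drop) the sequences still occurring
    # fewer than k times and return the generalized table.
    small = {seq for seq, c in freq().items() if c < k}
    return [r for r in rows if tuple(r[a] for a in qi) not in small]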
Let the hierarchies in Figure 2, Figure 3 and Figure 6 for the quasi-identifier
QI={Race, BirthDate, Gender, ZIP}, the table PT in Figure 7, and a k-anonymity
constraint of k=2 be provided to Datafly. Figure 9A shows the contents of freq
after step 1. Birthdate has the most number of distinct values (12) so its values
are generalized. Figure 9B shows the contents of freq after step 2. Tuples t7 and
t8 will be suppressed. Table MGT in Figure 10 is the final result. MGT[QI]
satisfies k-anonymity for k=2 with Prec(MGT[QI]) = 0.75. However, from
Example 7 and Figure 7, GT is a k-minimal distortion for PT with
Prec(GT[QI])=0.90. So, Datafly distorted the results more than needed.
A: freq after step 1 (distinct values: Race 2, BirthDate 12, Gender 2, ZIP 3)
Race   BirthDate  Gender  ZIP    #occurs
black  9/20/65    male    02141  1  t1
black  2/14/65    male    02141  1  t2
black  10/23/65   female  02138  1  t3
black  8/24/65    female  02138  1  t4
black  11/7/64    female  02138  1  t5
black  12/1/64    female  02138  1  t6
white  10/23/64   male    02138  1  t7
white  3/15/65    female  02139  1  t8
white  8/13/64    male    02139  1  t9
white  5/5/64     male    02139  1  t10
white  2/13/67    male    02138  1  t11
white  3/21/67    male    02138  1  t12

B: freq after step 2 (distinct values: Race 2, BirthDate 3, Gender 2, ZIP 3)
Race   BirthDate  Gender  ZIP    #occurs
black  1965       male    02141  2  t1,t2
black  1965       female  02138  2  t3,t4
black  1964       female  02138  2  t5,t6
white  1964       male    02138  1  t7
white  1965       female  02139  1  t8
white  1964       male    02139  2  t9,t10
white  1967       male    02138  2  t11,t12

Figure 9 Intermediate stages of the core Datafly algorithm
Race   BirthDate  Gender  ZIP    Problem
black  1965       male    02141  short of breath
black  1965       male    02141  chest pain
black  1965       female  02138  painful eye
black  1965       female  02138  wheezing
black  1964       female  02138  obesity
black  1964       female  02138  chest pain
white  1964       male    02139  obesity
white  1964       male    02139  fever
white  1967       male    02138  vomiting
white  1967       male    02138  back pain

Figure 10 Table MGT resulting from Datafly, k=2, QI={Race, Birthdate, Gender, ZIP}
The core Datafly algorithm does not necessarily provide k-minimal generalizations
or k-minimal distortions, even though it can be proved that its solutions always
satisfy k-anonymity. One of the problems is that Datafly makes crude decisions –
generalizing all values associated with an attribute and suppressing all values
within a tuple. MinGen makes decisions at the cell-level and by doing so, can
provide results with more precision. Another problem is the heuristic that selects
the attribute with the greater number of distinct values as the one to generalize.
This may be computationally efficient, but can be shown to perform unnecessary
generalization. In summary, Datafly produces generalizations that satisfy k-anonymity, but such generalizations may not be k-minimal distortions.
4.2. The µ-Argus System
In µ-Argus, the data holder provides a value for k and specifies which attributes
are sensitive by assigning a value between 0 and 3 to each attribute. These
correspond to "not identifying," "most identifying," "more identifying," and
"identifying," respectively. µ-Argus then identifies rare and therefore unsafe
combinations by testing 2- and 3-combinations of attributes. Unsafe combinations
are eliminated by generalizing attributes within the combination and by cell
suppression. Rather than removing entire tuples, µ-Argus suppresses values at the
cell-level. The resulting data typically contain all the tuples and attributes of the
original data, though values may be missing in some cell locations.
Figure 11 presents the µ-Argus algorithm. Given a table PT(Ax,…,Ay), a
quasi-identifier QI={A1,…,An}, where {A1,…,An} ⊆ {Ax,…,Ay}, disjoint subsets of
QI known as Identifying, More, and Most where QI = Identifying ∪ More ∪ Most,
a k-anonymity constraint, and domain generalization hierarchies DGHAi, µ-Argus
produces a table MT which is a generalization of PT[QI].
The basic steps of the µ-Argus algorithm are provided in Figure 11. This
algorithm results from my reverse engineering an implementation [9].
Shortcomings of the actual µ-Argus implementation were found. So, results from
both are reported. In general, the constructed µ-Argus algorithm generates
solutions that are better protected than those released by the actual program.
The program begins in step 1 by constructing freq, which is a frequency list
containing distinct sequences of values from PT[QI], along with the number of
occurrences of each sequence. In step 2, the values of each attribute are
automatically generalized until each value associated with an attribute in the
quasi-identifier QI appears at least k times. This is a necessary condition for k-anonymity (see Theorem 1). In step 3, the program automatically tests
combinations of attributes to identify those combinations of attributes whose
assigned values in combination do not appear at least k times; these combinations
are stored in outliers. Afterwards, the data holder, in steps 4 and 5, decides
whether to generalize an attribute in QI that has values in outliers and if so, selects
the attribute to generalize. Finally, in step 6, µ-Argus automatically suppresses a
value from each combination in outliers.
Precedence is given to the value
occurring most often in order to reduce the total number of suppressions.
Input: Private Table PT; quasi-identifier QI = (A1, …, An),
disjoint subsets of QI known as Identifying, More, and
Most where QI = Identifying ∪ More ∪ Most, k constraint;
domain generalization hierarchies DGHAi, where i=1,…,n.
Output: MT containing a generalization of PT[QI]
Assumes: |PT |≥ k
Method:
1. freq ← a frequency list containing distinct sequences of values of PT[QI],
along with the number of occurrences of each sequence.
2. Generalize each Ai∈QI in freq until its assigned values satisfy k.
3. Test 2- and 3- combinations of Identifying, More and Most and let outliers
store those cell combinations not having k occurrences.
4. Data holder decides whether to generalize an Aj∈QI based on outliers and if
so, identifies the Aj to generalize. freq contains the generalized result.
5. Repeat steps 3 and 4 until the data holder no longer elects to generalize.
6. Automatically suppress a value having a combination in outliers, where
precedence is given to the value occurring in the most number of
combinations of outliers.
Figure 11 µ-Argus algorithm
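As an illustration of step 3, the Python sketch below records, for each tuple, every 2- and 3-combination of quasi-identifier attributes on which its values occur fewer than k times, mirroring the per-tuple outliers sets of Figure 13; the names are mine, and unlike the actual program it tests all combinations rather than only those listed in Figure 12.

from collections import Counter
from itertools import combinations

def find_outliers(table, qi, k):
    # outliers[i] lists the attribute combinations on which tuple i is rare.
    outliers = {i: [] for i in range(len(table))}
    for size in (2, 3):
        for attrs in combinations(qi, size):
            counts = Counter(tuple(r[a] for a in attrs) for r in table)
            for i, r in enumerate(table):
                if counts[tuple(r[a] for a in attrs)] < k:
                    outliers[i].append(attrs)
    return outliers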
One shortcoming of the actual µ-Argus implementation appears in step 3 of Figure
11. The actual program does not test all 2- and 3- combinations; this may be a
programming error. Figure 12 reports which combinations µ-Argus tests. Six
combinations (not listed) are not tested at all. It is easy to have tables in which
values appear in combinations not examined by µ-Argus.
Combinations Always Tested
Identifying × More × Most, Identifying × Most × Most, Most × Most × Most,
Identifying × More, Identifying × Most, More × Most, Most × Most
Combinations Tested only if |Identifying| > 1
More × More × Most, Most × Most × More, More × More
Figure 12 Combinations of More, Most, Identifying tested by µ-Argus
Let the hierarchies in Figure 2, Figure 3 and Figure 6 for the quasi-identifier
QI={Race, BirthDate, Gender, ZIP} where Most = {BirthDate}, More = {Gender,
ZIP} and Identifying = {Race}, the table PT in Figure 7, and a k-anonymity
constraint of k=2 be provided to µ-Argus. In Figure 13, V shows the result of
testing Most × More at step 3, and freq is updated to show {BirthDate, ZIP} for t8
did not satisfy k. Figure 14 shows freq before step 6; the values to be suppressed
appear in [brackets]. Table MT in Figure 15 is the final result from the µ-Argus
algorithm provided in Figure 11. The k-anonymity requirement is not enforced on
suppressed values, making it vulnerable to linking as well as to an inference attack
using summary data8. For example, knowing the total number of men and women
allows the suppressed values for Gender to be inferred.
V: result of testing Most × More (BirthDate × ZIP)
Birth  ZIP    occurs  sid         outliers
1965   02141  2       {t1,t2}     {}
1965   02138  2       {t3,t4}     {}
1964   02138  3       {t5,t6,t7}  {}
1965   02139  1       {t8}        {}
1964   02139  2       {t9,t10}    {}
1967   02138  2       {t11,t12}   {}

freq: updated with the failing combination recorded for t8
Race   Birth  Sex     ZIP    occurs  sid        outliers
black  1965   male    02141  2       {t1,t2}    {}
black  1965   female  02138  2       {t3,t4}    {}
black  1964   female  02138  2       {t5,t6}    {}
white  1964   male    02138  1       {t7}       {}
white  1965   female  02139  1       {t8}       {{birth,zip}}
white  1964   male    02139  2       {t9,t10}   {}
white  1967   male    02138  2       {t11,t12}  {}

Figure 13 Most × More combination test and resulting freq
Race   Birth   Sex       ZIP    occurs  sid        outliers
black  1965    male      02141  2       {t1,t2}    {}
black  1965    female    02138  2       {t3,t4}    {}
black  1964    female    02138  2       {t5,t6}    {}
white  [1964]  male      02138  1       {t7}       {{birth,sex,zip}, {race,birth,zip}}
white  [1965]  [female]  02139  1       {t8}       {{birth,zip}, {sex,zip}, {birth,sex,zip}, {race,birth,sex}, {race,birth,zip}, {race,sex}, {race,birth}}
white  1964    male      02139  2       {t9,t10}   {}
white  1967    male      02138  2       {t11,t12}  {}

Figure 14 freq before suppression
MT
id   Race   BirthDate     Gender        ZIP
t1   black  1965          male          02141
t2   black  1965          male          02141
t3   black  1965          female        02138
t4   black  1965          female        02138
t5   black  1964          female        02138
t6   black  1964          female        02138
t7   white  (suppressed)  male          02138
t8   white  (suppressed)  (suppressed)  02139
t9   white  1964          male          02139
t10  white  1964          male          02139
t11  white  1967          male          02138
t12  white  1967          male          02138

MTactual
id   Race   BirthDate     Gender  ZIP
t1   black  1965          male    02141
t2   black  1965          male    02141
t3   black  1965          female  02138
t4   black  1965          female  02138
t5   black  1964          female  02138
t6   black  1964          female  02138
t7   white  1964          male    02138
t8   white  (suppressed)  female  02139
t9   white  1964          male    02139
t10  white  1964          male    02139
t11  white  1967          male    02138
t12  white  1967          male    02138

Figure 15 Results from the µ-Argus algorithm and from the program
The actual µ-Argus program provides MTactual shown in Figure 15. The tuple
identified as t7 is ["white", "1964", "male", "02138"], which is unique over
MTactual[QI]. Therefore, MTactual does not satisfy the requirement for k=2.
8. See [3] for a detailed discussion of attacks on k-anonymity.
A shortcoming of µ-Argus stems from not examining all combinations of the
attributes in the quasi-identifier. Only 2- and 3- combinations are examined.
There may exist 4-combinations or larger that are unique. Directly extending µ-Argus to compute on all combinations loses its computational efficiency. So,
generalizations from µ-Argus may not always satisfy k-anonymity, even though all
generalizations from Datafly do satisfy k-anonymity.
Both algorithms may provide generalizations that are not k-minimal
distortions because they both enforce generalization at the attribute level. This
renders crude Prec measures. There may exist values in the table that when
generalized at the cell level, satisfy k without modifying all values in the attribute.
In summary, more work is needed to correct these heuristic-based approaches.
Acknowledgments
Vicenc Torra and Josep Domingo provided the opportunity to write this paper.
Jon Doyle gave editorial suggestions on a very early draft. Pierangela Samarati
recommended I engage the material formally and started me in that direction in
1998. Finally, I thank the corporate and government members of the Laboratory
for International Data Privacy at Carnegie Mellon University for the opportunity
to work on real-world data anonymity problems.
References
1. L. Sweeney. Information Explosion. In Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies, L. Zayatz, P. Doyle, J. Theeuwes and J. Lane (eds), Urban Institute, Washington, DC, 2001.
2. L. Sweeney. Uniqueness of Simple Demographics in the U.S. Population, LIDAP-WP4. Carnegie Mellon University, Laboratory for International Data Privacy, Pittsburgh, PA, 2000. Forthcoming book entitled The Identifiability of Data.
3. L. Sweeney. k-Anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10 (5), 2002; 557-570.
4. T. Dalenius. Finding a needle in a haystack – or identifying anonymous census records. Journal of Official Statistics, 2 (3), 1986; 329-336.
5. L. Sweeney. Guaranteeing anonymity when sharing medical data, the Datafly system. Proceedings, Journal of the American Medical Informatics Association. Washington, DC: Hanley & Belfus, Inc., 1997.
6. A. Hundepool and L. Willenborg. µ- and τ-argus: software for statistical disclosure control. Third International Seminar on Statistical Confidentiality. Bled, 1996.
7. J. Ullman. Principles of Database and Knowledge Base Systems. Computer Science Press, Rockville, MD, 1988.
8. L. Sweeney. Uniqueness of Simple Demographics in the U.S. Population, LIDAP-WP4. Carnegie Mellon University, Laboratory for International Data Privacy, Pittsburgh, PA, 2000. Forthcoming book entitled The Identifiability of Data.
9. L. Sweeney. Computational Disclosure Control: A primer on data privacy protection. Ph.D. Thesis, Massachusetts Institute of Technology, 2001.