Anonymization of Longitudinal Electronic Medical Records

advertisement
IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 16, NO. 3, MAY 2012
413
Anonymization of Longitudinal Electronic
Medical Records
Acar Tamersoy, Grigorios Loukides, Mehmet Ercan Nergiz, Yucel Saygin, and Bradley Malin
Abstract—Electronic medical record (EMR) systems have enabled healthcare providers to collect detailed patient information
from the primary care domain. At the same time, longitudinal data
from EMRs are increasingly combined with biorepositories to generate personalized clinical decision support protocols. Emerging
policies encourage investigators to disseminate such data in a deidentified form for reuse and collaboration, but organizations are
hesitant to do so because they fear such actions will jeopardize patient privacy. In particular, there are concerns that residual demographic and clinical features could be exploited for reidentification
purposes. Various approaches have been developed to anonymize
clinical data, but they neglect temporal information and are, thus,
insufficient for emerging biomedical research paradigms. This paper proposes a novel approach to share patient-specific longitudinal data that offers robust privacy guarantees, while preserving data utility for many biomedical investigations. Our approach
aggregates temporal and diagnostic information using heuristics
inspired from sequence alignment and clustering methods. We
demonstrate that the proposed approach can generate anonymized
data that permit effective biomedical analysis using several patient
cohorts derived from the EMR system of the Vanderbilt University
Medical Center.
Index Terms—Anonymization, data privacy, electronic medical
records (EMRs), longitudinal data.
I. INTRODUCTION
DVANCES in health information technology have facilitated the collection of detailed, patient-level clinical data
to enable efficiency, effectiveness, and safety in healthcare operations [1]. Such data are often stored in electronic medical record
(EMR) systems [2], [3] and are increasingly repurposed to support clinical research (see, e.g., [4]–[7]). Recently, EMRs have
been combined with biorepositories to enable genome-wide association studies (GWAS) with clinical phenomena in the hopes
of tailoring healthcare to genetic variants [8]. To demonstrate
feasibility, EMR-based GWAS have focused on static phenotypes; i.e., where a patient is designated as disease positive
A
Manuscript received April 7, 2011; revised September 27, 2011; accepted
January 16, 2012. Date of publication January 27, 2012; date of current version
May 4, 2012. This work was supported by the National Institutes of Health
under Grant U01HG006378, Grant U01HG006385, and Grant R01LM009989,
and by a Fellowship from the Royal Academy of Engineering Research.
A. Tamersoy and B. Malin are with the Department of Biomedical Informatics, Vanderbilt University, Nashville, TN 37232 USA (e-mail:
acar.tamersoy@vanderbilt.edu; b.malin@vanderbilt.edu).
G. Loukides is with the School of Computer Science and Informatics, Cardiff
University, Cardiff, CF24 3AA, U.K. (e-mail: g.loukides@cs.cf.ac.uk).
M. E. Nergiz is with the Department of Computer Engineering, Zirve University, Gaziantep 27260, Turkey (e-mail: mehmet.nergiz@zirve.edu.tr).
Y. Saygin is with the Department of Computer Science and Engineering, Sabanci University, Istanbul 34956, Turkey (e-mail: ysaygin@sabanciuniv.edu).
Digital Object Identifier 10.1109/TITB.2012.2185850
or negative (see, e.g., [9]–[11]). As these studies mature, they
will support personalized clinical decision support tools [12]
and will require longitudinal data to understand how treatment
influences a phenotype [13], [14].
Meanwhile, there are challenges to conducting GWAS on a
scale necessary to institute changes in healthcare. First, to generate appropriate statistical power, scientists may require access
to populations larger than those available in local EMR systems [15]. Second, the cost of a GWAS—incurred in the setup
and application of software to process medical records as well
as in genome sequencing—is nontrivial [16]. Thus, it can be
difficult for scientists to generate novel, or validate published,
associations. To mitigate this problem, the U.S. National Institutes of Health (NIH) encourages investigators to share data
from NIH-supported GWAS [17] into the Database of Genotypes and Phenotypes (dbGaP) [18].
This, however, may lead to privacy breaches if patients’ clinical or genomic information is associated with their identities. As
a first line of defense against this threat, the NIH recommends
investigators deidentify data by removing an enumerated list of
attributes that could identify patients (e.g., personal names and
residential addresses) prior to dbGaP submission [19]. However, a patients’ DNA may still be reidentified via residual demographics [20] and clinical information (e.g., standardized
International Classification of Diseases (ICD) codes) [21], as
illustrated in the following example.
Example 1: Each record in Fig. 1(a) corresponds to a fictional
deidentified patient and is comprised of ICD codes, patient’s age
when a code was received, and a DNA sequence. For instance,
the second record denotes that a patient was diagnosed with
benign essential hypertension (code 401.1) at ages 38 and 40 and
has the DNA sequence GC . . . A. The clinical and genomic data
are derived from an EMR system and a research project beyond
primary care, respectively. Publishing the data of Fig. 1(a) could
allow a hospital employee with access to the EMR to associate
Jane with her DNA sequence. This is because the identified
record, shown in Fig. 1(b), can only be linked to the second
record in Fig. 1(a) based on the ICD code 401.1 and ages 38
and 40.
Methods to mitigate reidentification via demographic and
clinical features [22], [23] have been proposed, but they are not
applicable to the longitudinal scenario. These methods assume
the clinical profile is devoid of temporal or replicated diagnosis
information. Consequently, these methods produce data that are
unlikely to permit meaningful longitudinal investigations.
In this paper, we propose the first approach to formally
anonymize longitudinal patient records. Our work makes the
following specific contributions.
1089-7771/$31.00 © 2012 IEEE
414
Fig. 1.
IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 16, NO. 3, MAY 2012
Longitudinal data privacy problem. (a) Longitudinal research data. (b) Identified EMR. (c) 2-anonymization based on the proposed approach.
1) We propose a framework to transform each longitudinal
patient record into a form that is indistinguishable from at
least k − 1 other records. This is achieved by iteratively
clustering records and applying generalization, which replaces ICD codes and age values with more general values,
and suppression, which removes ICD codes and age values. For example, applying our approach with k = 2 to
the data of Fig. 1(a) will generate the anonymized data
of Fig. 1(c). Observe that Jane’s record is now linked to
two DNA sequences because the diagnosis code 401.1 in
the first pair of this record has been replaced by the more
general code 401.
2) We evaluate our approach with several cohorts of patient
records from the Vanderbilt University Medical Center
(VUMC) EMR system. Our results demonstrate that the
anonymized data produced allow many studies focusing
on clinical case counts to be performed accurately.
The remainder of this paper is organized as follows. In
Section II, we review related research on anonymization and
its application to biomedical data. In Section III, we formalize
the notions of privacy and utility and the anonymization problem
addressed in this paper. We present our anonymization framework and discuss its extensions and limitations in Sections IV
and VI, respectively. Finally, Section V reports the experimental
results and Section VII concludes the paper.
A. Relational Data
We first review methods for preventing reidentification in
relational data (e.g., demographics), where records have a fixed
number of attributes and one value per attribute.
The first category of protection methods transforms attribute
values so that they no longer correspond to real individuals.
Popular approaches in this category are noise addition, data
swapping, and synthetic data generation (see [30]–[32] for surveys). While such methods generate data that preserve aggregate
statistics (e.g., the average age), they do not guarantee data that
can be analyzed at the record level. This is a significant limitation
that hampers the ability to use these data in various biomedical
studies, including epidemiological studies [33] and GWAS [22].
In contrast, methods based on generalization and suppression
preserve data truthfulness [34]–[36]. Many of these methods
are based on a principle called k-anonymity [24], [34], which
states that each record of the published data must be equivalent
to at least k − 1 other records with respect to quasi-identifiers
(QI) (i.e., attributes that can be linked with external resources
for reidentification purposes) [37]. To enhance the utility of the
anonymized data, these methods employ various search strategies, including binary search [34], [35], clustering [38], [39],
evolutionary search [40], and partitioning [36]. There are methods that have been successfully applied to biomedical data [35],
[41].
B. Transactional Data
II. RELATED RESEARCH
Reidentification concerns for clinical data via seemingly innocuous attributes were first raised in [24]. Specifically, it was
shown that patients could be uniquely reidentified by linking
publicly available voter registration lists to hospital discharge
summaries via demographics, such as date of birth, gender, and
five-digit residential zip code. The reidentification phenomenon
has since attracted interest in domains beyond healthcare, and
numerous techniques to guard against attacks have emerged
(see [25] and [26] for surveys). In this section, we survey research related to privacy-preserving data publishing, with a focus on biomedical data. We note that the reidentification problem
is not addressed by access control and encryption-based methods [27]–[29] because data need to be shared beyond a small
number of authorized recipients.
Next, we turn our attention to approaches that deal with more
complex data. Specifically, we consider transactional data, in
which records have a large and variable number of values per
attribute (e.g., diagnosis codes assigned to a patient during a
hospital visit). Transactional data can also facilitate reidentification in the biomedical domain. For instance, deidentified
clinical records can be linked to patients based on combinations
of diagnosis codes that are additionally contained in publicly
available hospital discharge summaries and EMR systems from
which the records have been derived [21]. As it was shown
in [21], more than 96% of 2700 patient records, collected in
the context of a GWAS, would be susceptible to reidentification
based on diagnosis codes if shared without additional controls.
From the protection perspective, there are several approaches
that have been developed to anonymize transactional data.
TAMERSOY et al.: ANONYMIZATION OF LONGITUDINAL ELECTRONIC MEDICAL RECORDS
For instance, Terrovitis et al. [42] proposed k m -anonymity, a
principle, and several heuristic algorithms to prevent attackers
from linking an individual to less than k records. This model assumes that the adversary knows at most m attribute values of any
transaction. To anonymize patient records in transactional form,
Loukides et al. [22] introduced a privacy principle to ensure
that sets of potentially identifying diagnosis codes are protected
from reidentification, while remaining useful for GWAS validations. To enforce this principle, they proposed an algorithm
that employs generalization and suppression to group semantically close diagnosis codes together in a way that enhances data
utility [22], [23]. Additionally, Tamersoy et al. [43] considered
protecting data in which a certain diagnosis code may occur
multiple times in a patient record. They designed an algorithm
which preserves patients’ privacy through suppressing a subset
of the replications of a diagnosis code.
Our work differs from the aforementioned research along two
principal dimensions. First, we address reidentification in longitudinal data publishing. Second, contrary to the approaches
in [22] and [23] which group diagnosis codes together, our
framework is based on grouping of records, which has been
shown to be highly effective in retaining data utility due to the direct identification of records being anonymized [38], [39], [44].
C. Spatiotemporal Data
Spatiotemporal data are related to the problem studied in
this paper. They are time and location dependent, and these
unique characteristics make them challenging to protect against
reidentification. Such data are typically produced as a result of
queries issued by mobile subscribers to location-based service
providers, who, in turn, supply information services based on
specific physical locations.
The principle of k-anonymity has been extended to
anonymize spatiotemporal data. Abul et al. [45] proposed a
technique to group at least k objects that correspond to different subscribers and appear within a certain radius of the path
of every object in the same time period. In addition to generalization and suppression, Abul et al. [45] considered adding
noise to the original paths so that objects appear at the same
time and spatial trajectory volume. Assuming that the locations of subscribers constitute sensitive information, Terrovitis
and Mamoulis [46] proposed a suppression-based methodology to prevent attackers from inferring these locations. Finally,
Nergiz et al. [44] proposed an approach that employs kanonymity, enforced using generalization, together with reconstruction to protect data against boundary-based attacks. Our
heuristics are inspired from [44]; however, we employ both generalization and suppression to further enhance data utility, and
we do not use reconstruction, so as to preserve data truthfulness.
The aforementioned approaches are developed for anonymizing spatiotemporal data and cannot be applied to longitudinal
data due to different semantics. Specifically, the data we consider
record patients’ diagnoses and not their locations. Consequently,
the objective of our approach is not to hide the locations of patients, but to prevent reidentification based on their diagnosis
and time information.
Fig. 2.
415
General architecture of the longitudinal data anonymization process.
III. BACKGROUND AND PROBLEM FORMULATION
This section begins with a high-level overview of the proposed approach. Next, we present the notation and the definitions for the privacy and adversarial models, the data transformation strategies, and the information loss metrics. We conclude
the section with a formal problem description.
A. Architectural Overview
Fig. 2 provides an overview of the data anonymization process. The process is initiated when the data owner supplies
the following information: 1) a dataset of longitudinal patient
records, each of which consists of (ICD, Age) pairs and a
DNA sequence, and 2) a parameter k that expresses the desired
level of privacy. Given this information, the process invokes our
anonymization framework. To satisfy the k-anonymity principle, our framework forms clusters of at least k records of the
original dataset, which are modified using generalization and
suppression.
B. Notation
A dataset D consists of longitudinal records of the form
T, DN AT , where T is a trajectory1 and DN AT is a genomic
sequence. Each trajectory corresponds to a distinct patient in D
and is a multiset2 of pairs (i.e., T = {t1 , . . . , tm }) drawn from
two attributes, namely ICD and Age [i.e., ti = (u ∈ ICD, v ∈
Age)], which contain the diagnosis codes assigned to a patient
and their age, respectively. |D| denotes the number of records
in D and |T | the length of T , defined as the number of pairs
in T . We use the “.” operator to refer to a specific attribute
value in a pair (e.g., ti .icd or ti .age). To study the data temporally, we order the pairs in T with respect to Age, such that
ti−1 .age ≤ ti .age.
C. Adversarial Model
We assume that an adversary has access to the original dataset
D, such as in Fig. 1(a). An adversary may perform a reidentification attack in several ways.
1) Using identified EMR data: The adversary links D with
the identified EMR data, such as those of Fig. 1(b), based
1 We use the term trajectory since the diagnosis codes at different ages can be
seen as a route for the patient throughout his/her life.
2 Contrary to a set, a multiset can contain an element more than once.
416
Fig. 3.
IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 16, NO. 3, MAY 2012
Example of the DGH structure for Age.
on (ICD, Age) pairs. This scenario requires the adversary
to have access to the identified EMR data, which is the
case of an employee of the institution from which the
longitudinal data were derived.
2) Using publicly available hospital discharge summaries
and identified resources: The adversary first links D with
hospital discharge summaries based on (ICD, Age) pairs
to associate patients with certain demographics. In turn,
these demographics are exploited in another linkage with
public records, such as voter registration lists, which contain identity information.
Note that in both cases, an adversary is able to link patients
to their DNA sequences, which suggests that a formal approach
to longitudinal data anonymization is desirable.
D. Privacy Model
The formal definition of k-anonymity in the longitudinal data
context is provided in Definition 1. Since each trajectory often contains multiple (ICD, Age) pairs, it is difficult to know
which can be used by an adversary to perform reidentification
attacks. Thus, we consider the worst-case scenario in which any
combination of (ICD, Age) pairs can be exploited. Regardless,
k-anonymity limits an adversary’s ability to perform reidentification based on (ICD, Age) pairs, because each trajectory is
associated with no less than k patients.
Definition 1 (k-Anonymity): An anonymized dataset D̃, produced from D, is k-anonymous if each trajectory in D̃, projected
over QI, appears at least k times for any QI in D.
E. Data Transformation Strategies
Generalization and suppression are typically guided by a domain generalization hierarchy (DGH) (Definition 2) [47].
Definition 2 (DGH): A DGH for attribute A, referred to as
HA , is a partially ordered tree structure which defines valid
mappings between specific and generalized values of A. The
root of HA is the most generalized value of A and is returned
by a function root.
Example 2: Consider HAge in Fig. 3. The values in the domain
of Age (i.e., 33, 34, . . . , 40) form the leaves of HAge . These
values are then mapped to two, to four, and eventually to eightyear intervals. The root of HAge is returned by root(HAge ) as
[33 − 40].
Our approach does not impose any constraints on the structure
of an attribute’s DGH, such that the data owners have complete
freedom in its design. For instance, for ICD codes, data owners
can use the standard ICD-9-CM hierarchy.3 For ages, data owners can use a predefined hierarchy (e.g., the age hierarchy in the
HIPAA Safe Harbor Policy4 ) or design a DGH manually.5
According to Definition 3, each specific value of an attribute
generalizes to its direct ancestor in a DGH. However, a specific
value can be projected up multiple levels in a DGH via a sequence of generalizations. As a result, a generalized value Ai is
interpreted as any one of the leaf nodes in the subtree rooted by
Ai in HA .
Definition 3 (Generalization and Suppression): Given a node
Ai = root(HA ) in HA , generalization is performed using a
function f :Ai → Aj which replaces Ai with its direct ancestor Aj . Suppression is a special case of generalization and is
performed using a function g:Ai → Ar which replaces Ai with
root(HA ).
Example 3: Consider the last trajectory in Fig. 1(c). The first
pair (401.1, [39 − 40]) is interpreted as either (401.1, 39) or
(401.1, 40).
F. Information Loss
Generalization and suppression incur information loss because values are replaced by more general ones or eliminated.
To capture the amount of information loss incurred by these operations, we quantify the normalized loss for each ICD code and
Age value in a pair based on the Loss Metric (LM) (Definition
4) [40].
Definition 4 (LM): The information loss incurred by replacing
a node Ai with its ancestor Aj in HA is
A
j − Ai
LM(Ai , Aj ) =
|A|
(1)
where A
i and Aj denote the number of leaf nodes in the subtree
rooted by Ai and Aj in HA , respectively, and |A| denotes the
domain size of attribute A.
Example 4: Consider HAge in Fig. 3. The information loss
incurred by generalizing [33 − 34] to [33 − 36] is 4−2
8 = 0.25
because the leaf-level descendants of [33 − 34] are 33 and 34,
those of [33 − 36] are 33, 34, 35, and 36, and the domain of Age
consists of the values 33–40.
To introduce the combined LM, which captures the total LM
of replacing two nodes with their ancestor, provided in Definition 6, we use the notation of lowest common ancestor (LCA),
provided in Definition 5.
Definition 5 (LCA): The LCA A of nodes Ai and Aj in HA is
the farthest node (in terms of height) from root(HA ) such that (1)
Ai = A or f n (Ai ) = A and (2) Aj = A orf m (Aj ) = A ,
and is returned by a function lca.
Definition 6 (Combined LM): The combined LM of replacing
nodes Ai and Aj with their LCA A is
LM (Ai + Aj , A ) = LM (Ai , A ) + LM (Aj , A ).
3 More
(2)
information is available at http://www.cdc.gov/nchs/icd.htm
can be retained intact, while 90
or greater must be grouped together.
5 We further note that our approach is extendible to other categorical attributes,
such as SNOMED-CT and Date, provided that a DGH can be specified for each
of the attributes. Such extensions, however, are beyond the scope of this paper.
4 HIPAA Safe Harbor states all ages under 89
TAMERSOY et al.: ANONYMIZATION OF LONGITUDINAL ELECTRONIC MEDICAL RECORDS
417
Next, we define the LM for an anonymized trajectory (Definition 7) and dataset (Definition 8), which we keep separate for
each attribute.
Definition 7 (LM for an Anonymized Trajectory): Given an
anonymized trajectory T̃ and an attribute A, the LM with respect
to A is computed as
˜
LM(T̃ , A) =
|T |
LM(t̃i .A, t̃∗i .A)
(3)
i=1
where t̃∗i ·A denotes the value t̃i ·A is replaced with.
Definition 8 (LM for an Anonymized Dataset): Given an
anonymized dataset D̃ and an attribute A, the LM with respect
to attribute A is computed as
LM(D̃, A) =
1 LM(T̃ , A)
.
|D̃| ˜
|T̃ |
(4)
T ∈D̃
For clarity, we refer to an LM related to ICD and Age by
ILM and ALM, respectively (e.g., we use ILM(D̃) instead of
LM(D̃, ICD)).
G. Problem Statement
The longitudinal data anonymization problem is formally defined as follows.
Problem: Given a longitudinal dataset D, a privacy parameter k, and DGHs for attributes ICD and Age, construct an
anonymized dataset D̃, such that 1) D̃ is k-anonymous, 2) the
order of the pairs in each trajectory of D is preserved in D̃, and
3) ILM(D̃) + ALM(D̃) is minimized.
IV. ANONYMIZATION FRAMEWORK
In this section, we present our framework for longitudinal
data anonymization.
Many clustering algorithms can be applied to produce kanonymous data [48], [49]. This involves organizing records
into clusters of size at least k, which are anonymized together.
In the context of longitudinal data, the challenge is to define a
distance metric for trajectories such that a clustering algorithm
groups similar trajectories. We define the distance between two
trajectories as the cost (i.e., incurred information loss) of their
anonymization as defined by the LM. The problem then reduces
to finding an anonymized version T̃ of two given trajectories
such that ILM(T̃ ) + ALM(T̃ ) is minimized.
Finding an anonymization of two trajectories can be achieved
by finding a matching between the pairs of trajectories that
minimizes their cost of anonymization. This problem, which
is commonly referred to as sequence alignment, has been extensively studied in various domains, notably for the alignment
of DNA sequences to identify regions of similarity in a way
that the total pairwise edit distance between the sequences is
minimized [50], [51].
To solve the longitudinal data anonymization problem, we
propose Longitudinal Data Anonymizer, a framework that incorporates alignment and clustering as separate components, as
shown in Fig. 2. The objective of each component is summarized
as follows.
1) Alignment attempts to find a minimal cost pair matching
between two trajectories.
2) Clustering interacts with the Alignment component to create clusters of at least k records.
Next, we examine each component in detail and develop
methodologies to achieve their objectives.
A. Alignment
There are no directly comparable approaches to the method
we developed in this study. So, we introduce a simple heuristic, called Baseline, for comparison purposes. Given trajectories X = {x1 , . . . , xm } and Y = {y1 , . . . , yn }, ILM(X) and
ALM(X), and DGHs HICD and HAge , Baseline aligns X and
Y by matching their pairs on the same index.6
The pseudocode for Baseline is provided in Algorithm 1. This
algorithm initializes an empty trajectory T̃ to hold the output
of the alignment and then assigns ILM(X) and ALM(X) to
variables i and a, respectively (steps 1 and 2). Then, it determines the length of the shorter trajectory (step 3) and performs
pair matching (steps 4–9). Specifically, for the pairs of the trajectories that have the same index, Baseline constructs a pair
containing the LCAs of the ICD codes and Age values in these
pairs (step 5), appends the constructed pair to T̃ (step 6), and
updates i and a with the information loss incurred by the generalizations (steps 7–8). Next, Baseline updates i and a with
the amount of information loss incurred by suppressing the ICD
codes and Age values from the unmatched pairs in the longer
trajectory (steps 10–14). Finally, this algorithm returns T̃ along
with i and a, which correspond to ILM(T̃ ) and ALM(T̃ ), respectively (step 15).
6 ILM(X ) and ALM(X ) are provided as input because X may already be an
anonymized version of two other trajectories.
418
IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 16, NO. 3, MAY 2012
To help preserve data utility, we provide Alignment using
Generalization and Suppression (A-GS), an algorithm that uses
dynamic programming to construct an anonymized trajectory
that incurs minimal cost.
Before discussing A-GS, we briefly discuss the application
of dynamic programming. The latter technique can be used to
solve problems based on combining the solutions to subproblems which are not independent and share subsubproblems [52].
A dynamic programming algorithm stores the solution of a subsubproblem in a table to which it refers every time the subsubproblem is encountered. To give an example, for trajectories
X = {x1 , . . . , xm } and Y = {y1 , . . . , yn }, a subproblem may
be to find a minimal cost pair matching between the first to
the jth pairs. A solution to this subproblem can be determined
using solutions for the following subsubproblems and applying
the respective operations:
1) Align X = {x1 , . . . , xj −1 } and Y = {y1 , . . . , yj −1 }, and
generalize xj with yj ;
2) Align X = {x1 , . . . , xj −1 } and Y = {y1 , . . . , yj }, and
suppress xj ;
3) Align X = {x1 , . . . , xj } and Y = {y1 , . . . , yj −1 }, and
suppress yj .
Each case is associated with a cost. Our objective is to find
an anonymized trajectory T̃ , such that ILM(T̃ ) + ALM(T̃ ) is
minimized, so we examine each possible solution and select the
one with minimum information loss.
A-GS uses a similar approach to align trajectories. The algorithm accepts the same inputs as Baseline as well as weights
wICD and wAge . The latter allow A-GS to control the information loss incurred by anonymizing the values of each attribute. The data owners specify the attribute weights such that
wICD ≥ 0, wAge ≥ 0, and wICD + wAge = 1. The pseudocode
for A-GS is provided in Algorithm 2.
In step 1, A-GS initializes three matrices: i, a, and r. The first
row (indexed 0) of each of these matrices corresponds to a null
value, and starting from index 1, each row corresponds to a value
in X. Similarly, the first column (indexed 0) of each of these
matrices corresponds to a null value, and starting from index
1, each column corresponds to a value in Y . Specifically, for
indices h and j, rh,j records which of the following operations
incurs minimum information loss: 1) generalizing xh and yj
(denoted with <>), 2) suppressing xh (denoted with <↑>),
and 3) suppressing yj (denoted with <←>). The entries in
ih,j and ah,j keep the total ILM and ALM for aligning the
subtrajectories Xsub = {x1 , . . . , xh } and Ysub = {y1 , . . . , yj },
respectively.
In step 2, A-GS assigns ILM(X) and ALM(X) to i0,0 and
a0,0 , respectively. We include null values in the rows and
columns of i, a, and r because at some point during alignment
A-GS may need to suppress some portion of the trajectories.
Therefore, in steps 3–7 and 8–12, A-GS initializes i, a, and r for
the values in X and Y , respectively. Specifically, for indices h
and j, ih,0 and i0,j keep the ILM for suppressing every pair in the
subtrajectories Xsub = {x1 , . . . , xh } and Ysub = {y1 , . . . , yj },
respectively. Similar reasoning applies to matrix a. The first row
and column of r holds <↑> and <←> for suppressing values
from X and Y , respectively.
In steps 13–25, A-GS performs dynamic programming.
Specifically, for indices h and j, A-GS determines a minimal
cost pair matching of the subtrajectories Xsub = {x1 , . . . , xh }
and Ysub = {y1 , . . . , yj } based on the three cases listed previously. Specifically, in steps 15–21, A-GS constructs two temporary arrays, c and g, to store the ILM and ALM for each
possible solution, respectively. Next, in steps 22 and 23, AGS determines the solution with the minimum information loss
and assigns the ILM, ALM, and operation associated with the
TAMERSOY et al.: ANONYMIZATION OF LONGITUDINAL ELECTRONIC MEDICAL RECORDS
Fig. 4.
419
Example of the hypertension subtree in the ICD DGH.
Fig. 5. Matrices i, a, and r for the subset of records from Fig. 1. This alignment
uses the DGHs in Figs. 3 and 4 and assumes that w IC D = w A g e = 0.5.
solution to ih,j , ah,j , and rh,j , respectively. If there is a tie
between the solutions, A-GS selects generalization as the operation for the sake of retaining more information.
In steps 26–36, A-GS constructs the anonymized trajectory
T̃ by traversing the matrix r. Specifically, for two pairs in the
trajectories, if generalization incurs minimum information loss,
A-GS appends to T̃ a pair containing the LCAs of the ICD
codes and Age values in these pairs. The unmatched pairs in
the trajectories are ignored during this process because A-GS
suppresses these pairs. Finally, in step 37, Baseline returns T̃
along with im ,n and am ,n , which correspond to ILM(T̃ ) and
ALM(T̃ ), respectively.
Example 5: Consider applying A-GS to T1 and T4 in Fig. 1(a)
using the DGHs shown in Figs. 3 and 4 and assuming that
wICD = wAge = 0.5. The matrices i, a, and r are illustrated
in Fig. 5. As T1 and T4 are not anonymized, we initialize
i0,0 = a0,0 = 0. Subsequently, A-GS computes the values for
the entries in the first row and column of the matrices. For instance, i0,3 keeps the ILM for suppressing all ICD codes from T1
and has a value of 1 + (1 ∗ 0.5) = 1.5. This is computed by summing the ILM for suppressing the first two ICD codes (i.e., the
value stored in i0,2 ) with the weight-adjusted ILM for suppressing the third ICD code. Then, A-GS performs dynamic programming. The process starts with aligning T1,sub = {(401.1, 33)}
and T4,sub = {(401.9, 33)}. The possible solutions for this subproblem are as follows.
1) Align T1,sub = {∅} and T4,sub = {∅}, and generalize
401.1 with 401.9 and 33 with 33.
2) Align T1,sub = {(401.1, 33)} and T4,sub = {∅}, and suppress 401.9 and 33.
3) Align T1,sub = {∅} and T4,sub = {(401.9, 33)}, and suppress 401.1 and 33.
The ILM and ALM for the subsubproblem in the first solution
are stored in i0,0 and a0,0 , respectively. Generalizing 401.1 with
401.9 has an ILM of (1 + 1) ∗ 0.5 = 1, and generalizing 33 with
33 has an ALM of 0. Therefore, the first solution has a total LM
of 1. The ILM and ALM for the subsubproblem in the second
solution are stored in i0,1 and a0,1 , respectively. The suppression
of 401.9 and 33 has an ILM and ALM of 1 ∗ 0.5 = 0.5. It can
be seen that the second and third solutions each have a total LM
of 2. The solution with the minimum information loss is the first
one; hence, A-GS stores 1, 0 and <> in i1,1 , a1,1 and r1,1 ,
respectively. After the values for the remaining entries are computed, A-GS uses r to construct the anonymized trajectory T̃ .
The process starts with examining the bottom-right entry, which
denotes a generalization. As a result, A-GS appends (401.1, 35)
to T̃ . The process continues by following the symbols, such
that A-GS returns T̃ = {(401.1, 33), (401.1, 34), (401.1, 35)}
along with i4,3 and a4,3 , which correspond to ILM(T̃ ) and
ALM(T̃ ), respectively.
B. Clustering
We base our methodology for the clustering component on the
maximum distance to average vector (MDAV) algorithm [53],
[54], an efficient heuristic for k-anonymity. The pseudocode
for MDAV7 and its helper function, formCluster, are provided
in Algorithms 3 and 4, respectively. MDAV iteratively selects
the most frequent trajectory in a longitudinal dataset (steps 3
and 12), finds its most distant trajectory X (steps 4 and 13),
and forms a cluster of k trajectories around the latter trajectory
7 We
refer to our algorithm as MDAV to avoid confusion.
420
IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 16, NO. 3, MAY 2012
TABLE I
DESCRIPTIVE SUMMARY STATISTICS OF THE DATASETS
(steps 5 and 6 and 14 and 15). Cluster formation is performed
by formCluster, a function that constructs a cluster C by aligning n trajectories in a consecutive manner and returns C, along
with ILM(C) and ALM(C) which are the ILM and ALM for
the anonymized trajectory resulting from the alignment, respectively. MDAV minimizes intercluster similarity by constructing
a cluster around a trajectory Y that is most distant to X (steps
7–9). We define the distance between two trajectories as the cost
of their anonymization. As such, the most distant trajectory is
the one that maximizes the sum of ILM and ALM returned from
A-GS.
A similar reasoning applies when we form a cluster. We add
the trajectory that minimizes the sum of ILM and ALM returned
from A-GS. MDAV forms a final cluster of size at least k
using the remaining trajectories in the dataset (steps 17–19) and
returns D̃, a k-anonymized version of the longitudinal dataset,
along with ILM(D̃) and ALM(D̃) (step 20).
V. EXPERIMENTAL EVALUATION
This section presents an experimental evaluation of the
anonymization framework. We compare the anonymization
methods on data utility, as indicated by the LM measure (see
Section V-B) and aggregate query answering accuracy (see
Section V-C). Furthermore, we show that our method allows
balancing the level of information loss incurred by anonymizing
ICD codes and Age values (see Section V-D). This is important
to support different types of biomedical studies, such as geriatric
and epidemiology studies that are supported “well” when the information contained in Age and ICD attributes, respectively, is
preserved in the anonymized data.
A. Experimental Setup
We worked with three datasets derived from the Synthetic
Derivative (SD), a collection of deidentified information extracted from the EMR system of the VUMC [55]. We issued a
query to retrieve the records of patients whose DNA samples
were genotyped and stored in BioVU, VUMC’s DNA repository linked to the SD. Then, using the phenotype specification
in [56], we identified the patients eligible to participate in a
GWAS on native electrical conduction within the ventricles of
Pop
by
the heart. Subsequently, we created a dataset called D50
Fig. 6.
Comparison of information loss for D 5P0o p using various k values.
restricting our query to the 50 most frequent ICD codes that
occur in at least 5% of the records in BioVU. Next, we created a
Pop
, containing the
dataset called D4Pop , which is a subset of D50
following comorbid ICD codes selected for [43]: 250 (diabetes
mellitus), 272 (disorders of lipoid metabolism), 401 (essential
hypertension), and 724 (other and unspecified disorders of the
back). Finally, we created a dataset called D4Sm p , which is a
subset of D4Pop , containing the records of patients who actually
participated in the aforementioned GWAS [57]. D4Sm p is expected to be deposited into the dbGaP repository and has been
used in [43] with no temporal information. The characteristics
of our datasets are summarized in Table I.
Throughout our experiments, we varied k between 2 and 15,
noting that k = 5 tends to be applied in practice [41]. Initially,
we set wICD = wAge = 0.5. We implemented all algorithms
in Java and conducted our experiments on an Intel 2.8 GHz
powered system with 4-GB RAM.
B. Capturing Data Utility Using LM
We first compared the algorithms with respect to the LM.
Pop
. The ILM and ALM inFig. 6 depicts the results with D50
crease with k for both algorithms, which is expected because
as k increases, a larger amount of distortion is needed to satisfy a stricter privacy requirement. Note that Baseline incurred
substantially more information loss than A-GS for all k. In
fact, Baseline failed to construct a practically useful result when
k > 2, as it suppressed all values from the dataset. Similar trends
were observed between A-GS and Baseline for D4Pop and D4Sm p
(omitted for brevity).
Interestingly, A-GS, on average, incurred 48% less informaPop
. This is important because a relation loss on D4Pop than D50
tively small number of ICD codes may suffice to study a range
of different diseases [11], [43]. It is also worthwhile to note that
the information loss incurred by our approach remains relatively
low (i.e., below 0.5) for D4Sm p , even though it is, on average,
55% more than that for D4Pop . This is attributed to the fact that
TAMERSOY et al.: ANONYMIZATION OF LONGITUDINAL ELECTRONIC MEDICAL RECORDS
Fig. 7. Comparison of query answering accuracy for D 5P0o p , D 4P o p , and D 4S m p
using various k values.
D4Sm p is more sparse than D4Pop , which implies it is more difficult to anonymize [58]. The ILM and ALM for different k values
and datasets, which correspond to the aforementioned results,
can be found in Appendix A.
421
Fig. 8. A comparison of information loss for D 5P0o p using various w IC D and
w A g e values.
result implies that data managers can use weights to achieve
desired utility for either attribute.
VI. DISCUSSION
C. Capturing Data Utility Using Average Relative Error
We next analyzed the effectiveness of our approach for supporting general biomedical analysis. We assumed a scenario in
which a scientist issues queries on anonymized data to retrieve
the number of trajectories that harbor a combination of (ICD,
Age) pairs that appear in at least 1% of the original trajectories. Such queries are typical in many biomedical data mining
applications [59]. To quantify the accuracy of answering such
a workload of queries, we used the Average Relative Error
(AvgRE) measure [36], which reflects the average number of
trajectories that are incorrectly included as part of the query
answers. Details about this measure are in Appendix B.
Fig. 7 shows the AvgRE scores of running A-GS on the
datasets. The results for Baseline are not reported because they
were more than 6 times worse than our approach for k = 2,
and the worst possible for k > 2. This is because Baseline suppressed all values. As expected, we find an increase in AvgRE
scores as k increases, which is due to the privacy/utility tradeoff.
Nonetheless, A-GS allows fairly accurate query answering on
each dataset by having an AvgRE score of less than 1 for all k.
The results suggest our approach can be effective, even when
a high level of privacy is required. Furthermore, we observe
that the AvgRE scores for D4Pop are lower than those of D4Sm p ,
Pop
. This implies that
which are in turn lower than those of D50
query answers are more accurate for small domain sizes and
large datasets.
D. Prioritizing Attributes
Finally, we investigated how configurations of attribute
weighting affect information loss. Fig. 8 reports the results for
Pop
and k = 2 when our algorithm is configured with weights
D50
ranging from 0.1 to 0.9. Observe that when wICD = 0.1 and
wAge = 0.9, A-GS distorted Age values much less than ICD
values. Similarly, A-GS incurred less information loss for ICD
than Age when we specified wICD = 0.9 and wAge = 0.1. This
In this section, we discuss how our approach can be extended
to prevent a privacy threat in addition to reidentification and its
limitations.
A. Attacks Beyond Reidentification
Beyond reidentification is the threat of sensitive itemset disclosure, in which a patient is associated with a set of diagnosis
codes that reveal some sensitive information (e.g., HIV status).
k-Anonymity does not guarantee prevention of sensitive itemset
disclosure, since a large number of records that are indistinguishable with respect to the potentially identifying diagnosis codes
can still contain the same sensitive itemset [59]. We note that
our approach can be extended to prevent this attack by controlling generalization and suppression to ensure that an additional
principle is satisfied, such as -diversity [60], which dictates
how sensitive information is grouped. This extension, however,
is beyond the scope of this paper.
B. Limitations
The proposed approach is limited in certain aspects, which
we highlight to suggest opportunities for further research. First,
our algorithm induces minimal distortion to the data in practice,
but it does not limit the amount of information loss incurred by
generalization and suppression. Designing algorithms that provide this type of guarantee is important to enhance the quality of
anonymized GWAS-related datasets, but is also computationally
challenging due to the large search spaces involved, particularly
for longitudinal data. Second, the approach we propose does
not guarantee that the released data remain useful for scenarios in which prespecified analytic tasks, such as the validation
of known GWAS [22], are known to data owners a priori. To
address such scenarios, we plan to design algorithms that take
the tasks for which data are anonymized into account during
anonymization.
422
IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 16, NO. 3, MAY 2012
VII. CONCLUSION AND FUTURE WORK
This study was motivated by the growing need to disseminate patient-specific longitudinal data in a privacy-preserving
manner. To the best of our knowledge, we introduced the first
approach to sharing such data while providing computational
privacy guarantees. Our approach uses sequence alignment and
clustering-based heuristics to anonymize longitudinal patient
records. Our investigations suggest that it can generate longitudinal data with a low level of information loss and remain useful
for biomedical analysis. This was illustrated through extensive
experiments with data derived from the EMRs of thousands of
patients. The approach is not guided by specific utility (e.g.,
satisfaction of GWAS validation), but we are confident it can be
extended to support such endeavors.
APPENDIX A
INFORMATION LOSS INCURRED BY ANONYMIZING
THE DATASETS USING A-GS
pair (401.1, [39 − 40]). Furthermore, observe that 401.1 is a
leaf node in Fig. 4; hence, the set of possible ICD codes is
{401.1}. Similarly, the subtree rooted by [39 − 40] in Fig. 3
consists of two leaf nodes, hence the set of possible Age
values is {39, 40}. Therefore, there are two possible pairs:
{(401.1, 39), (401.1, 40)}, and the probability that one of the
trajectories was originally harboring (401.1, 39) is 12 . Then, an
approximate answer for the query is computed as 12 × 2 = 1.
The Relative Error (RE) for an arbitrary query Q is computed as RE(Q) = |a(Q)−e(Q)|/a(Q). For instance, the RE for
the previous example query is |1 − 1|/1 = 0 since the original
dataset in Fig. 1(a) contains one trajectory with (401.1, 39).
The AvgRE for a workload of queries is the mean RE of all
issued queries. It reflects the mean error in answering the query
workload.
ACKNOWLEDGMENT
The authors would like to thank A. Gkoulalas-Divanis (IBM
Research) for helpful discussions on the formulation of this
problem.
REFERENCES
APPENDIX B
MEASURING DATA UTILITY USING QUERY WORKLOADS
The AvgRE measure captures the accuracy of answering
queries on an anonymized dataset. The queries we consider
can be modeled as follows:
Q: SELECT COUNT(*)
FROM dataset
WHERE (u ∈ ICD, v ∈ Age) ∈ dataset, . . .
Let a(Q) be the answer of a COUNT() query Q when it is
issued on the original dataset. The value of a(Q) can be easily
obtained by counting the number of trajectories in the original
dataset that contain the (ICD, Age) pairs in Q.
Let e(Q) be the answer of Q when it is issued on the
anonymized dataset. This is an estimate because a generalized
value is interpreted as any leaf node in the subtree rooted by
that value in the DGH. Therefore, an anonymized pair may
correspond to any pair of possible ICD codes and Age values,
assuming each pair is equally likely. The value of e(Q) can be
obtained by computing the probability that a trajectory in the
anonymized dataset satisfies Q, and then summing these probabilities across all trajectories.
To illustrate how an estimate can be computed, assume that
a data recipient issues a query for the number of patients diagnosed with ICD code 401.1 at age 39 using the anonymized
dataset in Fig. 1(c). Referring to the DGHs in Figs. 3 and
4, it can be seen that the only trajectories that may contain
(401.1, 39) are the last two since they contain the generalized
[1] D. Blumenthal, “Stimulating the adoption of health information technology,” New Engl. J. Med., vol. 360, no. 15, pp. 1477–1479, 2009.
[2] D. A. Ludwick and J. Doucette, “Adopting electronic medical records in
primary care: Lessons learned from health information systems implementation experience in seven countries,” Int. J. Med. Informat., vol. 78,
pp. 22–31, 2009.
[3] A. K. Jha, C. M. DesRoches, E. G. Campbell, K. Donelan, S. R. Rao,
T. G. Ferris, A. Shields, S. Rosenbaum, and D. Blumenthal, “Use of
electronic health records in U.S. hospitals,” New Engl. J. Med., vol. 360,
pp. 1628–1638, 2009.
[4] B. B. Dean, J. Lam, J. L. Natoli, Q. Butler, D. Aguilar, and R. J. Nordyke,
“Review: Use of electronic medical records for health outcomes research:
A literature review,” Med. Care Res. Rev., vol. 66, pp. 611–638, 2009.
[5] K. Holzer and W. Gall, “Utilizing IHE-based electronic health record
systems for secondary use,” Methods Inf. Med., vol. 50, no. 4, pp. 319–
325, 2011.
[6] C. Safran, M. Bloomrosen, W. Hammond, S. Labkoff, S. Markel-Fox,
P. Tang, and D. E. Detmer, “Toward a national framework for the secondary
use of health data: An American medical informatics association white
paper,” J. Amer. Med. Informat. Assoc., vol. 14, pp. 1–9, 2007.
[7] K. Tu, T. Mitiku, H. Guo, D. S. Lee, and J. V. Tu, “Myocardial infarction
and the validation of physician billing and hospitalization data using electronic medical records,” Chron. Diseases Canada, vol. 30, pp. 141–146,
2010.
[8] C. A. McCarty, R. L. Chisholm, C. G. Chute, I. J. Kullo, G. P. Jarvik, E. B.
Larson, R. Li, D. R. Masys, M. D. Ritchie, D. M. Roden, J. P. Struewing,
and W. A. Wolf, “The eMERGE network: A consortium of biorepositories
linked to electronic medical records data for conducting genomic studies,”
BMC Med. Genomics, vol. 4, pp. 13–23, 2011.
[9] I. J. Kullo, K. Ding, H. Jouni, C. Y. Smith, and C. G. Chute, “A genomewide association study of red blood cell traits using the electronic medical
record,” Public Library Sci. ONE, vol. 5, e13011, 2010.
[10] J. A. Pacheco, P. C. Avila, J. A. Thompson, M. Law, J. A. Quraishi,
A. K. Greiman, E. M. Just, and A. Kho, “A highly specific algorithm for
identifying asthma cases and controls for genome-wide association studies,” in Proc. Amer. Med. Informat. Assoc. Annu. Symp., 2009, pp. 497–
501.
[11] M. D. Ritchie, J. C. Denny, D. C. Crawford, A. H. Ramirez, J. B. Weiner,
J. M. Pulley, M. A. Basford, K. Brown-Gentry, J. R. Balser, D. R. Masys,
J. L. Haines, and D. M. Roden, “Robust replication of genotype-phenotype
associations across multiple diseases in an electronic medical record,”
Amer. J. Human Genet., vol. 86, no. 4, pp. 560–572, 2010.
[12] M. N. Liebman, “Personalized medicine: A perspective on the patient,
disease and causal diagnostics,” Personal. Med., vol. 4, no. 2, pp. 171–
174, 2007.
TAMERSOY et al.: ANONYMIZATION OF LONGITUDINAL ELECTRONIC MEDICAL RECORDS
[13] A. Tucker and D. Garway-Heath, “The pseudotemporal bootstrap for predicting glaucoma from cross-sectional visual field data,” IEEE Trans. Inf.
Technol. Biomed., vol. 14, no. 1, pp. 79–85, Jan. 2010.
[14] S. Jensen, “Mining medical data for predictive and sequential patterns,”
in Proc. 5th Eur. Conf. Principles Pract. Knowl. Discov. Databases, 2001,
pp. 1–10.
[15] P. R. Burton, A. L. Hansell, I. Fortier, T. A. Manolio, M. J. Khoury,
J. Little, and P. Elliott, “Size matters: Just how big is big? Quantifying
realistic sample size requirements for human genome epidemiology,” Int.
J. Epidemiol., vol. 38, pp. 263–273, 2009.
[16] C. C. Spencer, Z. Su, P. Donnelly, and J. Marchini, “Designing genomewide association studies: Sample size, power, imputation, and the choice
of genotyping chip,” Public Library Sci. Genet., vol. 5, no. 5, e1000477,
2009.
[17] Policy for Sharing of Data Obtained in NIH Supported or Conducted
Genome-Wide Association Studies (GWAS), Nat. Inst. Health, Bethesda,
MD, NOT-OD-07-088, 2007.
[18] M. D. Mailman, M. Feolo, Y. Jin, M. Kimura, K. Tryka, R. Bagoutdinov,
L. Hao, A. Kiang, J. Paschall, L. Phan, N. Popova, S. Pretel, L. Ziyabari,
M. Lee, Y. Shao, Z. Y. Wang, K. Sirotkin, M. Ward, M. Kholodov, K. Zbicz,
J. Beck, M. Kimelman, S. Shevelev, D. Preuss, E. Yaschenko, A. Graeff,
J. Ostell, and S. T. Sherry, “The NCBI dbgap database of genotypes and
phenotypes,” Nature genet., vol. 39, no. 10, pp. 1181–1186, 2007.
[19] Standards for Protection of Electronic Health Information, Federal Register, Dept. Health, Human Services, Washington, DC, 45 CFR Pt. 164,
2003.
[20] K. Benitez and B. Malin, “Evaluating re-identification risks with respect
to the HIPAA privacy rule,” J. Amer. Med. Informat. Assoc., vol. 17, no. 2,
pp. 169–177, 2010.
[21] G. Loukides, J. C. Denny, and B. Malin, “The disclosure of diagnosis
codes can breach research participants’ privacy,” J. Amer. Med. Informat.
Assoc., vol. 17, no. 3, pp. 322–327, 2010.
[22] G. Loukides, A. Gkoulalas-Divanis, and B. Malin, “Anonymization of
electronic medical records for validating genome-wide association studies,” Proc. Nat. Acad. Sci., vol. 107, no. 17, pp. 7898–7903, 2010.
[23] G. Loukides, A. Gkoulalas-Divanis, and B. Malin, “Privacy-preserving
publication of diagnosis codes for effective biomedical analysis,” in Proc.
10th IEEE Int. Conf. Inf. Technol. Appl. Biomed., 2010, pp. 1–6.
[24] L. Sweeney, “k-anonymity: A model for protecting privacy,” Int. J. Uncertainty, Fuzz. Knowl.-Based Syst., vol. 10, no. 5, pp. 557–570, 2002.
[25] B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu, “Privacy-preserving
data publishing: A survey of recent developments,” ACM Comput. Surv.,
vol. 42, no. 4, pp. 1–53, 2010.
[26] B.-C. Chen, D. Kifer, K. LeFevre, and A. Machanavajjhala, “Privacypreserving data publishing,” Found. Trends Databases, vol. 2, no. 1–2,
pp. 1–167, 2009.
[27] R. S. Sandhu, E. J. Coyne, H. L. Feinstein, and C. E. Youman, “Role-based
access control models,” IEEE Comput., vol. 29, no. 2, pp. 38–47, Feb.
1996.
[28] B. Pinkas, “Cryptographic techniques for privacy-preserving data mining,” ACM Spec. Interest Group Knowl. Discov. Data Mining Explorat.,
vol. 4, no. 2, pp. 12–19, 2002.
[29] S. M. Diesburg and A.-I. A. Wang, “A survey of confidential data storage
and deletion methods,” ACM Comput. Surv., vol. 43, no. 1, pp. 1–37,
2010.
[30] N. R. Adam and J. C. Wortmann, “Security-control methods for statistical
databases: A comparative study,” ACM Comput. Surv., vol. 21, no. 4,
pp. 515–556, 1989.
[31] C. C. Aggarwal and P. S. Yu, “A survey of randomization methods
for privacy-preserving data mining,” in Privacy-Preserving Data Mining:
Models and Algorithms (ser. Advances in Database Systems), vol. 34.
New York: Springer-Verlag, 2008, pp. 137–156.
[32] L. Willenborg and T. De Waal, Statistical Disclosure Control in Practice
(ser. Lecture Notes in Statistics), vol. 111. New York: Springer-Verlag,
1996, pp. 1–152.
[33] N. Marsden-Haug, V. Foster, P. Gould, E. Elbert, H. Wang, and J. Pavlin,
“Code-based syndromic surveillance for influenzalike illness by international classification of diseases, ninth revision,” Emerg. Infect. Diseases,
vol. 13, pp. 207–216, 2007.
[34] P. Samarati, “Protecting respondents’ identities in microdata release,”
IEEE Trans. Knowl. Data Eng., vol. 13, no. 6, pp. 1010–1027, Nov./Dec.
2001.
[35] K. El Emam, F. K. Dankar, R. Issa, E. Jonker, D. Amyot, E. Cogo,
J. P. Corriveau, M. Walker, S. Chowdhury, R. Vaillancourt, T. Roffey,
and J. Bottomley, “A globally optimal k-anonymity method for the deidentification of health data,” J. Amer. Med. Informat. Assoc., vol. 16,
no. 5, pp. 670–682, 2009.
423
[36] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan, “Mondrian multidimensional k-anonymity,” in Proc. 22nd IEEE Int. Conf. Data Eng., 2006,
pp. 25–35.
[37] T. Dalenius, “Finding a needle in a haystack—or identifying anonymous
census record,” J. Offic. Statist., vol. 2, no. 3, pp. 329–336, 1986.
[38] M. E. Nergiz and C. Clifton, “Thoughts on k-anonymization,” Data
Knowl. Eng., vol. 63, no. 3, pp. 622–645, 2007.
[39] G. Loukides and J. Shao, “Capturing data usefulness and privacy protection in k-anonymisation,” in Proc. 22nd ACM Symp. Appl. Comput., 2007,
pp. 370–374.
[40] V. S. Iyengar, “Transforming data to satisfy privacy constraints,” in Proc.
8th ACM Int. Conf. Knowl. Discov. Data Mining, 2002, pp. 279–288.
[41] K. El Emam and F. K. Dankar, “Protecting privacy using k-anonymity,”
J. Amer. Med. Informat. Assoc., vol. 15, no. 5, pp. 627–637, 2008.
[42] M. Terrovitis, N. Mamoulis, and P. Kalnis, “Privacy-preserving
anonymization of set-valued data,” in Proc. Very Large Data Bases Endowment, 2008, vol. 1, no. 1, pp. 115–125.
[43] A. Tamersoy, G. Loukides, J. C. Denny, and B. Malin, “Anonymization of
administrative billing codes with repeated diagnoses through censoring,”
in Proc. Amer. Med. Informat. Assoc. Annu. Symp., 2010, pp. 782–786.
[44] M. E. Nergiz, M. Atzori, Y. Saygin, and B. Guc, “Towards trajectory
anonymization: A generalization-based approach,” Trans. Data Privacy,
vol. 2, no. 1, pp. 47–75, 2009.
[45] O. Abul, F. Bonchi, and M. Nanni, “Never walk alone: Uncertainty for
anonymity in moving objects databases,” in Proc. 24nd IEEE Int. Conf.
Data Eng., 2008, pp. 376–385.
[46] M. Terrovitis and N. Mamoulis, “Privacy preservation in the publication of
trajectories,” in Proc. 9th Int. Conf. Mobile Data Manag., 2008, pp. 65–72.
[47] L. Sweeney, “Achieving k-anonymity privacy protection using generalization and suppression,” Int. J. Uncertainty, Fuzz. Knowl.-Based Syst.,
vol. 10, no. 5, pp. 571–588, 2002.
[48] C. C. Aggarwal and P. S. Yu, “A condensation approach to privacy preserving data mining,” in Proc. 9th Int. Conf. Extend. Database Technol.,
2004, pp. 183–199.
[49] G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy,
D. Thomas, and A. Zhu, “Achieving anonymity via clustering,” in Proc.
25th ACM Symp. Principles Database Syst., 2006, pp. 153–162.
[50] S. B. Needleman and C. D. Wunsch, “A general method applicable to the
search for similarities in the amino acid sequence of two proteins,” J.
Molecular Biol., vol. 48, no. 3, pp. 443–453, 1970.
[51] T. F. Smith and M. S. Waterman, “Identification of common molecular
subsequences,” J. Molecular Biol., vol. 147, no. 1, pp. 195–197, 1981.
[52] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to
Algorithms, 2nd ed. Cambridge, MA: MIT Press, 2001.
[53] J. Domingo-Ferrer and J. M. Mateo-Sanz, “Practical data-oriented microaggregation for statistical disclosure control,” IEEE Trans. Knowl.
Data Eng., vol. 14, no. 1, pp. 189–201, Jan./Feb. 2002.
[54] J. Domingo-Ferrer and V. Torra, “Ordinal, continuous and heterogeneous
k-anonymity through microaggregation,” Data Mining Knowl. Discov.,
vol. 11, no. 2, pp. 195–212, 2005.
[55] D. M. Roden, J. M. Pulley, M. A. Basford, G. R. Bernard, E. W. Clayton,
J. R. Balser, and D. R. Masys, “Development of a large-scale de-identified
DNA biobank to enable personalized medicine,” Clin. Pharmacol. Therapeut., vol. 84, no. 3, pp. 362–369, 2008.
[56] A. H. Ramirez, J. S. Schildcrout, D. L. Blakemore, D. R. Masys, J. M.
Pulley, M. A. Basford, D. M. Roden, and J. C. Denny, “Modulators of normal electrocardiographic intervals identified in a large electronic medical
record,” Heart Rhythm, vol. 8, no. 2, pp. 271–277, 2011.
[57] J. C. Denny, M. D. Ritchie, D. C. Crawford, J. S. Schildcrout, A. H.
Ramirez, J. M. Pulley, M. A. Basford, D. R. Masys, J. L. Haines, and
D. M. Roden, “Identification of genomic predictors of atrioventricular
conduction: Using electronic medical records as a tool for genome science,” Circulation, vol. 122, pp. 2016–2021, 2010.
[58] C. C. Aggarwal, “On k-anonymity and the curse of dimensionality,” in
Proc. 31st Int. Conf. Very Large Data Bases, 2005, pp. 901–909.
[59] Y. Xu, K. Wang, A. W.-C. Fu, and P. S. Yu, “Anonymizing transaction
databases for publication,” in Proc. 14th ACM Int. Conf. Knowl. Discov.
Data Mining, 2008, pp. 767–775.
[60] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam,
“-diversity: Privacy beyond k-anonymity,” in Proc. 22nd IEEE Int. Conf.
Data Eng., 2006, pp. 24–35.
Authors’ photographs and biographies not available at the time of publication.
Download