Privacy Preserving Data Mining

Healthcare privacy and security:
Genomic data privacy
Li Xiong
CS573 Data Privacy and Security
Genomic data privacy
 Genomic data are increasingly collected, stored, and
shared in research and clinical environments
 Genomic data are person-specific (there exists no
public registrar that maps genomes to names)
 Genomic data are not specified as an identifying
patient attribute under the HIPAA Privacy Rule and may
be released for public research purposes
How can person-specific DNA be shared such that it
cannot be associated with its explicit identity?
Data sharing scenario
 John Smith is admitted to a local hospital, which
stores clinical and DNA information
 John visits other hospitals
 The hospital forwards certain DNA data to a
research group, labeled with the institution and
pseudonyms of the patients
 The hospital sends identified discharge
records to a state-controlled database
Data at a specific location
 Identified table of patient demographics
 De-identified DNA sequences
 Can we uniquely link identified data to DNA?
Data at multiple locations
 Each site has an identified
table and de-identified DNA
 Can we uniquely link
identified data to DNA?
 The set of locations each patient visited is called a trail
 The trails can be tracked and matched to link DNA data to
identified data
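
Trail construction can be sketched in a few lines of Python. This is only an illustration under assumptions: the location labels, names, DNA pseudonyms, and the helper build_trails are hypothetical and do not come from the lecture. Each trail is represented as a tuple of 0/1 flags, one per location.

```python
# Hypothetical releases: which names / DNA pseudonyms each location published.
LOCATIONS = ["H1", "H2", "H3"]

identified = {
    "H1": {"John", "Mary"},
    "H2": {"John", "Bob"},
    "H3": {"John", "Mary", "Bob"},
}
dna = {
    "H1": {"dna_a", "dna_b"},
    "H2": {"dna_a", "dna_c"},
    "H3": {"dna_a", "dna_b", "dna_c"},
}


def build_trails(releases, locations):
    """Map each entity to a tuple of 0/1 flags, one flag per location."""
    entities = set().union(*releases.values())
    return {
        entity: tuple(int(entity in releases[loc]) for loc in locations)
        for entity in sorted(entities)
    }


name_trails = build_trails(identified, LOCATIONS)
dna_trails = build_trails(dna, LOCATIONS)
```

In this toy data, each name trail has exactly one identical DNA trail, which is the situation that makes linkage possible.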
 Re-identification of data in trails (REIDIT) for
complete publishing
 If there is a unique trail match, then a
re-identification occurred
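
A minimal sketch of the unique-match idea, assuming complete publishing so that a patient's name trail and DNA trail are identical. The function name reidit_c and the sample trails are hypothetical; this is a simplified reading of the slide, not the published algorithm verbatim.

```python
from collections import defaultdict


def reidit_c(name_trails, dna_trails):
    """Link a name to a DNA record when exactly one of each carries a trail."""
    names_by_trail = defaultdict(list)
    for name, trail in name_trails.items():
        names_by_trail[trail].append(name)

    dna_by_trail = defaultdict(list)
    for record, trail in dna_trails.items():
        dna_by_trail[trail].append(record)

    links = {}
    for trail, names in names_by_trail.items():
        records = dna_by_trail.get(trail, [])
        # A unique trail on both sides means the match is a re-identification.
        if len(names) == 1 and len(records) == 1:
            links[names[0]] = records[0]
    return links


# Hypothetical trails over three locations.
name_trails = {"John": (1, 1, 0), "Mary": (1, 0, 1),
               "Bob": (0, 1, 1), "Ann": (0, 1, 1)}
dna_trails = {"dna_a": (1, 1, 0), "dna_b": (1, 0, 1),
              "dna_c": (0, 1, 1), "dna_d": (0, 1, 1)}

links = reidit_c(name_trails, dna_trails)
```

Here John and Mary are linked to their DNA, while Bob and Ann share a trail and so remain ambiguous.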
REIDIT-C reidentification
 Re-identifiability is related to the average number of people
per location
Reserved publishing
 Data releasers can reserve certain information
 N is reserved to P vs. P is reserved to N
REIDIT - Incomplete
 REIDIT for reserved publishing
 For each trail in the track with incomplete
trails, if it has only one supertrail in the other track, then a re-identification occurred
 Remove the re-identified supertrail
Important because a trail can be a supertrail to
many trails
 Repeat the process
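
The loop above can be sketched as follows. This is a simplified illustration, assuming one track holds complete trails and the other holds incomplete trails (some visits reserved), and treating every record individually. The names is_supertrail and reidit_i and the sample data are hypothetical.

```python
def is_supertrail(full, partial):
    """True if `full` has a 1 at every location where `partial` has a 1."""
    return all(f >= p for f, p in zip(full, partial))


def reidit_i(incomplete, complete):
    """Iteratively link incomplete trails that have exactly one supertrail."""
    remaining_incomplete = dict(incomplete)
    remaining_complete = dict(complete)
    links = {}
    changed = True
    while changed:  # repeat until a full pass finds no new link
        changed = False
        for record, trail in list(remaining_incomplete.items()):
            candidates = [
                entity
                for entity, full in remaining_complete.items()
                if is_supertrail(full, trail)
            ]
            if len(candidates) == 1:
                links[record] = candidates[0]
                # Remove the matched supertrail so it stops blocking others.
                del remaining_incomplete[record]
                del remaining_complete[candidates[0]]
                changed = True
    return links


# Hypothetical data: DNA trails are incomplete, name trails are complete.
dna_trails = {"dna_a": (1, 1, 0), "dna_b": (0, 0, 1), "dna_c": (0, 1, 0)}
name_trails = {"John": (1, 1, 0), "Mary": (1, 0, 1), "Bob": (0, 1, 1)}

links = reidit_i(dna_trails, name_trails)
```

In this example dna_b initially has two supertrails (Mary and Bob) and is only resolved on the second pass, after Bob's trail has been removed. That is why removal and repetition matter.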
Figure legend: 0.0, 0.1, 0.5, 0.9 are the probabilities of reserving information; hospitals are ranked by number of patients
Can masking location help?
Not necessarily!
Comments and open issues
 Can k-anonymity solve the problem?
 Pseudonyms subject to dictionary attacks,
how to allow linkage of the data without
 Genomic protection methods incorporating
utility of the genomic data
e.g. Utah Resource for Genetic and Epidemiologic Research (RGE)
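
To illustrate the dictionary-attack concern raised for pseudonyms, here is a minimal sketch. It assumes, hypothetically, that a pseudonym is an unsalted SHA-256 hash of an identifier drawn from a small, guessable space; the identifier format "patient-NNNN" and both function names are made up for the example.

```python
import hashlib


def pseudonym(identifier: str) -> str:
    """Hypothetical pseudonym scheme: unsalted SHA-256 of the identifier."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


def dictionary_attack(target, candidates):
    """Hash every candidate identifier and return the one matching `target`."""
    for candidate in candidates:
        if pseudonym(candidate) == target:
            return candidate
    return None


# The attacker sees only the released pseudonym ...
released = pseudonym("patient-0427")

# ... but can enumerate the whole (small) identifier space.
candidates = (f"patient-{i:04d}" for i in range(10000))
recovered = dictionary_attack(released, candidates)
```

Because the same deterministic mapping is what allows records to be linked across sites, it is also what lets an attacker test guesses, which is the tension the bullet above points at.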