Privacy and Confidence Issue of Genomic Data

CSE
5810
Xin Li
Computer Science & Engineering Department
The University of Connecticut
xin.li@uconn.edu
Spring 2016
Introduction


Genomic data is the fundamental component of life
Outline



Genomic Data Introduction
 STR, SNPs, DNA Sequence
Privacy and confidence issue
 General strategy
 Methods for each type of data
– STR, SNPs, and sequence
– Law, methodology
Future work
Genomic Data Introduction - STR


Short tandem repeats, microsatellite DNA
How to obtain it
 PCR (amplification)
 Fluorescence (dyeing)
 Electrophoresis (measurement)
What the data looks like
 Peaks, alleles
Applications

Disease diagnosis
 Down’s syndrome
Medication
 TB (tuberculosis) drug resistance detection
Treatment tracking
 Bone marrow transplant
Forensics
 Pedigree analysis
 Mixture analysis
Genomic Data Introduction - SNPs


Single Nucleotide Polymorphisms
Methods of SNP detection
 Sequencing, TaqMan probe, Beckman SNP, SNaPshot, DNA chip
 Results are displayed as a group of peaks
 Each peak stands for one SNP
Applications



Disease detection
Human evolution, phylogeny, and human migration
Personalized medicine
 Predict side effects of drugs
 Predict possible effects of drugs
 Predict doses required
Genomic Data Introduction - DNA Sequence


Contains all types of information about an individual
Three generations of DNA sequencing methods
 First generation: Sanger sequencing
 Second generation: next-generation sequencing (NGS)
– Roche 454: the bead micro-reactor method
– Illumina Solexa: the bridge amplification strategy
– Applied Biosystems SOLiD: the color coding solution
 Third generation: nanopore sequencing
Genomic Data Introduction - DNA Sequence

NGS is still the most popular method for DNA sequencing
 First-generation technology (such as Sanger)
– High cost
– Low throughput
– Time consuming
 Third-generation technology (such as nanopore)
– High cost
– High error rate
– Low throughput
 The privacy protection strategies for sequence data discussed later are based on NGS
Many applications are based on sequencing data
 Medical diagnosis
 Clinical use
 Academic research
 Personalized medication
Privacy and Confidence – General Strategies

Three types of information can be used for identification




Administrative or demographic tags
Overt descriptors
Indirect clues
Three strategies for preprocessing identities before the data is published
Blurring, removing, or destroying the information that could lead to identification of the subjects
[Figure: information used for identification]
 Administrative or demographic tags
 Overt descriptors: gender, eye color, etc.
 Indirect clues: number of children, etc.
 Weak identifiers have to be linked with other data (for example, an SSN with an SSN database); strong identifiers include birthday and genomic data
 Preprocessing strategies: blurring, removing, destroying
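The blurring and removing strategies above can be sketched on a single record. This is only an illustration: the field names (`ssn`, `birth_date`, `eye_color`) and the choice to blur a birth date down to its year are assumptions, not something the slides specify.

```python
def deidentify(record):
    """Return a copy of a record with identifying fields removed or blurred.

    Field names are hypothetical and chosen only for this sketch.
    """
    cleaned = dict(record)
    # Removing: drop a tag that identifies a person once linked to another database.
    cleaned.pop("ssn", None)
    # Blurring: keep only the year of an ISO-formatted birth date (YYYY-MM-DD).
    if "birth_date" in cleaned:
        cleaned["birth_year"] = cleaned.pop("birth_date")[:4]
    return cleaned


example = {"ssn": "000-00-0000", "birth_date": "1980-05-17", "eye_color": "brown"}
print(deidentify(example))
```

The input record is left untouched, so the original can stay in a protected store while only the cleaned copy is published.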
Privacy and Confidence – General Strategies
System Level (Permission Model)

RBAC, Contextual RBAC, SitBAC

 RBAC0, RBAC1, RBAC2, RBAC3
[Figure: class diagrams for RBAC0 through RBAC3, relating User, Role, Permission, and Session through UserRoleRelation and RolePRMSRelation, with SSD and DSD constraints]
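A minimal sketch of the core user, role, permission, and session relations may help make the permission model concrete. It covers only the basic RBAC0-style relations; the class name `RBAC0Sketch` and the user, role, and permission names are hypothetical, and hierarchies and SSD/DSD constraints are left out.

```python
class RBAC0Sketch:
    """Toy model: users get roles, roles get permissions, sessions activate roles."""

    def __init__(self):
        self.user_roles = {}        # user -> set of assigned roles
        self.role_permissions = {}  # role -> set of granted permissions

    def assign_role(self, user, role):
        self.user_roles.setdefault(user, set()).add(role)

    def grant_permission(self, role, permission):
        self.role_permissions.setdefault(role, set()).add(permission)

    def open_session(self, user, roles):
        # A session may only activate roles that are assigned to the user.
        requested = set(roles)
        assigned = self.user_roles.get(user, set())
        if not requested <= assigned:
            raise PermissionError(f"{user} is not assigned {sorted(requested - assigned)}")
        return requested

    def is_allowed(self, session_roles, permission):
        # Access is granted if any active role carries the permission.
        return any(
            permission in self.role_permissions.get(role, set())
            for role in session_roles
        )


rbac = RBAC0Sketch()
rbac.assign_role("alice", "clinician")
rbac.grant_permission("clinician", "read_genotype")
session = rbac.open_session("alice", ["clinician"])
print(rbac.is_allowed(session, "read_genotype"))
```

Permissions are never attached to users directly, which is the point of the model: changing what a clinician may read is a single change on the role.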
Privacy and Confidence –Strategies based on genomic data

We have to share genomic data with others during analysis
 Data comes from different places
 Sharing data is inevitable
– Need reference data
– Correlation analysis
[Figure: Genomic Data 1 through Genomic Data n]
What genomic data contains
 Genetic information, which differs by data type
– STR, SNP, NGS
 Personal information
– Name, gender, race, etc.
– Hidden using the general strategies mentioned before
 Experimental information
– Machine info, running parameters, etc.
 Complementary information
– Allele frequency, mutation, indel, quality, etc.
Discuss the privacy issue for STR, SNP, and sequence data respectively
Privacy Issue of STR Data


13 loci (CODIS) identify a person with 99.9% accuracy
Law and policy





The FBI is forbidden from performing kinship retrieval on the CODIS dataset in most states, except California, Colorado, Texas, and Virginia
Kinship retrieval through DNA databases is totally forbidden in Maryland and D.C.
Loci in non-coding areas of autosomes: they do not contain any genomic information, so there is no privacy issue to worry about
Loci on sex chromosomes: Y-STR database; some restrictive policies have been created at every step of building the database
The “Genetic Information Nondiscrimination Act” (GINA) protects people whose health information is exposed by genetic data
[Figure: STR privacy overview. Autosomal STR: coding area (GINA) and non-coding area (CODIS database, kinship retrieval forbidden for the FBI). Sex chromosome STR: Y-STR database, kinship information, database and UI design.]
Privacy Issue of SNP Data




Correspond to typical features of an individual’s genome (some potential diseases)
SNPs can be used to identify individuals, similarly to STRs
Law and Policy
 Ban on using DNA testing in hiring or firing employees
 Genetic Information Nondiscrimination Act
 Risks: discrimination, stigmatization, loss of insurance, loss of employment (in a study published by Dorothy Wertz)
Other Strategies in Methodology
 Data classification
– SNPs that provide little or no direct personal information, or that yield commonly occurring haplotypes: less likely to be considered a privacy issue
– Phenotypic SNP testing: tightly connected with privacy issues
 Application classification (different application cases)
– Forensic: restricted usage
– Without strict constraints
Privacy Issue of DNA Sequence Data



Widely used in human medical care
Privacy issues are arising as genomics matures
Two challenges
 Individuals or third parties may not be trusted, and the data may be abused
 Finding a way to share personal genomic data without being constrained by privacy
Motivation of the strategies of application
 Protect personal information
 Balance protecting privacy and fostering research
[Figure: strategies based on genomic data]
 Health care: disease diagnosis, mental health, illness risks
 Research: linking genotype with phenotype; balancing privacy and research
 Identification: individual information
 Law and policy: omnibus regulations; the Health Insurance Portability and Accountability Act (HIPAA) for applications; protection of human subjects for the research community
 De-identifying: (1) voluntarily allowed to be public, (2) release on time, (3) based on sequence length
Privacy Issue of DNA Sequence Data

Three approaches to link the data with a person




Match genotype with reference genotype
Linking genomic data with other associated data
Profiling from genomic characteristics
Three methods to de-identify genomic data



Limiting the proportion of genome released
Statistically degrading the data before releasing
Sequestering identifiers via key-coding
[Figure: identification and de-identification approaches]
 Identify
– Match genotype with reference: commercial database
– Link with associated data: clinical and social data
– Profile from genomic characteristics: phenotype
 De-identify
– Limiting the proportion of genome released: publish only limited segments. Defect: how much limitation should be applied?
– Statistically degrading the data: fuzz data by adding statistical noise, randomly altering or exchanging a small percentage of SNPs (G<->A, C<->T). Defect: can make the data useless
– Sequestering identifiers via key-coding: identifying data, substantive data, and key. Defect: who is responsible for the key system?
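The "statistically degrading" approach can be illustrated with a short sketch that randomly alters a small percentage of SNP positions using the G<->A and C<->T exchanges named in the figure. The function name, the `fraction` parameter, and the input string are assumptions made for this example; a real release would need a noise level chosen with the downstream analysis in mind.

```python
import random

# Exchanges taken from the figure: G<->A and C<->T.
EXCHANGE = {"G": "A", "A": "G", "C": "T", "T": "C"}


def degrade_snps(snps, fraction, seed=None):
    """Return a copy of `snps` with a given fraction of positions exchanged."""
    rng = random.Random(seed)
    n_altered = round(len(snps) * fraction)
    positions = rng.sample(range(len(snps)), n_altered)
    degraded = list(snps)
    for pos in positions:
        degraded[pos] = EXCHANGE[degraded[pos]]
    return "".join(degraded)


original = "GATTACAGATTACAGATTAC"  # hypothetical 20-SNP string
noisy = degrade_snps(original, fraction=0.1, seed=1)
changed = sum(a != b for a, b in zip(original, noisy))
print(noisy, changed)
```

The trade-off from the figure is visible in the `fraction` parameter: a larger value hides more of the individual but also removes more of the signal that researchers need.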
Privacy Issue of DNA Sequence Data –Method 1

Main idea




Individuals release special-purpose, cryptographically protected information about their genome
The released information does not contain any useful information about the individual’s genome
No information about the genomes of the individuals is revealed in the process of identifying relatives
Framework

Use three individuals as the example
 Each of them has 24 SNPs

Four phases: data preprocessing, data encryption, building the secure genomic sketch (SGS) value, and identification based on the SGS
Data preprocessing

Set the thresholds that distinguish relatives of the first generation from those of the second generation, and so on
 If two individuals are relatives, they share a number of identical DNA fragments

Use SNPs to express the genomic data (IBD area, reference)
 Divide the data into multiple segments
– Each segment contains the same number of SNPs
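The segmentation step of the preprocessing phase can be sketched as follows. The 24-SNP length comes from the example above, while the segment size of 4 and the function name are assumptions chosen only for illustration.

```python
def split_into_segments(snps, segment_size):
    """Split a SNP sequence into consecutive segments of equal size."""
    if segment_size <= 0 or len(snps) % segment_size != 0:
        raise ValueError("sequence length must be a positive multiple of segment_size")
    return [snps[i:i + segment_size] for i in range(0, len(snps), segment_size)]


haplotype = "ACGTACGTAAGTACCTACGTTCGT"  # hypothetical 24-SNP haplotype
segments = split_into_segments(haplotype, segment_size=4)
print(len(segments), segments[0])
```

Rejecting lengths that do not divide evenly keeps the invariant stated on the slide, that every segment holds the same number of SNPs.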
Privacy Issue of DNA Sequence Data –Method 1

Data encryption

Use “fuzzy encryption” to encrypt the data
 Traditional encryption schemes
– The key required for decryption must be identical to the key used in encryption
 Fuzzy encryption schemes
– The encryption key and decryption key only need to be similar
» The degree of similarity corresponds to relatives of different generations
– The encoded result is called the genome sketch (GS)

Five steps of construction
 Convert the haplotype values of each segment into a pair of binary numbers
– Each digit represents a SNP position in the segment
– “0” represents the major allele and “1” represents the minor allele
 Represent the segment number as a binary number
– given by the position of the segment in the original haplotype
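The first two construction steps can be sketched as below. The sketch assumes that the major allele at each position is known from a reference, and that the "pair" of binary numbers corresponds to the two haplotypes of one individual; both the function names and the example alleles are hypothetical.

```python
def encode_segment(alleles, major_alleles):
    """Encode one haplotype segment: '0' for the major allele, '1' for the minor."""
    if len(alleles) != len(major_alleles):
        raise ValueError("segment and reference must have the same length")
    return "".join("0" if a == m else "1" for a, m in zip(alleles, major_alleles))


def encode_pair(haplotype_a, haplotype_b, major_alleles):
    """Encode both haplotypes of a segment as a pair of binary strings."""
    return (encode_segment(haplotype_a, major_alleles),
            encode_segment(haplotype_b, major_alleles))


def segment_number_bits(index, width):
    """Represent the position of a segment in the haplotype as a binary string."""
    return format(index, f"0{width}b")


print(encode_pair("ACGT", "ACTT", major_alleles="ACTT"))
print(segment_number_bits(5, width=3))
```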
Privacy Issue of DNA Sequence Data –Method 1

Data encryption

Five steps of construction
 Transform the segment binary value into a hash value with a hash function
– Concept
» Reduce collisions and guarantee that segments differing in even one position get totally different hash values
– One way to choose the hash function
» Add the sequence binary value and the segment number binary value together, and take several tail bits as the hash value of this segment
» In our example the length of each binary unit is 3
– Every segment is assigned its own binary value
 Delete duplicate binary values
– The remaining binary codes are viewed as the genome sketch (GS) of the current genome
– Relative identification
» If 3/4 of the GS unit values of two individuals match, we consider them relatives
 Represent the full GS of an individual as a vector of size 2^k
– k is the bit length of each sketch value, so there are 2^k possible sketch values
» In our case the length is 2^3 = 8
– Each position in the vector corresponds to a potential sketch element
» The vector has a ‘1’ if the individual’s GS contains that element and a ‘0’ otherwise
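The remaining construction steps can be sketched with the toy hash described above: add the segment value to the segment number and keep the k = 3 tail bits. The segment values are hypothetical, and the way the 3/4 rule is computed here (shared elements divided by the larger sketch size) is an assumption, since the slide does not define the denominator. This toy hash is for illustration only and does not have the collision properties a real scheme needs.

```python
def toy_hash(segment_bits, segment_index, k=3):
    """Add the segment value and segment number, then keep the k tail bits."""
    total = int(segment_bits, 2) + segment_index
    return total & ((1 << k) - 1)


def genome_sketch(segments, k=3):
    """Hash every segment and drop duplicates; the resulting set is the GS."""
    return {toy_hash(bits, index, k) for index, bits in enumerate(segments)}


def sketch_vector(sketch, k=3):
    """Indicator vector of size 2^k: 1 where the GS contains that element."""
    return [1 if value in sketch else 0 for value in range(1 << k)]


def likely_relatives(sketch_a, sketch_b, threshold=0.75):
    """Assumed reading of the 3/4 rule: shared elements over the larger sketch."""
    shared = len(sketch_a & sketch_b)
    return shared / max(len(sketch_a), len(sketch_b)) >= threshold


segments = ["0010", "0110", "0000", "1111", "0001", "1000"]  # hypothetical
gs = genome_sketch(segments)
print(sorted(gs), sketch_vector(gs))
```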
Privacy Issue of DNA Sequence Data –Method 1

Fuzzy Encryption (GS: Genomic Sketch)
Privacy Issue of DNA Sequence Data –Method 1

Advantages of the GS vector

Hard for others to convert the hash value back to the original genomic value
 We use a hash function to build this value
Saves much memory and disk space compared with the original sequence value
It will be used for the comparison of genomic data later
Build Secure Genomic Sketch (SGS) Value
Randomly choose one row of the ECC matrix and add it to the GS value
For the ECC matrix
 The number of columns should be the same as the length of the GS code
The SGS value is finally published to a public database
 This value comes from a random combination
 It cannot be used for any application other than relative identification
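A small sketch of building the SGS from a GS vector follows. It assumes that "adding" a row means bitwise addition modulo 2 (XOR) on binary vectors, and it uses a hypothetical four-row codebook of length 8 in place of a real ECC matrix; neither detail is given on the slides.

```python
import random

# Hypothetical toy ECC matrix: 4 codewords of length 8, pairwise distance >= 4.
ECC = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
]


def xor_bits(a, b):
    """Bitwise addition modulo 2 of two equal-length binary vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [x ^ y for x, y in zip(a, b)]


def build_sgs(gs_vector, ecc, rng=None):
    """Mask the GS vector with a randomly chosen ECC row to obtain the SGS."""
    rng = rng or random.Random()
    codeword = rng.choice(ecc)
    return xor_bits(gs_vector, codeword)


gs_vector = [0, 0, 1, 0, 0, 1, 0, 1]
sgs = build_sgs(gs_vector, ECC, rng=random.Random(0))
print(sgs)
```

The length check mirrors the requirement above that the ECC matrix has as many columns as the GS code has positions.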
Privacy Issue of DNA Sequence Data –Method 1

Build Secure Genomic Sketch (SGS) Value
Privacy Issue of DNA Sequence Data –Method 1

Identification based on SGS

Use our own GS value and the public SGS value for relative identification
 Subtract our GS value from the public SGS value, and treat the result as the query code
 Compare the query code with each row of the ECC matrix
– Use a threshold on the difference to identify the relationship between two individuals.
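The identification phase can be sketched as a nearest-row search. As before, subtraction on binary vectors is taken to be XOR, the four-row codebook is a hypothetical stand-in for the ECC matrix, and the threshold of 1 differing bit is an assumption for this toy size.

```python
# Hypothetical toy ECC matrix: 4 codewords of length 8, pairwise distance >= 4.
ECC = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
]


def xor_bits(a, b):
    return [x ^ y for x, y in zip(a, b)]


def hamming(a, b):
    """Number of positions where two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def is_relative(own_gs, public_sgs, ecc, threshold):
    """Build the query code and accept if some ECC row is within the threshold."""
    query = xor_bits(public_sgs, own_gs)
    closest = min(hamming(query, row) for row in ecc)
    return closest <= threshold


own = [0, 0, 1, 0, 0, 1, 0, 1]
relative = [0, 0, 1, 0, 0, 1, 1, 1]   # differs from `own` in one position
stranger = [1, 1, 1, 0, 0, 1, 0, 1]   # differs from `own` in two positions
sgs_relative = xor_bits(relative, ECC[1])
sgs_stranger = xor_bits(stranger, ECC[1])
print(is_relative(own, sgs_relative, ECC, threshold=1))
print(is_relative(own, sgs_stranger, ECC, threshold=1))
```

If the two GS vectors are close, the query code lands near the ECC row that was used to mask the published SGS; if they are far apart, it lands between rows and the threshold test fails.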
Privacy Issue of DNA Sequence Data –Method 1

Some aspects to note

In real cases, the dimension of the code becomes very large
 Since the genome is very long, the GS code will be very long as well
 The ECC matrix will have width 2^24
Computational complexity increases in both encoding and decoding of GSs
 Much larger number of segments and sketch elements
Utilizes an improved version of the Juels-Sudan construction
 To scale to the genome
Major contributions of this method
Provides an efficient way of encoding the genome based on SNP points
Provides an efficient way of decoding based on the set of GSs
Provides a method to create the SGS from the GS, and to detect relationships based on the GS and SGS codes
Privacy Issue of DNA Sequence Data –Method 2

Comparison with Method 1


Requires both individual genomic data and the reference data
Different data is used for relationship detection
Considers all variants rather than just the IBD area
Main idea
Fuzzy extractor vs. traditional encryption and decryption protocol
Create both a private key and a public key for genomic data
Build Genomic Sketch (GS): hash function
Build Secure Genomic Sketch (SGS)
 Based on a method called list decoding (not an ECC matrix)
Find relatives: based on the private key and public key
Very similar to the previous method
Future Work



Let the public key contain more types of feature information, which could let us identify relationships across more generations
How to store those public keys more efficiently without privacy issues
How to balance space complexity and time complexity at the algorithm level
 Currently, if all types of information are saved in the public key, the code becomes too long to compute during comparison
“Advanced” RBAC Model


Contextual RBAC: Contextual role–based access control authorization
model
SitBAC: Situation-based access control model