CSE 5810
Privacy and Confidence Issue of Genomic Data
Xin Li
Computer Science & Engineering Department
The University of Connecticut
xin.li@uconn.edu
Spring 2016

Introduction

- Genomic data is the fundamental component of life
- Outline
  - Genomic data introduction
    - STR, SNPs, DNA sequence
  - Privacy and confidence issues
    - General strategy
    - Methods for each type of data: STR, SNPs, and sequence
    - Law, methodology
  - Future work

Genomic Data Introduction - STR

- Short tandem repeats (microsatellite DNA)
- How to obtain it
  - PCR (amplification)
  - Fluorescence (dyeing)
  - Electrophoresis (measurement)
- What the data looks like
  - Peaks, alleles
- Applications
  - Disease diagnosis
    - Down's syndrome
  - Medication
    - TB (tuberculosis) drug resistance detection
  - Treatment tracking
    - Bone marrow transplant
  - Forensics
    - Pedigree analysis
    - Mixture analysis

Genomic Data Introduction - SNPs

- Single nucleotide polymorphisms
- Methods of SNP detection
  - Sequencing, TaqMan probe, Backman SNP, SNaPshot, DNA chip
- Displayed as a group of peaks
  - Each peak stands for one SNP
- Applications
  - Disease detection
  - Human evolution, phylogeny, and human migration
  - Personalized medicine
    - Predict side effects of drugs
    - Predict possible effects of drugs
    - Predict required doses

Genomic Data Introduction - DNA Sequence

- Contains all types of information about an individual
- Three generations of DNA sequencing methods
  - First generation: Sanger sequencing
  - Second generation: Next Generation Sequencing (NGS)
    - Roche 454: bead micro-reactor method
    - Illumina Solexa: bridge amplification strategy
    - Applied Biosystems SOLiD: color coding solution
  - Third generation: nanopore sequencing

Genomic Data Introduction - DNA Sequence

- NGS is still the most popular method for DNA sequencing
  - First-generation technology (such as Sanger)
    - High cost
    - Low throughput
    - Time consuming
  - Third-generation technology (such as nanopore)
    - High cost
    - High error rate
    - Low throughput
- The privacy protection strategies for sequence data discussed later are based on NGS
- Many applications are based on sequencing data
  - Medical diagnosis
  - Clinical use
  - Academic research
  - Personalized medication

Privacy and Confidence - General Strategies

- Three types of information are used for identification
  - Administrative or demographic tags
  - Overt descriptors
  - Indirect clues
- Three strategies for preprocessing identities before the data is published
  - Blurring, removing, or destroying the information that could lead to identification of the subjects
- [Diagram: information used for identification]
  - Administrative or demographic tags
    - Weak: must be linked with other data (an SSN with an SSN database)
    - Strong: birthday, genomic data
  - Overt descriptors: gender, eye color, etc.
  - Indirect clues: number of children, etc.
  - Preprocessing options: blurring, removing, destroying

Privacy and Confidence - General Strategies

- System level (permission model)
  - RBAC, Contextual RBAC, SitBAC
  - RBAC0, RBAC1, RBAC2, RBAC3
- [Diagrams: entity models for RBAC0 through RBAC3, relating User, Role, Permission, and Session through UserRoleRelation and RolePRMSRelation, with SSD and DSD constraints added in the later models]

Privacy and Confidence - Strategies Based on Genomic Data

- Genomic data has to be shared with others during analysis
  - Data comes from different places
  - Sharing data is inevitable
  - Reference data is needed
  - [Diagram: correlation analysis across Genomic Data 1 through Genomic Data n]
- What genomic data contains
  - Genetic information: differs by data type (STR, SNP, NGS)
  - Personal information: name, gender, race, etc. It is hidden using the general strategies mentioned before.
  - Experimental information: machine info, running parameters, etc.
  - Complementary information: allele frequency, mutation, indel, quality, etc.
- The privacy issues of the genetic information are discussed for STR, SNP, and sequence (NGS) data respectively

Privacy Issue of STR Data

- 13 loci (CODIS) identify a person with 99.9% accuracy
- Law and policy
  - The FBI is forbidden to perform kinship retrieval on the CODIS dataset in most states, except California, Colorado, Texas, and Virginia
  - Kinship retrieval through DNA databases is totally forbidden in Maryland and D.C.
- Loci in non-coding areas of autosomes (not a concern)
- Loci on sex chromosomes: Y-STR database
  - Restrictive policies were created for every step of building the database
- The "Genetic Information Nondiscrimination Act" (GINA) protects people whose health information is exposed by genetic data
- [Diagram: STR data]
  - Autosomal, non-coding area (CODIS database)
    - 1: Do not contain any genomics
    - 2: No need to worry about the privacy issue
    - Kinship retrieval: forbidden for the FBI
  - Autosomal, coding area: GINA
  - Sex chromosome (Y-STR database): kinship information; database and UI design; restrictive policies at every step of building the database

Privacy Issue of SNP Data

- SNPs correspond to typical features of an individual's genome (some potential diseases)
- SNPs can be used to identify individuals, similarly to STR
- Law and policy
- Other strategies in methodology
  - Data classification
  - Different cases of application
- [Diagram: SNP data]
  - Some potential diseases: discrimination, stigmatization, loss of insurance, loss of employment (from a study published by Dorothy Wertz)
  - Used like STR for identification
  - Law and policy
    - Ban on using DNA testing in hiring or firing employees
    - Genetic Information Nondiscrimination Act
  - Data classification
    - SNPs that provide little or no direct personal information and yield commonly occurring haplotypes: less likely to be considered a privacy issue
    - Phenotypic SNP testing: tightly connected with the privacy issue
  - Application classification
    - Without strict constraints
    - Forensic: restricted usage

Privacy Issue of DNA Sequence Data

- Widely used in human medical care
- Privacy issues arise as genomics matures
- Two challenges
  - Individuals or third parties may not be trusted
  - Finding a way to share personal genomic data without being constrained by privacy
- Strategies based on genomic data
  - Motivation of the strategies
    - Protect personal information
    - Balance protecting privacy and fostering research
- [Diagram: DNA sequence data]
  - Health care: disease diagnosis, mental health, illness risks
  - Research: identification, linking genotype with phenotype, balancing privacy and research
  - Risk: individual information may be abused
  - Law and policy
    - Omnibus regulations
    - Health Insurance Portability and Accountability Act (application)
    - Protection of human subjects (research community)
  - Identify vs. de-identify
    - 1: Voluntarily allowed to be public
    - 2: Release on time
    - 3: Based on sequence length

Privacy Issue of DNA Sequence Data

- Three approaches to link the data with a person
  - Matching a genotype with a reference genotype
  - Linking genomic data with other associated data
  - Profiling from genomic characteristics
- Three methods to de-identify genomic data
  - Limiting the proportion of the genome released
  - Statistically degrading the data before release
  - Sequestering identifiers via key-coding
- [Diagram: identify vs. de-identify]
  - Identify
    - Match genotype with reference: commercial database
    - Link with associated data: clinical and social data, phenotype
    - Profile from genomic characteristics
  - De-identify
    - Limiting the proportion of the genome released: publish only limited segments
      - Defect: how much limitation should be applied? Too much makes the data useless.
    - Statistically degrading the data: fuzz the data by adding statistical noise, randomly altering or exchanging a small percentage of SNPs (G<->A, C<->T)
    - Sequestering identifiers via key-coding: identifying data, substantive data, and key
      - Open question: who is responsible for the de-identification key system?

Privacy Issue of DNA Sequence Data - Method 1

- Main idea
  - Individuals release special-purpose, cryptographically protected information about their genome
    - It does not contain any useful information about the individual's genome
    - No information about the individuals' genomes is revealed in the process of identifying relatives
- Framework
  - Three individuals are used as the example
    - Each of them has 24 SNPs
  - 4 phases: data preprocessing, data encryption, building the secure genomic sketch (SGS) value, identification based on the SGS
- Data preprocessing
  - Set the thresholds that distinguish relatives of the first generation, the second generation, and so on
    - If two individuals are relatives, they share a number of identical DNA fragments
  - Use SNPs to express the genomic data (IBD area, reference)
  - Divide the data into multiple segments
    - Each segment contains the same number of SNPs

Privacy Issue of DNA Sequence Data - Method 1

- Data encryption
  - Use "fuzzy encryption" to encrypt the data
    - Traditional encryption schemes: the key required for decryption must be identical to the key used in encryption
    - Fuzzy encryption schemes: the encryption key and decryption key only need to be similar
      - The degree of similarity corresponds to relatives of different generations
    - The encoded result is called the genome sketch (GS)
  - Five steps of construction
    1. Convert the haplotype values of each segment into a pair of binary numbers
       - Each digit represents a SNP position in the segment
       - "0" represents the major allele and "1" represents the minor allele
    2. Represent the segment number as a binary number
       - The number is given by the position of the segment in the original haplotype

Privacy Issue of DNA Sequence Data - Method 1

- Data encryption
  - Five steps of construction (continued)
    3. Map the segment's binary value to a hash value with a hash function
       - Concept: reduce collisions and guarantee that segments differing in even one position get totally different hash values
       - One way to choose the hash function: add the sequence binary value and the segment number binary value together, and keep a fixed number of tail bits as the hash value of the segment
         - In our example, the length of each binary unit is 3
       - Every segment is assigned its own binary value
    4. Delete the duplicate binary values
       - The remaining binary codes are viewed as the genome sketch (GS) of the current genome
       - Relative identification: if 3/4 of the GS unit values match between two individuals, we consider them relatives
    5. Present the full GS of an individual as a vector of size 2^k
       - k is the number of bits in each sketch value, so there are 2^k possible sketch values
         - In our case the length is 2^3 = 8
       - Each position in the vector corresponds to a potential sketch element
         - The vector has a "1" if the individual's GS contains that element and a "0" otherwise

Privacy Issue of DNA Sequence Data - Method 1

[Figure: fuzzy encryption (GS: genomic sketch)]

Privacy Issue of DNA Sequence Data - Method 1

- Advantages of the GS vector
  - It is hard for others to convert the hash value back to the original genomic value
    - A hash function is used to build the value
  - It saves much memory and disk space compared with the original sequence
  - It is used later for the comparison of genomic data
- Build the secure genomic sketch (SGS) value
  - Randomly choose one row of the ECC matrix and add it to the GS value
  - For the ECC matrix
    - The column dimension must equal the length of the GS code
  - The SGS value is what is finally published to the public database
    - This value comes from a random combination
    - It cannot be used for any application other than relative identification

Privacy Issue of DNA Sequence Data - Method 1

[Figure: building the secure genomic sketch (SGS) value]

Privacy Issue of DNA Sequence Data - Method 1

- Identification based on the SGS
  - Use our own GS value and the public SGS value for relative identification
  - Subtract our GS value from the SGS value, and treat the result as the inquiry code
  - Compare the inquiry code with each row of the ECC matrix
    - Choose the difference threshold that identifies the relationship between two individuals

Privacy Issue of DNA Sequence Data - Method 1

- Some aspects to note
  - In real cases, the dimension of the coding becomes very large
    - Since the genome is very long, the GS code is very long as well
    - The ECC matrix will have width 2^24
  - Computational complexity increases in both encoding and decoding of GSs
    - There is a much larger number of segments and sketch elements
  - The method uses an improved version of the Juels-Sudan construction to scale to the genome
- Major contributions of this method
  - Provides an efficient way of encoding the genome based on SNP points
  - Provides an efficient way of decoding based on the set of GS
  - Provides a method to create the SGS from the GS, and to detect relationships based on the GS and SGS codes

Privacy Issue of DNA Sequence Data - Method 2

- Compared with Method 1
  - Requires both the individual genomic data and the reference data
  - Different data is used for relationship detection
    - Considers all the variants rather than just the IBD area
- Main idea
  - Fuzzy extractor vs. traditional encryption and decryption protocol
    - Creates both a private key and a public key for the genomic data
  - Build the genomic sketch (GS): hash function
  - Build the secure genomic sketch (SGS)
    - Based on list decoding (not an ECC matrix)
  - Find relatives: based on the private key and the public key
    - Very similar to the previous method

Future Work

- Let the public key contain more types of feature information, so that relationships across more generations can be identified
- How to store those public keys more efficiently without privacy issues
- How to balance space complexity and time complexity at the algorithm level
  - Currently, if all types of information are saved in the public key, the code becomes too long to compute during comparison
- "Advanced" RBAC models
  - Contextual RBAC: contextual role-based access control authorization model
  - SitBAC: situation-based access control model
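
Supplementary sketch: permission model

The permission-model slide relates users, roles, permissions, and an SSD constraint. The following is a minimal Python sketch of that idea, not an implementation of any standard. The class name `RBAC`, its method names, and the role and permission strings are all hypothetical, and SSD is assumed to mean "a user may hold at most one role from each exclusive set". Sessions and DSD are left out.

```python
class RBAC:
    """Toy role-based access control with a static separation-of-duty check."""

    def __init__(self):
        self.user_roles = {}  # user -> set of role names
        self.role_perms = {}  # role -> set of permission names
        self.ssd_sets = []    # each entry: a frozenset of mutually exclusive roles

    def grant(self, role, permission):
        """Attach a permission to a role."""
        self.role_perms.setdefault(role, set()).add(permission)

    def add_ssd(self, roles):
        """Declare roles that one user may not hold together (assumed SSD meaning)."""
        self.ssd_sets.append(frozenset(roles))

    def assign(self, user, role):
        """Assign a role to a user unless it violates an SSD set."""
        proposed = self.user_roles.get(user, set()) | {role}
        for exclusive in self.ssd_sets:
            if len(proposed & exclusive) > 1:
                raise ValueError(f"SSD violation for {user}: {sorted(proposed & exclusive)}")
        self.user_roles[user] = proposed

    def check(self, user, permission):
        """Return True if any of the user's roles carries the permission."""
        return any(
            permission in self.role_perms.get(role, set())
            for role in self.user_roles.get(user, set())
        )


# Hypothetical roles for a genomic data repository.
rbac = RBAC()
rbac.grant("analyst", "read_genotype")
rbac.grant("curator", "link_identity")
rbac.add_ssd({"analyst", "curator"})
rbac.assign("alice", "analyst")
print(rbac.check("alice", "read_genotype"))
print(rbac.check("alice", "link_identity"))
```

Keeping the analyst and curator roles mutually exclusive is one way to express that the person who analyzes genotypes should not also be able to link them back to identities.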
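
Supplementary sketch: genome sketch (GS) construction in Method 1

The five construction steps of Method 1 can be illustrated with a small Python sketch. It makes several simplifying assumptions that are not in the slides: a single haplotype per person instead of a pair, segments numbered from 0, and a match score defined as the shared elements divided by the larger sketch size. The toy hash follows the slide (add the segment value to the segment number and keep the tail bits), so it collides far more often than a real hash function would. All function and variable names are hypothetical.

```python
SEGMENT_LEN = 3  # SNPs per segment (assumed; the slides use 3-bit units)
HASH_BITS = 3    # k tail bits kept per hash value, so 2**3 = 8 possible values


def split_segments(haplotype, seg_len=SEGMENT_LEN):
    """Divide a 0/1 haplotype into segments with the same number of SNPs."""
    if len(haplotype) % seg_len != 0:
        raise ValueError("haplotype length must be a multiple of the segment length")
    return [haplotype[i:i + seg_len] for i in range(0, len(haplotype), seg_len)]


def segment_hash(segment, index, k=HASH_BITS):
    """Toy hash: segment value plus segment number, keeping the k tail bits."""
    value = int("".join(str(bit) for bit in segment), 2)
    return (value + index) % (1 << k)


def genome_sketch(haplotype, seg_len=SEGMENT_LEN, k=HASH_BITS):
    """Hash every segment and drop duplicates; the resulting set is the GS."""
    segments = split_segments(haplotype, seg_len)
    return {segment_hash(seg, i, k) for i, seg in enumerate(segments)}


def sketch_vector(sketch, k=HASH_BITS):
    """Represent the GS as a 0/1 vector of size 2**k."""
    return [1 if value in sketch else 0 for value in range(1 << k)]


def shared_fraction(sketch_a, sketch_b):
    """Assumed match score: shared elements over the larger sketch size."""
    return len(sketch_a & sketch_b) / max(len(sketch_a), len(sketch_b))


# 24 SNPs per person, 0 = major allele, 1 = minor allele (made-up values).
person_a = [0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1]
person_b = person_a[:-2] + [1, 1]  # differs from person_a in one SNP

gs_a = genome_sketch(person_a)
gs_b = genome_sketch(person_b)
print(sorted(gs_a))                       # [0, 1, 3, 5]
print(sketch_vector(gs_a))                # [1, 1, 0, 1, 0, 1, 0, 0]
print(shared_fraction(gs_a, gs_b) >= 0.75)
```

With these made-up inputs the two sketches share four of five elements, which is above the 3/4 threshold mentioned on the slide.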
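
Supplementary sketch: building the SGS and identifying relatives in Method 1

The SGS and identification slides describe adding a random row of the ECC matrix to the GS vector, then comparing an inquiry code against every row. The Python sketch below illustrates this under stated assumptions: the addition is taken to be bitwise XOR (addition over GF(2), where adding and subtracting coincide), the four-row `CODEBOOK` is a hypothetical stand-in for the ECC matrix, and the threshold of 1 differing bit is arbitrary. It is far too small to offer any real protection.

```python
import secrets

# Hypothetical toy codebook standing in for the rows of the ECC matrix.
# Each row has the same length as the GS vector (8) and rows differ in >= 4 bits.
CODEBOOK = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
]


def xor_bits(a, b):
    """Bitwise XOR of two equal-length 0/1 vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have equal length")
    return [x ^ y for x, y in zip(a, b)]


def make_sgs(gs_vector, codeword=None):
    """Mask a GS vector with a codeword; a random row is used if none is given."""
    if codeword is None:
        codeword = secrets.choice(CODEBOOK)
    return xor_bits(gs_vector, codeword)


def distance_to_code(inquiry):
    """Smallest number of differing bits between the inquiry code and any row."""
    return min(sum(xor_bits(inquiry, row)) for row in CODEBOOK)


def looks_related(own_gs_vector, public_sgs, threshold=1):
    """Remove our own GS from the public SGS and test closeness to a codeword."""
    inquiry = xor_bits(public_sgs, own_gs_vector)
    return distance_to_code(inquiry) <= threshold


# Made-up GS vectors of length 2**3 = 8.
gs_owner = [1, 1, 0, 1, 0, 1, 0, 0]
gs_relative = [1, 1, 0, 1, 0, 0, 0, 0]  # one sketch element differs
gs_stranger = [0, 1, 1, 0, 1, 0, 0, 1]

public_sgs = make_sgs(gs_owner, CODEBOOK[1])  # fixed row so the run is repeatable
print(looks_related(gs_relative, public_sgs))
print(looks_related(gs_stranger, public_sgs))
```

The published value alone does not equal the GS vector, but anyone whose own GS is close enough can recover a codeword from it, which is the property the identification step relies on.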