Selection of Multiple SNPs in Case-Control Association Study Using a Discretized Network Flow Approach
Shantanu Dutt, Yang Dai, Huan Ren, Joel Fontanarosa
University of Illinois at Chicago

Outline
• Background: Genome Wide Association Study
• Problem Definition
• Previous Work
• Our Work: MIP Formulations
• Discretized Network Flow (DNF) Opt. Method
• DNF Solutions for k-SNP Selection w/ Clustering/Classification
• Experimental Results
• Conclusions

Genetic Association Studies
Goal: Find markers of variation that reliably distinguish individuals with a disease from a healthy population
• Single Nucleotide Polymorphisms (SNPs) are the simplest and most common form of variation in the human genome.
• Each chromosome has one of two alleles for each SNP
• Possible Genotypes = {0/0, 0/1, 1/1}
• Variations measured at specific SNP loci have been shown to be associated with numerous traits and diseases.
Figure: Persons 1 to 3, each with chrom 1 and chrom 2 and a marked SNP

Genetic Association Studies (contd)
Genomic Variation -> Gene, Protein, or Cellular Alteration/Regulation -> Altered Phenotype
- Individual traits (e.g. height, hair color)
- Causal factors for disease
- Increased risk factors for complex disease
Images: pdb (ww.rcsb.org); Robbins and Cotran, 7th Ed 2005

Genetic Association Studies (contd)
• Complex traits cannot be mapped to a single genetic locus
• Multiple interacting genetic influences combine with environmental factors to produce an outcome
Figure: Gene Networks (A, B, ..., X) and Environment -> Disease

Genetic Association Studies (contd)
Genome Wide Association Study (GWAS):
• Measure a large number of SNPs (typically 500K-1M) across the genome in a large case-control study (often >1000 patients)
• Results are commonly reported based on individual χ2 values, ignoring potentially powerful interaction effects
• It remains an open computational and statistical challenge to reliably analyze epistasis, or gene-gene interactions, in large-scale GWAS.
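To make the single-SNP baseline on the GWAS slide concrete, here is a minimal sketch of a Pearson χ2 statistic for one SNP, computed from a 2x3 table of genotype counts (cases vs. controls over 0/0, 0/1, 1/1). The function name `chi_square` and the count layout are assumptions made for illustration; the slides do not specify how the χ2 values are computed.

```python
def chi_square(case_counts, control_counts):
    """Pearson chi-square statistic for one SNP.

    case_counts / control_counts: genotype counts [n_00, n_01, n_11]
    (hypothetical layout chosen for this sketch).
    """
    table = [list(case_counts), list(control_counts)]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for r, row in enumerate(table):
        for c, observed in enumerate(row):
            expected = row_totals[r] * col_totals[c] / n
            if expected > 0:  # skip genotypes absent from both groups
                stat += (observed - expected) ** 2 / expected
    return stat


if __name__ == "__main__":
    # One statistic per SNP; interactions between SNPs are not captured.
    print(chi_square([10, 20, 30], [30, 20, 10]))
```

Because each SNP is scored in isolation, two SNPs that only matter jointly can both receive a low statistic, which is the limitation the slide points out.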
Figure: different genetic variations, common complex disease

Problem Definition:
For a given set P of cases and Q of controls, classify the cases into different clusters and simultaneously select k significant marker SNPs for them (those that strongly distinguish these cases from the set Q)
In this paper, we present a new optimization technique called discretized network flow (DNF) for the above problem

Examples of Epistasis Methods
Combinatorial
• MDR = multifactor dimensionality reduction
• CSP = combinatorial search based prediction
• CPM = combinatorial partitioning method
Probabilistic
• BEAM = Bayesian Epistasis Association Mapping
  - Bayesian partitioning model resolved by Markov Chain Monte Carlo (MCMC) methods
• megaSNPhunter
  - Hierarchical learning algorithm (regression trees)
  - Primarily considers local interaction effects
MDR: Ritchie et al, Gen Epid, 2003
CSP: Brinza et al., WABI'06
CPM: Nelson et al, Genome Research, 2001
BEAM: Zhang and Liu, Nature Genetics, 2007
megaSNPhunter: Wan et al, BMC Bioinformatics, 2009

MDR
1. Divide data into training and testing sets
2. Select a set of N factors
3. If (affected/unaffected) > T (e.g. T = 1.0): high risk; otherwise: low risk
4. Select the model with the best misclassification error
5-6. Estimate the model prediction error using the testing data set.
Repeat these steps for each cross validation iteration, and for each possible combination of factors.
Adapted from Ritchie et al, Gen Epid, 2003

CSP: Combinatorial Methods for Disease Association Search and Susceptibility Prediction
• Risk/resistance factor: multi-SNP combination (MSC)
• Problem: Find all MSCs significantly associated with the disease
• Cluster C: subset of S with an MSC; S: the original SNP set
• d(C): # of diseased, h(C): # of non-diseased
• Combinatorial search
Definition: The disease-closure of a multi-SNP combination C is a multi-SNP combination C', with the maximum number of SNPs, which consists of the same set of disease individuals and the minimum number of non-disease individuals.
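The disease-closure definition can be illustrated with a small sketch. It assumes individuals are tuples of genotype codes (one per SNP) and an MSC is a dict mapping SNP index to a required genotype; these representations and the function names are hypothetical, not taken from the CSP paper. The closure adds every SNP-genotype pair shared by all diseased individuals in the cluster, so d(C) is unchanged while h(C) can only shrink.

```python
def cluster(msc, people):
    """Individuals whose genotypes satisfy every (snp -> genotype) in msc."""
    return [p for p in people if all(p[i] == g for i, g in msc.items())]


def disease_closure(msc, cases):
    """Largest MSC matched by exactly the same diseased individuals as msc."""
    diseased = cluster(msc, cases)
    if not diseased:
        return dict(msc)  # nothing to close over
    first = diseased[0]
    # Keep every SNP position on which all diseased members agree.
    return {i: first[i] for i in range(len(first))
            if all(p[i] == first[i] for p in diseased)}


if __name__ == "__main__":
    cases = [(0, 1, 2), (0, 1, 1), (1, 1, 2)]
    controls = [(0, 0, 2), (0, 1, 0), (1, 1, 1)]
    c = {0: 0}
    c_closed = disease_closure(c, cases)
    print(c_closed)                                   # closed MSC
    print(len(cluster(c, cases)), len(cluster(c_closed, cases)))        # d(C), d(C')
    print(len(cluster(c, controls)), len(cluster(c_closed, controls)))  # h(C), h(C')
```

In this toy data the closure keeps both diseased individuals while dropping one matching control, which is why searching only closed clusters loses nothing.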
• Searches only closed clusters
  - Closure of cluster C = C', where d(C') = d(C) and h(C') is minimized
  - Avoids checking of trivial MSCs
  - Small d(C) implies not looking in subclusters
• Finds associated MSCs faster, but is still too slow
• Tagging: compress the SNP set by extracting the most informative SNPs
  - restore other SNPs from tag SNPs
  - multiple regression method for tagging
Brinza, D., Zelikovsky, WABI'06

Our Work: MIP Formulation
Notations:
• $p_{i,j}(x)$ ($0 \le j \le 2$): $= 1$ if allele j is present on SNP i for individual x; $= 0$ otherwise.
• Marker $m_{i,j}^{val}$ (val = 0, 1): $m_{i,j}^{1}$ means presence of allele j in SNP i; $m_{i,j}^{0}$ means absence of allele j in SNP i
• Per-case benefit function of SNP i and allele j:
  $b_{i,j}(x) = \left| p_{i,j}(x) - \frac{\sum_{z=1}^{n_c} p_{i,j}(z)}{n_c} \right|$, where $n_c$ is the # of controls
Claim: $b_{i,j}(x)$ is consistent with the specificity provided by selecting marker $m_{i,j}^{p_{i,j}(x)}$
• When $p_{i,j}(x) = 1$: higher $b_{i,j}(x)$ means a lower fraction of non-patients have $p_{i,j} = 1 = p_{i,j}(x)$, i.e. a higher fraction of non-patients have $p_{i,j} = 0 \ne p_{i,j}(x)$
• When $p_{i,j}(x) = 0$: higher $b_{i,j}(x)$ means a higher fraction of non-patients have $p_{i,j} = 1 \ne p_{i,j}(x)$

MIP Formulation
Benefit-based case-pair similarity metric:
$s(x, y, m_{i,j}^{val}) = (b_{i,j}(x))^{\alpha} + (b_{i,j}(y))^{\alpha}$ if $p_{i,j}(x) = p_{i,j}(y) = val$
Otherwise, $m_{i,j}^{val}$ is not a common marker for patients x and y.
MIP formulation for selecting one marker set for all patients:
MAX: $\sum_{m_{i,j}^{val}} \sum_{1 \le x \le n_p} \sum_{1 \le y \le n_p} s(x, y, m_{i,j}^{val})\, d(m_{i,j}^{val})$
S.T.
$\sum_{j=1}^{3} \left( d(m_{i,j}^{0}) + d(m_{i,j}^{1}) \right) \le 1 \quad \forall$ SNPs i
$\sum_{i,j} \left( d(m_{i,j}^{0}) + d(m_{i,j}^{1}) \right) \le k$
• $d(m_{i,j}^{val}) = 1$ if marker $m_{i,j}^{val}$ is selected; $n_p$ is the # of patients/cases
• At most k markers will be selected
• Linear MIP; it can be solved with commercial tools such as CPLEX/LINGO. However, this is very time consuming.
• The similarity definition ensures that only common markers among patients will be selected.

MIP Formulation (contd)
Issue 1: The genetic causes of a disease can differ between patient sets (e.g., w/ different ethnicity).
Hence, selecting only one marker set is not appropriate (it artificially forces one marker set on the entire patient population).
Solution: Simultaneously cluster patients and select different markers for different clusters
MAX: $\sum_{1 \le g \le G} \sum_{m_{i,j}^{val}} \sum_{1 \le x \le n_p} \sum_{1 \le y \le n_p} s(x, y, m_{i,j}^{val})\, b_x^g\, b_y^g\, d^g(m_{i,j}^{val})$
S.T.
$\sum_{j=1}^{3} \left( d^g(m_{i,j}^{0}) + d^g(m_{i,j}^{1}) \right) \le 1 \quad \forall$ SNPs i and clusters g
$\sum_{1 \le g \le G} b_x^g = 1 \quad \forall x$
$\sum_{i,j} \left( d^g(m_{i,j}^{0}) + d^g(m_{i,j}^{1}) \right) \le k$
• $b_x^g$: 1 if x is in cluster g; $d^g(m_{i,j}^{val})$: 1 if marker $m_{i,j}^{val}$ is selected for cluster g. At most G clusters will be generated.
• Cubic MIP!

MIP Formulation (contd)
Issue 2: the sum of benefits is not consistent with the specificity of a set of markers
Essentially, the previous formulation will select the five common markers with the highest benefit. However, this is not optimal.
Figure: control set with the mismatch sets of markers 1 to 4
• Individually, markers 1 and 2 provide larger specificity than markers 3 and 4 (they mismatch more controls).
• However, the mismatch sets of markers 1 and 2 have a larger overlap.
• Selecting markers 3 and 4 as the marker set gives overall higher specificity

MIP Formulation (contd)
Adding accurate specificity terms to the obj. func. for each control z: $\prod_{1 \le i \le G} (1 - M_i(z))\, g_{mis}$
• $M_i(z)$: whether control z matches the marker set selected for cluster i; $M_i(z)$ is the mod 2 addition (Boolean OR) of various 0/1 vars
• $g_{mis}$: objective function gain for mismatching a control.
Final objective function:
MAX: $\sum_{1 \le g \le G} \sum_{m_{i,j}^{val}} \sum_{1 \le x \le n_p} \sum_{1 \le y \le n_p} s(x, y, m_{i,j}^{val})\, b_x^g\, b_y^g\, d^g(m_{i,j}^{val}) + \sum_{1 \le z \le n_c} \prod_{1 \le i \le G} (1 - M_i(z))\, g_{mis}$
• At least cubic MIP (if G <= 3)
• $g_{mis}$ is determined so that specificity and sensitivity are given the same weight.
• Average gain for a patient matching a marker set: $2 k\, b_{avg}^{\alpha}\, (n_p/G)$, where $n_p$ is the number of patients, and G is the number of groups.
• $g_{mis} = 2 k\, b_{avg}^{\alpha}\, (n_p/G) \cdot n_p/n_c$

Discretized Network Flow (DNF)
Standard min-cost network flow
• Find a min cost way to send a certain amount of flow from the source node (S) to the sink node (T).
• Solves certain LP problems (continuous solns)
Figure: example network from S to T with f=1, arcs labeled (capacity, cost) such as (2,0), (1,4), (1,1), (1,2), one MEA set, a valid flow and an invalid flow
• Some discrete constraints have to be satisfied in order to solve discrete opt. problems like MIP
• One such constraint: Mutually exclusive arc set (MEA): at most one arc in this set can have flow on it.

Discretized Network Flow (contd)
Satisfying MEA requirements
• Add a flow-amount-independent cost C' to each arc in the set
• A constant C' cost is incurred whenever there is flow on the arc
Figure: arc cost c versus flow f (up to Cap(e)), for the standard linear flow cost and for the cost with C'; MEA sets with C' on each arc
• C'inv: total C'-related cost for an invalid flow
• C'val: total C'-related cost for a valid flow
• C'inv ≥ C'val + C'

Discretized Network Flow (contd)
Determining C':
• In the standard network flow graph, heuristically select a valid flow and determine its cost Cval
• Obtain a min-cost flow of cost Cinvmin w/o discretization constraints
• Set C' = Cval - Cinvmin + 1
Figure: flow costs without C' (Cinv, Cvalmin, Cval) and with C' (Cvalmin + C'val, Cval + C'val, Cinv + C'inv)
• Since C'inv ≥ C'val + C', any invalid flow costs at least Cinvmin + C'val + C' = Cval + C'val + 1, which exceeds the cost Cval + C'val of the heuristic valid flow. Hence a valid flow is guaranteed to have a smaller cost than any invalid flow.
• Theorem [Ren et al., ICCAD'08]: A min-cost flow with C'-costs on MEA arcs ensures MEA satisfaction

Discretized Network Flow (contd)
• Discretized network flow has been applied to VLSI CAD problems [Ren et al., ICCAD'08], [Ren et al., IWLS'08], [Dutt et al., ICCAD'06]
• Good run time and scalability. At least 10x to 60x faster than CPLEX with similar quality
• Example: determine optimal cell sizes in a circuit under an area constraint
  - Four sizes available. The number of 0/1 variables is about four times the number of cells considered.
Figure (y-axis: run time (secs); linear fit y = 0.3823x + 8.5251): Run time vs.
the number of cells from [Ren et al., IWLS'08] (x-axis: # of cells considered)

DNF Model for Single-Cluster Marker Selection
Figure: network from S to T. S feeds patient structures P1 ... Pm with total flow f=np (arcs labeled (cap, cost) = (np,0)); each patient structure Px receives f=1 and contains nodes p1,1, p1,2, p1,3, ..., pN,1, ..., pN,3 grouped in MEA sets; two copies of P1 ... Pm form a complete bi-partite graph with meta arcs; flow f=np*k leaves through arcs labeled (np*k,0)
• Arc between the pi,j nodes of Px and Py: (cap, cost) = (1, -s(x,y,pi,j)) if ci,j(x) = ci,j(y); no connection otherwise
• MEA: only k arcs can have flow
• Flow through a pi,j node in Px means $d(m_{i,j}^{p_{i,j}(x)}) = 1$
• Pairwise connection between pi,j nodes ensures that the same marker set is selected for all Px
• The flow cost incurred for selecting a common marker between two patients is: $-s(x, y, m_{i,j}^{p_{i,j}(x)})$

Marker Selection for Multiple Clusters
Use multiple copies of the single cluster network model
Figure: S -> choice nodes -> Cluster 1 and Cluster 2 (each a complete bipartite graph over P1 to P4, with MEA sets) -> T
• Type 1 invalid flow: Flow puts P1 in both cluster 1 and 2
• Type 2 invalid flow: Flow thru P1 passes thru P2 that is not in the same cluster, incurring false costs.
• MEA prevents invalid flows
• Example valid flow: Puts patients {1,4} in cluster 1, and {3,2} in cluster 2.
• For G clusters, there will be G copies of the 2-level complete bipartite graph; not all G clusters may be formed

Marker Selection for Multiple Clusters
Issue: When G is large, the network flow graph becomes very complex
• We use iterative bi-partitioning instead
  - Much harder bi-part prob than standard bi-part; the bi-part criterion needs to be selected simultaneously w/ the bi-part!
• Condition for stopping the bipartitioning of a cluster: spec+sens deteriorates
Figure: iterative bi-partitioning tree; label: Final solution
Another run-time reduction technique: Patient pre-clustering
Group patients before using DNF, with a greedy iterative grouping method: initially, each patient is a subgroup. Each time, merge the two subgroups with the most common SNP-allele pairs. Termination condition: patients in one group must have at least 70% SNP-allele pairs in common.
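The greedy pre-clustering rule (start from singletons, merge the two subgroups sharing the most SNP-allele pairs, stop once a merge would violate the 70% threshold) might be sketched as follows. This is an illustration under assumptions: patients are tuples of genotype codes, a "common SNP-allele pair" is read as a SNP position on which every member of the group has the same genotype, and the names `precluster` and `min_share` are hypothetical.

```python
from itertools import combinations


def common_positions(members, patients):
    """SNP positions on which all listed patients have the same genotype."""
    first = patients[members[0]]
    return {i for i in range(len(first))
            if all(patients[m][i] == first[i] for m in members)}


def precluster(patients, min_share=0.7):
    """Greedy agglomerative grouping; returns lists of patient indices."""
    groups = [[i] for i in range(len(patients))]
    n_snps = len(patients[0])
    while len(groups) > 1:
        # Find the pair of subgroups with the most common positions.
        best = None
        for a, b in combinations(range(len(groups)), 2):
            shared = len(common_positions(groups[a] + groups[b], patients))
            if best is None or shared > best[0]:
                best = (shared, a, b)
        shared, a, b = best
        if shared / n_snps < min_share:
            break  # even the best merge would violate the threshold
        groups[a] = groups[a] + groups[b]
        del groups[b]  # safe: combinations guarantees a < b
    return groups


if __name__ == "__main__":
    patients = [
        (0,) * 10,
        (0,) * 9 + (1,),
        (2,) * 10,
        (2,) * 8 + (1, 1),
    ]
    print(precluster(patients))
```

The pairwise scan is quadratic in the number of subgroups per merge, which is acceptable for a sketch but would need an index or cached similarities for larger patient sets.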
Each group is taken as a "meta patient" in DNF. Groups are opened up after DNF, and metrics are evaluated at the individual level.
Figure labels (bi-partitioning tree): Meet termination condition (on two leaves)

Chain Structure for Improving Specificity
Objective term: $\prod_{1 \le i \le G} (1 - M_i(z))\, g_{mis}$
Figure: chain structure for control z, from S to T, arcs labeled (cap, cost); MM chain with cost = -gmis, M chain with cost = 0; injection arcs A1 (1,0), A2, ..., Ag from Cluster 1, Cluster 2, ..., Cluster g; MEA sets along the chain
• One chain structure for each control.
• Two subchains: mismatched (MM) chain and matched (M) chain.
• One injection arc to the M subchain from each cluster: A1 ... Ag.
• Injection flow on arc Ai means z matches the selected marker set of cluster i (Mi(z) = 1).
• Chain flow stays on the MM chain if no injection arc has flow, and incurs a cost of -gmis
• Any injection flow causes the MEA condition to force the chain flow into the M chain, and it never switches back. Hence, it incurs 0 cost.

Experimental Results
Data sets used
• Crohn's disease: 144 cases, 243 controls and 103 SNPs
• Autoimmune disorder: 384 cases, 652 controls and 108 SNPs
• Tick-borne encephalitis: 21 cases, 54 controls and 41 SNPs
• Rheumatoid arthritis: 460 cases, 460 controls and 2300 SNPs
• Lung cancer: 322 cases, 273 controls and 141 SNPs
• Rheumatoid arthritis (large): 868 cases, 1194 controls and 5000 SNPs
Prediction scheme with multiple cluster marker sets
Figure: prediction flow with two tests (Test 1 against marker set 1, Test 2 against marker set 2), branches labeled Match / Mismatch, outcomes "Predict as sick" and "Predict as healthy"
• TP: correctly predicted as sick
• FP: falsely predicted as sick
• TN: correctly predicted as healthy
• FN: falsely predicted as healthy
• Sensitivity = TP/(FN+TP)
• Specificity = TN/(FP+TN)
• Accuracy = (TN+TP)/(FP+TN+FN+TP)
Machine configuration: 3 GHz CPU, 1 GB memory, Windows machine.

Experimental Results
Five-fold cross validation
K=10 results for Rheum. (large, no comparisons available): sens: 85; spec: 80; accuracy: 82; 10 clusters; 21.5 h per training run
Comparisons to MDR, specificity chart: data sets Autoimm., Crohn, Tick-borne, Lung Cancer, Rheum.; bar labels 78.1, 56.7, 81.9; annotation: 38% relatively.
Chart legend: MDR(k=5), DNF(k=5), DNF(k=10); last bar group: Avg. Panel title: Specificity.
Comparisons to MDR, sensitivity chart: data sets Autoimm., Crohn, Tick-borne, Lung Cancer, Rheum., Avg; series MDR(k=5), DNF(k=5), DNF(k=10); bar labels 87.6, 48.8, 88.4; annotation: 79% relatively.

# of clusters   K=5   K=10
Autoimm.         12    16
Crohn            12    16
Tick-borne        6     6
Lung cancer      14    16
Rheum.           13    14

Experimental Results
Comparisons to CSP [Brinza commun. 4/09, Brinza et al., WABI'06 ppt: http://www.cs.ucsd.edu/~dbrinza/cv/present/brinza_wabi06.ppt]
Leave-one-out validation
• For DNF, 20 runs are performed with randomly chosen left-out individuals
• CSP performs n runs for n individuals (cases+controls)
Figures (series DNF(k=10), CSP; data sets Autoimm., Crohn, Tick-borne, Avg): Specificity; Sensitivity; Geometric mean of sens. and spec.; Run time (ksecs, per leave-out run)
Recoverable bar labels and annotations: 83.1 (2.4% relatively); 96.6 and 71.1 (36% relatively); 90.6 and 76.8 (18% relatively, geometric mean); 3k and 24k (8 times, run time)

Experimental Results
Leave-one-out validation
Figure: Accuracy (series DNF(k=10), CSP; data sets Autoimm., Crohn, Tick-borne, Avg); bar labels 76.6 and 90.8; annotation: 19% relatively

Average number of clusters
Autoimm.       18
Crohn          16
Tick-borne      6
Lung cancer    17
Rheum.         14

Experimental Results
Comparing to LINGO (<= 20% from optimal setting)
The same MIP formulation is solved by LINGO, and we compare the MIP objective function value and run time with DNF.
MAX: $\sum_{1 \le g \le G} \sum_{m_{i,j}^{val}} \sum_{1 \le x \le n_p} \sum_{1 \le y \le n_p} s(x, y, m_{i,j}^{val})\, b_x^g\, b_y^g\, d^g(m_{i,j}^{val}) + \sum_{1 \le z \le n_c} \prod_{1 \le i \le G} (1 - M_i(z))\, g_{mis}$
Figures (data sets Autoimm., Crohn, Tick-borne, Lung Cancer, Rheum., Avg): Bi-p normalized quality (DNF is 1, the larger the better); Quad-p normalized quality (DNF is 1). Recoverable bar labels: 0.98, 0.94, 15.
Further bar labels: 0.96, 0.95, 0.96.
Figures (data sets Autoimm., Crohn, Tick-borne, Lung Cancer, Rheum., Avg): Bi-p normalized run time (DNF is 1, smaller is better); Quad-p normalized run time (DNF is 1, smaller is better), bar label 23.
Comparisons are for 1 iteration of bi-partitioning and quad-partitioning (i.e. G=2,4)

Experimental Results
Run time vs. number of SNPs
• Rheumatoid arthritis data set is used
• Randomly chosen 100, 200, 400, 800, 1600, 2300 SNPs
Figure: run time (ksec) vs. # of SNPs; fit y = 8.8x + 3658
Run time vs. number of patients
• Crohn's disease data set is used
• No patient pre-clustering.
• Randomly chosen 30, 60, 90, 120, 144 patients from the data set
Figure: run time (ksec) vs. # of patients; fit y = 0.65x^2 - 3x + 135

Conclusions
• We proposed 0/1 non-linear MIP formulations to identify disease markers.
• We consider patient clustering to identify the most appropriate marker sets
• The discretized network flow (DNF) method is used to efficiently solve the MIP formulations.
• A chain structure is used for improving specificity
• Significant improvements compared to MDR and CSP
• Also much faster run times
• DNF can be applied to other computationally challenging bioinfo problems since:
  - DNF can efficiently & near-optimally solve polynomial and Boolean MIPs
  - DNF can also efficiently & near-optimally solve other discrete optimization problems

Appendix: Generating Injection Flow
Figure: MM chain and M chain with arcs labeled (cap, cost): (1,C'), (1,0), (2,C'), (1,-inf); injection arc $A_k$ and complementary arc $\bar{A}_k$ coupled by a draining arc to T; $m_{i,j}^{val}$ nodes of cluster k that mismatch NPz
• First, a complementary injection flow is generated on a complementary arc $\bar{A}_k$, which is 1 if any mismatched marker for NPz is selected
• $A_k$ and $\bar{A}_k$ are coupled by a draining arc.
• If there is flow on $\bar{A}_k$: flow towards $A_k$ is shunted to the sink
• If there is no flow on $\bar{A}_k$: flow is drained through $A_k$ and causes injection flow to the chain
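The Boolean logic that the chain structure and the injection arcs encode can be sketched outside the flow network. The sketch below assumes individuals are tuples of genotype codes and a marker is a triple (snp, genotype, val), with val = 1 requiring presence and val = 0 requiring absence; these representations and all function names are hypothetical. A control earns the gain g_mis only if it mismatches the selected marker set of every cluster, mirroring the term prod_i (1 - M_i(z)) * g_mis.

```python
def p(individual, snp, genotype):
    """Indicator p_{i,j}(x): 1 if the individual has this genotype at the SNP."""
    return 1 if individual[snp] == genotype else 0


def matches_marker_set(individual, marker_set):
    """M_k(z): True if the individual agrees with every selected marker."""
    return all(p(individual, snp, g) == val for snp, g, val in marker_set)


def complementary_injection(control, marker_set):
    """Flow on the complementary arc: 1 if any selected marker mismatches."""
    return 0 if matches_marker_set(control, marker_set) else 1


def specificity_gain(controls, cluster_marker_sets, g_mis):
    """Sum over controls of prod_k (1 - M_k(z)) * g_mis."""
    gain = 0.0
    for z in controls:
        injections = [1 - complementary_injection(z, ms)
                      for ms in cluster_marker_sets]
        if not any(injections):  # chain flow stays on the MM chain
            gain += g_mis
    return gain


if __name__ == "__main__":
    controls = [(0, 1), (2, 1), (0, 2)]
    cluster_marker_sets = [[(0, 0, 1)], [(1, 2, 1)]]
    print(specificity_gain(controls, cluster_marker_sets, g_mis=2.5))
```

In the network the same effect is obtained as a cost of -gmis on the MM chain, since the flow formulation minimizes cost while the MIP maximizes gain.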