Selection of Multiple SNPs in Case-Control Association Study Using a
Discretized Network Flow Approach
Shantanu Dutt, Yang Dai
Huan Ren, Joel Fontanarosa
University of Illinois at Chicago
Outline
 Background: Genome Wide Association Study
 Problem Definition
 Previous Work
 Our Work:
 MIP Formulations
 Discretized Network Flow (DNF) Opt. Method
 DNF Solutions for k-SNP Selection w/
Clustering/Classification
 Experimental Results
 Conclusions
Genetic Association Studies
 Goal: Find markers of variation that reliably distinguish
individuals with a disease from a healthy population
 Single Nucleotide Polymorphisms (SNPs) are the
simplest and most common form of variation in the
human genome.
Each chromosome has one of two alleles for each SNP
Possible Genotypes = {0/0, 0/1, 1/1}
Variations measured at specific SNP loci have been shown
to be associated with numerous traits and diseases.
[Figure: SNP positions on chrom 1 and chrom 2 for Person 1, Person 2, and Person 3]
Genetic Association Studies (contd)
Genomic Variation → Gene, Protein, or Cellular Alteration/Regulation → Altered Phenotype
- Individual traits (e.g., height, hair color)
- Causal factors for disease
- Increased risk factors for complex disease
Images: PDB (www.rcsb.org); Robbins and Cotran, 7th Ed., 2005
Genetic Association Studies (contd)
 Complex traits cannot be mapped to a single
genetic locus
Multiple interacting genetic influences combine with
environmental factors to produce an outcome
[Figure: gene networks (genes A, B, ..., X) and environment jointly leading to disease]
Genetic Association Studies (contd)
 Genome Wide Association Study (GWAS):
 Measure a large number of SNPs (typically 500K-1M) across the
genome in a large case-control study (often >1000 patients)
 Results are commonly reported based on per-SNP χ² values, ignoring potentially powerful interaction effects
It remains an open computational and statistical
challenge to reliably analyze epistasis, or gene-gene
interactions, in large-scale GWAS.
 Different genetic variations → common complex disease
 Problem Definition: For a given set P of cases and Q of
controls, classify the cases into different clusters and
simultaneously select k significant marker SNPs for them
(those that strongly distinguish these cases from the set Q)
 In this paper, we present a new optimization technique
called discretized network flow (DNF) for the above
problem
Examples of Epistasis Methods
 Combinatorial
 MDR = multifactor dimensionality reduction
 CSP = combinatorial search based prediction
 CPM = combinatorial partitioning method
 Probabilistic
 BEAM = Bayesian Epistasis Association Mapping
 Bayesian partitioning model resolved by Markov Chain Monte Carlo
(MCMC) methods
 megaSNPhunter
 Hierarchical learning algorithm (regression trees)
 Primarily considers local interaction effects
MDR: Ritchie et al, Gen Epid, 2003
CSP: Brinza et al., WABI’06
CPM: Nelson et al, Genome Research, 2001
BEAM: Zhang and Liu, Nature Genetics, 2007
megaSNPhunter: Wan et al, BMC Bioinformatics, 2009
MDR
1. Divide data into training and testing sets
2. Select a set of N factors
3. For each genotype combination: if (affected/unaffected) > T (e.g., T = 1.0), label it high risk; otherwise low risk
4. Select model with best misclassification error
5-6. Estimate the model prediction error using the testing data set.
Repeat these steps for each cross validation iteration, and for each
possible combination of factors.
Adapted from Ritchie et al, Gen Epid, 2003
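The MDR steps above can be sketched in a few lines of Python. This is a minimal illustration, not the reference MDR implementation: the function names, the list-of-lists genotype layout, and the choice to treat genotype cells unseen in training as low risk are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def mdr_label_cells(genotypes, status, factors, threshold=1.0):
    """Step 3: label each multi-locus genotype cell high risk (True) or low risk (False)."""
    cases, controls = Counter(), Counter()
    for row, affected in zip(genotypes, status):
        cell = tuple(row[f] for f in factors)
        (cases if affected else controls)[cell] += 1
    labels = {}
    for cell in set(cases) | set(controls):
        # Same test as affected/unaffected > T, written without a division
        # so that a cell with cases but no controls counts as high risk.
        labels[cell] = cases[cell] > threshold * controls[cell]
    return labels

def mdr_error(genotypes, status, factors, labels):
    """Misclassification rate of a labeled model on a data set."""
    wrong = 0
    for row, affected in zip(genotypes, status):
        cell = tuple(row[f] for f in factors)
        predicted = labels.get(cell, False)  # assumption: unseen cells are low risk
        wrong += predicted != bool(affected)
    return wrong / len(status)

def mdr_best_model(train, test, n_factors, threshold=1.0):
    """Steps 2-6 for one train/test split: pick the factor set with the lowest
    training error and report its error on the testing set."""
    (g_tr, s_tr), (g_te, s_te) = train, test
    best = None
    for factors in combinations(range(len(g_tr[0])), n_factors):
        labels = mdr_label_cells(g_tr, s_tr, factors, threshold)
        err = mdr_error(g_tr, s_tr, factors, labels)
        if best is None or err < best[0]:
            best = (err, factors, labels)
    _, factors, labels = best
    return factors, mdr_error(g_te, s_te, factors, labels)
```

In a full run this function would be called once per cross-validation split; the exhaustive loop over factor combinations is what makes MDR expensive as the number of SNPs grows.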
CSP: Combinatorial Methods for Disease Association
Search and Susceptibility Prediction



 Risk/resistance factor → multi-SNP combination (MSC)
 Problem: Find all MSCs significantly associated with the disease
 Cluster C: subset of S with an MSC; S: the original SNP set
 d(C): # of diseased, h(C): # of non-diseased
 Combinatorial search
Definition: Disease-closure of a multi-SNP combination C is a multi-SNP combination
C’, with maximum number of SNPs, which consists of the same set of disease
individuals and minimum number of non-disease individuals.
 Searches only closed clusters
 Closure of cluster C = C’
 d(C’)=d(C) and h(C’) is minimized
 Avoids checking of trivial MSCs
 Small d(C) implies not looking in subclusters
 Finds associated MSCs faster, but is still too slow
 Tagging:
 compress the SNP set by extracting most informative SNPs
 restore other SNPs from tag SNPs
 multiple regression method for tagging
Brinza and Zelikovsky, WABI’06
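The disease-closure definition above can be illustrated with a short Python sketch. It is a simplified reading of the definition, not the CSP authors' code: the names `cluster` and `disease_closure`, the dict representation of an MSC (SNP index to required value), and the genotype layout are assumptions made for this example.

```python
def cluster(msc, genotypes):
    """Indices of individuals whose genotypes satisfy every (SNP, value) pair of the MSC."""
    return [idx for idx, row in enumerate(genotypes)
            if all(row[snp] == val for snp, val in msc.items())]

def disease_closure(msc, genotypes, diseased):
    """Largest MSC shared by all diseased members of the cluster defined by msc.

    Adding every SNP value common to the diseased members keeps d(C) unchanged
    and can only remove non-diseased members, so h(C) is minimized.
    """
    sick = [i for i in cluster(msc, genotypes) if diseased[i]]
    if not sick:
        return dict(msc)  # nothing to close over
    closure = {}
    for snp in range(len(genotypes[0])):
        values = {genotypes[i][snp] for i in sick}
        if len(values) == 1:
            closure[snp] = values.pop()
    return closure
```

For example, with four individuals where only the first and third are diseased, closing the one-SNP combination `{0: 0}` adds the other SNP values those two share and drops the non-diseased individuals from the cluster.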
Our Work: MIP Formulation
 Notations:
 p_{i,j}(x) (0 ≤ j ≤ 2): = 1 if allele j is present on SNP i for individual x; = 0 otherwise.
 Marker m_{i,j}^{val} (val = 0, 1): m_{i,j}^{1} means presence of allele j in SNP i; m_{i,j}^{0} means absence of allele j in SNP i.
 Per-case benefit function of SNP i and allele j (n_c is the # of controls):
\[ b_{i,j}(x) = \left| \, p_{i,j}(x) - \frac{\sum_{z=1}^{n_c} p_{i,j}(z)}{n_c} \, \right| \]
 Claim
 b_{i,j}(x) is consistent with the specificity provided by selecting marker m_{i,j}^{p_{i,j}(x)}
 When p_{i,j}(x) = 1: larger b_{i,j}(x) ⇔ a lower fraction of non-patients have p_{i,j} = 1 = p_{i,j}(x) ⇔ a higher fraction of non-patients have p_{i,j} = 0 ≠ p_{i,j}(x)
 When p_{i,j}(x) = 0: larger b_{i,j}(x) ⇔ a higher fraction of non-patients have p_{i,j} = 1 ≠ p_{i,j}(x)
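As a small worked example of the per-case benefit, the sketch below computes b for one SNP-allele pair as the absolute difference between the case's indicator and the fraction of controls carrying the allele. The function name `benefit` and the list representation of the control indicators are assumptions made for illustration.

```python
def benefit(p_x, control_values):
    """b_{i,j}(x) = | p_{i,j}(x) - (sum over controls z of p_{i,j}(z)) / n_c |.

    p_x            : 0/1 indicator for the case x
    control_values : list of 0/1 indicators, one per control
    """
    n_c = len(control_values)
    return abs(p_x - sum(control_values) / n_c)

controls = [1, 0, 0, 0]          # hypothetical: 25% of controls carry the allele
carrier = benefit(1, controls)    # case carries the allele, which is rare among controls
non_carrier = benefit(0, controls)  # case lacks the allele, as most controls do
```

Here the carrier case gets the larger benefit because only a quarter of the controls share its value, which matches the claim that a larger benefit corresponds to a marker that mismatches more controls.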
MIP Formulation
 Benefit-based case-pair similarity metric
\[ s(x, y, m_{i,j}^{val}) = \begin{cases} (b_{i,j}(x))^{\alpha} + (b_{i,j}(y))^{\alpha} & \text{if } p_{i,j}(x) = p_{i,j}(y) = val \\ -\infty & \text{otherwise (indicating } m_{i,j}^{val} \text{ is not a common marker for patients } x \text{ and } y\text{)} \end{cases} \]
 MIP formulation for selecting one marker set for all patients:
\[ \text{MAX: } \sum_{m_{i,j}^{val}} \; \sum_{1 \le x \le n_p} \; \sum_{1 \le y \le n_p} s(x, y, m_{i,j}^{val}) \cdot d(m_{i,j}^{val}) \]
\[ \text{S.T. } \sum_{j=1}^{3} \left( d(m_{i,j}^{0}) + d(m_{i,j}^{1}) \right) \le 1 \quad \forall \text{ SNPs } i \]
\[ \sum_{i,j} \left( d(m_{i,j}^{0}) + d(m_{i,j}^{1}) \right) \le k \]
• d(m_{i,j}^{val}) = 1 if marker m_{i,j}^{val} is selected; n_p is the # of patients/cases
• At most k markers will be selected
• Linear MIP; it can be solved with commercial tools such as CPLEX/LINGO, but this is very time consuming.
• The similarity definition ensures that only markers common among patients will be selected.
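To make the single-marker-set formulation concrete, the sketch below evaluates the objective by brute force on a tiny instance. It is only an illustration of what the MIP optimizes, not a substitute for a MIP solver or for DNF. The data layout (each individual is a dict keyed by (SNP, allele)), the function names, the exponent `alpha` on the benefit terms, and the value of minus infinity for non-common markers are assumptions made for this example.

```python
import math
from itertools import combinations

def similarity(bx, by, px, py, val, alpha=1.0):
    """Pair similarity: benefits add up only when both cases match the marker value."""
    if px == py == val:
        return bx ** alpha + by ** alpha
    return -math.inf  # assumed penalty: marker is not common to x and y

def best_marker_set(cases, controls, k, alpha=1.0):
    """Enumerate marker sets with at most k markers and at most one marker per SNP."""
    keys = sorted(cases[0])
    n_c = len(controls)
    freq = {key: sum(c[key] for c in controls) / n_c for key in keys}

    def marker_gain(key, val):
        total = 0.0
        for x in cases:          # double sum over all ordered case pairs
            for y in cases:
                bx = abs(x[key] - freq[key])
                by = abs(y[key] - freq[key])
                total += similarity(bx, by, x[key], y[key], val, alpha)
        return total

    gains = {(key, val): marker_gain(key, val) for key in keys for val in (0, 1)}
    best_value, best_set = 0.0, ()
    for size in range(1, k + 1):
        for chosen in combinations(gains, size):
            if len({key[0] for key, _ in chosen}) < size:
                continue  # violates the one-marker-per-SNP constraint
            value = sum(gains[m] for m in chosen)
            if value > best_value:
                best_value, best_set = value, chosen
    return best_set, best_value
```

The enumeration grows combinatorially with the number of SNPs, which is the practical reason the slides turn to a network flow method instead.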
MIP Formulation (contd)
 Issue 1:
 Genetic reasons of a disease for diff. patient sets (e.g., w/
different ethnicity) can be different.
 Hence, selecting only one marker set is not appropriate (it
artificially forces one marker set on the entire patient pop).
 Solution: Simultaneously cluster patients and select
different markers for different clusters
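\[ \text{MAX: } \sum_{1 \le g \le G} \; \sum_{m_{i,j}^{val}} \; \sum_{1 \le x \le n_p} \; \sum_{1 \le y \le n_p} s(x, y, m_{i,j}^{val}) \cdot b_x^g \cdot b_y^g \cdot d^g(m_{i,j}^{val}) \]
\[ \text{S.T. } \sum_{j=1}^{3} \left( d^g(m_{i,j}^{0}) + d^g(m_{i,j}^{1}) \right) \le 1 \quad \forall \text{ SNPs } i \text{ and clusters } g \]
\[ \sum_{1 \le g \le G} b_x^g = 1 \quad \forall x \]
\[ \sum_{i,j} \left( d^g(m_{i,j}^{0}) + d^g(m_{i,j}^{1}) \right) \le k \]
• b_x^g = 1 if x is in cluster g; d^g(m_{i,j}^{val}) = 1 if marker m_{i,j}^{val} is selected for cluster g. At most G clusters will be generated.
• Cubic MIP!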
MIP Formulation (contd)
 Issue 2: the sum of benefits is not consistent with the specificity of a set of markers
 Essentially, the previous formulation will select the five common markers with the highest benefit.
 However, this is not optimal.
[Figure: control set with the subsets of controls mismatched by markers 1, 2, 3, and 4]
 Individually, markers 1 and 2 provide larger specificity than markers 3 and 4 (they mismatch more controls).
 However, the mismatch sets of markers 1 and 2 have a larger overlap.
 Selecting markers 3 and 4 as the marker set gives overall higher specificity.
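The overlap effect can be checked with a few lines of Python. The control IDs and mismatch sets below are made-up numbers chosen to mirror the situation on the slide, and the rule that a control is rejected when it mismatches at least one marker of the set is taken from the prediction scheme used later in the deck.

```python
def mismatched_controls(marker_set, mismatch):
    """Controls rejected by a marker set: those mismatching at least one of its markers."""
    rejected = set()
    for marker in marker_set:
        rejected |= mismatch[marker]
    return rejected

# Hypothetical mismatch sets over 12 controls (IDs 1..12).
mismatch = {
    1: {1, 2, 3, 4, 5},     # marker 1 alone mismatches 5 controls
    2: {2, 3, 4, 5, 6},     # marker 2 alone mismatches 5 controls, mostly the same ones
    3: {1, 2, 7, 8},        # marker 3 alone mismatches 4 controls
    4: {9, 10, 11, 12},     # marker 4 alone mismatches 4 controls, disjoint from marker 3
}
n_controls = 12
spec_12 = len(mismatched_controls([1, 2], mismatch)) / n_controls
spec_34 = len(mismatched_controls([3, 4], mismatch)) / n_controls
```

Markers 1 and 2 are individually stronger, yet together they reject only 6 of the 12 controls, while markers 3 and 4 together reject 8.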
MIP Formulation (contd)
 Adding accurate specificity terms to the obj. func. for each control z:
\[ \prod_{1 \le i \le G} (1 - M_i(z)) \cdot g_{mis} \]
 M_i(z): whether control z matches the marker set selected for cluster i; M_i(z) is a Boolean OR of various 0/1 vars
 g_mis: objective function gain for mismatching a control.
 Final objective function:
\[ \text{MAX: } \sum_{1 \le g \le G} \; \sum_{m_{i,j}^{val}} \; \sum_{1 \le x \le n_p} \; \sum_{1 \le y \le n_p} s(x, y, m_{i,j}^{val}) \cdot b_x^g \cdot b_y^g \cdot d^g(m_{i,j}^{val}) \; + \sum_{1 \le z \le n_c} \; \prod_{1 \le i \le G} (1 - M_i(z)) \cdot g_{mis} \]
 At least cubic MIP (if G <= 3)
 g_mis is determined so that specificity and sensitivity are given the same weight.
 Average gain for a patient matching a marker set: 2 k b_avg^α (n_p/G), where n_p is the number of patients and G is the number of groups.
 g_mis = 2 k b_avg^α (n_p/G) · n_p/n_c
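The weighting rule for g_mis and the specificity term can be written out as a short Python sketch. The function names and the numbers in the test are hypothetical, and reading α as an exponent on b_avg and the per-control term as a product over clusters are assumptions about the slide's notation.

```python
def g_mis(k, b_avg, alpha, n_p, n_c, G):
    """Gain for mismatching one control, scaled so that the n_c controls
    carry the same total weight as the n_p patients."""
    avg_patient_gain = 2 * k * (b_avg ** alpha) * (n_p / G)
    return avg_patient_gain * n_p / n_c

def specificity_gain(match, gain):
    """Sum over controls z of prod_i (1 - M_i(z)) * g_mis.

    match[z][i] is 1 if control z matches the marker set of cluster i, else 0.
    A control contributes only if it mismatches the marker sets of all clusters.
    """
    total = 0.0
    for row in match:
        product = 1
        for m in row:
            product *= (1 - m)
        total += product * gain
    return total
```

With k = 5, b_avg = 0.5, α = 1, 100 patients, 200 controls, and G = 4, the average patient gain is 125 and g_mis is 62.5, so each correctly rejected control counts for half as much as a matched patient, reflecting that there are twice as many controls.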
Discretized Network Flow (DNF)
 Standard min-cost network flow
 Find a min-cost way to send a certain amount of flow from the source node (S) to the sink node (T).
[Figure: example network from S to T with arcs labeled (capacity, cost), e.g., (2,0), (1,4), (1,1), (1,2), an MEA arc set, and one valid and one invalid flow for f=1]
 Solves certain LP problems (continuous solns)
 Some discrete constraints have to be satisfied in order to solve discrete opt. problems like MIP
 One such constraint: mutually exclusive arc set (MEA): at most one arc in this set can have flow on it.
Discretized Network Flow (contd)
 Satisfying MEA requirements
 Add a flow-amount-independent cost C’ to each arc in the set
 A constant cost C’ is incurred whenever there is flow on the arc
[Figure: arc cost c vs. flow f up to Cap(e), for the standard linear flow cost and for the cost with C’ added; MEA sets whose arcs each carry cost C’]
 C’inv: total C’-related cost for an invalid flow
 C’val: total C’-related cost for a valid flow
 C’inv ≥ C’val + C’
Discretized Network Flow (contd)
 Determining C’ (in the standard network flow graph):
1. Heuristically select a valid flow & determine its cost Cval
2. Obtain the min-cost flow, of cost Cinvmin, w/o discretization constraints
3. Set C’ = Cval - Cinvmin + 1
[Figure: flow costs on a cost axis. Without C’: Cinv, Cvalmin, Cval. With C’: Cvalmin + C’val, Cval + C’val, Cinv + C’inv]
 Since C’inv ≥ C’val + C’, a valid flow is guaranteed to have a smaller cost than any invalid flow.
 Theorem [Ren et al., ICCAD’08]: A min-cost flow with C’-costs on MEA arcs ensures MEA satisfaction
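The reason this choice of C’ works can be spelled out in one chain of inequalities. The derivation below is a sketch that uses only the quantities defined on the slide; it assumes, as the slide does, that every invalid flow pays at least one extra C’ compared with a valid flow.

```latex
% Total cost of any invalid flow, with flow cost C_{inv} and C'-related cost C'_{inv}:
\begin{align*}
C_{inv} + C'_{inv}
  &\ge C_{inv}^{min} + C'_{val} + C'                          && \text{(} C_{inv} \ge C_{inv}^{min},\; C'_{inv} \ge C'_{val} + C' \text{)} \\
  &=   C_{inv}^{min} + C'_{val} + (C_{val} - C_{inv}^{min} + 1) && \text{(definition of } C' \text{)} \\
  &=   C_{val} + C'_{val} + 1 \\
  &>   C_{val} + C'_{val}                                       && \text{(total cost of the heuristic valid flow)}
\end{align*}
```

So every invalid flow costs strictly more than the known valid flow, and a min-cost flow on the modified graph cannot be invalid.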
Discretized Network Flow (contd)
 Discretized network flow has been applied to VLSI CAD problems [Ren et al., ICCAD’08], [Ren et al., IWLS’08], [Dutt et al., ICCAD’06]
 Good run time and scalability.
 10x to 60x faster than CPLEX with similar solution quality
 Example: determine optimal cell sizes in a circuit under an area constraint
 Four sizes available. The number of 0/1 variables is about four times the number of cells considered.
[Figure: run time (secs) vs. # of cells considered (0 to 4000), from [Ren et al., IWLS’08]; linear fit y = 0.3823x + 8.5251]
DNF Model for Single-Cluster Marker
Selection
[Figure: DNF graph from S to T for single-cluster marker selection. Patient nodes P1 … Pm appear on two levels joined by a complete bipartite graph with meta arcs. Each Px contains SNP nodes S1 … SN and nodes p1,1 … pN,3, with MEAs so that only k arcs can have flow. Arcs are labeled (cap, cost): the arc between the pi,j nodes of Px and Py is (1, -s(x,y,pi,j)) if ci,j(x)=ci,j(y), with no connection otherwise. Other labels shown: f=np, (np,0), f=1, f=np*k, (np*k,0).]
 Flow through the p_{i,j} node in Px means d(m_{i,j}^{p_{i,j}(x)}) = 1
 Pairwise connection between p_{i,j} nodes ensures the same marker set is selected for all Px
 The flow cost incurred for selecting a common marker between two patients is -s(x, y, m_{i,j}^{p_{i,j}(x)})
Marker Selection for Multiple Clusters
 Use multiple copies of the single-cluster network model
 For G clusters there are G copies of the 2-level complete bipartite graph; not all G clusters may be formed
 Type 1 invalid flow: flow puts P1 in both cluster 1 and 2
 Type 2 invalid flow: flow thru P1 passes thru P2 that is not in the same cluster, incurring false costs.
 MEA prevents invalid flows
 Example valid flow: puts patients {1,4} in cluster 1, and {3,2} in cluster 2.
[Figure: S, choice nodes, and two cluster copies (Cluster 1, Cluster 2), each a complete bipartite graph over P1 … P4 with MEAs, leading to T]
Marker Selection for Multiple Clusters
 Issue: When G is large, the network flow graph becomes very complex
 We use iterative bi-partitioning instead
 Much harder bi-part prob than standard bi-part; the bi-part criterion needs to be selected simultaneously w/ the bi-part!
 Condition for stopping the bipartitioning of a cluster: the spec+sens deteriorates
 Another run-time reduction technique: patient pre-clustering
 Group patients before using DNF.
 Greedy iterative grouping method
 Initially, each patient is a subgroup
 Each time, merge the two subgroups with the most common SNP-allele pairs.
 Termination condition: patients in one group must have at least 70% of SNP-allele pairs in common.
 Each group is taken as a “meta patient” in DNF
 Groups are opened up after DNF, and metrics are evaluated at the individual level
[Figure: bi-partitioning tree; branches stop where they meet the termination condition, and the resulting leaves form the final solution]
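The greedy pre-clustering described above can be sketched as follows. This is an illustrative reading, not the authors' code: the names `common_fraction` and `precluster` are made up, and "SNP-allele pairs in common" is interpreted here as the fraction of SNPs on which every patient in a group has the same genotype value.

```python
def common_fraction(group, genotypes):
    """Fraction of SNPs on which all patients in the group share the same value."""
    n_snps = len(genotypes[0])
    same = sum(1 for s in range(n_snps)
               if len({genotypes[p][s] for p in group}) == 1)
    return same / n_snps

def precluster(genotypes, min_common=0.70):
    """Greedy merging: start with one subgroup per patient and repeatedly merge
    the pair of subgroups whose union has the most in common, stopping when the
    best merge would fall below the min_common threshold."""
    groups = [[p] for p in range(len(genotypes))]
    while len(groups) > 1:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                frac = common_fraction(groups[a] + groups[b], genotypes)
                if best is None or frac > best[0]:
                    best = (frac, a, b)
        frac, a, b = best
        if frac < min_common:
            break  # termination condition: merged group would share < 70%
        groups[a] = groups[a] + groups[b]
        del groups[b]
    return groups
```

Each returned group would then act as one "meta patient" in the flow model, which shrinks the bipartite graph before the expensive optimization starts.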
Chain Structure for Improving Specificity
\[ \prod_{1 \le i \le G} (1 - M_i(z)) \cdot g_{mis} \]
[Figure: chain structure for control z, from S to T, with arcs labeled (cap, cost). The MM chain has cost = -gmis and the M chain has cost = 0; injection arcs A1, A2, …, Ag, labeled (1,0), enter from Cluster 1, Cluster 2, …, Cluster g under MEAs.]
 One chain structure for each control.
 Two subchains: mismatched (MM) chain and matched (M) chain.
 One injection arc to the M subchain from each cluster: A1 … Ag.
 Injection flow on arc Ai means z matches the selected marker set of cluster i (Mi(z)=1).
 Chain flow stays on the MM chain if no injection arc has flow, and incurs a cost of -gmis.
 Any injection flow causes the MEA condition to force chain flow into the M chain, never to switch back; hence it incurs 0 cost.
Experimental Results
 Data set we use
 Crohn’s disease: 144 cases, 243 controls and 103 SNPs
 Autoimmune disorder: 384 cases, 652 controls and 108 SNPs
 Tick-borne encephalitis: 21 cases, 54 controls and 41 SNPs
 Rheumatoid arthritis: 460 cases, 460 controls and 2300 SNPs
 Lung cancer: 322 cases, 273 controls and 141 SNPs
 Rheumatoid arthritis (large): 868 cases, 1194 controls and 5000
SNPs
 Prediction scheme with multiple cluster marker sets
[Figure: prediction scheme. Test 1 checks marker set 1 and Test 2 checks marker set 2; a match predicts sick, and mismatching every marker set predicts healthy.]
TP: correctly predicted as sick
FP: falsely predicted as sick
TN: correctly predicted as healthy
FN: falsely predicted as healthy
Sensitivity = TP/(FN+TP)
Specificity = TN/(FP+TN)
Accuracy = (TN+TP)/(FP+TN+FN+TP)
 Machine configuration: 3 GHz CPU, 1 GB memory, Windows machine.
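The prediction scheme and the three metrics above can be sketched as follows. The function names and the data layout (an individual as a dict keyed by (SNP, allele), a marker set as a list of (key, value) pairs) are assumptions made for illustration.

```python
def matches(individual, marker_set):
    """True if the individual agrees with every marker in the set."""
    return all(individual[key] == val for key, val in marker_set)

def predict_sick(individual, marker_sets):
    """Predict sick if the individual matches the marker set of any cluster."""
    return any(matches(individual, ms) for ms in marker_sets)

def metrics(predicted, actual):
    """Return (sensitivity, specificity, accuracy) from parallel boolean lists."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    tn = sum(not p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    sensitivity = tp / (fn + tp)
    specificity = tn / (fp + tn)
    accuracy = (tn + tp) / (fp + tn + fn + tp)
    return sensitivity, specificity, accuracy
```

Because a single match is enough to predict sick, adding clusters can only raise sensitivity and can only lower specificity, which is why the objective needs an explicit specificity term.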
Experimental Results
 Five-fold cross validation
 K=10 results for Rheum. (large, no comparisons available): sens: 85; spec: 80; accuracy: 82; 10 clusters; 21.5 h per training run
 Comparisons to MDR:
[Figure: Specificity by data set (Autoimm., Crohn, Tick-borne, Lung Cancer, Rheum., Avg) for MDR(k=5), DNF(k=5), DNF(k=10). Data labels: 78.1, 56.7, 81.9; annotation: 38% relatively.]
[Figure: Sensitivity by data set (Autoimm., Crohn, Tick-borne, Lung Cancer, Rheum., Avg) for MDR(k=5), DNF(k=5), DNF(k=10). Data labels: 87.6, 48.8, 88.4; annotation: 79% relatively.]
# of clusters:
Data set       K=5   K=10
Autoimm.       12    16
Crohn          12    16
Tick-borne     6     6
Lung cancer    14    16
Rheum.         13    14
Experimental Results
 Comparisons to CSP [Brinza commun. 4/09, Brinza et al., WABI’06 ppt:
http://www.cs.ucsd.edu/~dbrinza/cv/present/brinza_wabi06.ppt]
 Leave-one-out validation
 For DNF, 20 runs are performed with randomly chosen left-out individuals
 CSP performs n runs for n individuals (cases+controls)
[Figure: Specificity for Autoimm., Crohn, Tick-borne, and Avg; DNF(k=10) vs. CSP. Data labels: 96.6, 71.1; annotation: 36% relatively.]
[Figure: Sensitivity for Autoimm., Crohn, Tick-borne, and Avg; DNF(k=10) vs. CSP. Data label: 83.1; annotation: 2.4% relatively.]
[Figure: Geometric mean of sens. and spec. for Autoimm., Crohn, Tick-borne, and Avg; DNF(k=10) vs. CSP. Data labels: 90.6, 76.8; annotation: 18% relatively.]
[Figure: Run time (ksecs, per leave-out run) for Autoimm., Crohn, Tick-borne, and Avg; DNF(k=10) vs. CSP. Data labels: 3k, 24k; annotation: 8 times.]
Experimental Results
 Leave-one-out validation
[Figure: Accuracy for Autoimm., Crohn, Tick-borne, and Avg; DNF(k=10) vs. CSP. Data labels: 76.6, 90.8; annotation: 19% relatively.]
Average number of clusters:
Data set       Clusters
Autoimm.       18
Crohn          16
Tick-borne     6
Lung cancer    17
Rheum.         14
Experimental Results
 Comparing to LINGO (<= 20% from optimal setting)
 Same MIP formulation is solved by LINGO, and we compare the MIP
objective function value and run time with DNF.
\[ \text{MAX: } \sum_{1 \le g \le G} \; \sum_{m_{i,j}^{val}} \; \sum_{1 \le x \le n_p} \; \sum_{1 \le y \le n_p} s(x, y, m_{i,j}^{val}) \cdot b_x^g \cdot b_y^g \cdot d^g(m_{i,j}^{val}) \; + \sum_{1 \le z \le n_c} \; \prod_{1 \le i \le G} (1 - M_i(z)) \cdot g_{mis} \]
 Comparisons are for 1 iteration of bi-partitioning and quad-partitioning (i.e., G=2,4)
[Figure: Bi-p normalized quality (DNF is 1, the larger the better) for Autoimm., Crohn, Tick-borne, Lung Cancer, Rheum., and Avg]
[Figure: Bi-p normalized run time (DNF is 1, smaller is better) for the same data sets]
[Figure: Quad-p normalized quality (DNF is 1) for the same data sets]
[Figure: Quad-p normalized run time (DNF is 1, smaller is better) for the same data sets]
Data labels shown in these charts: quality 0.96, 0.95, 0.96; run time 15, 23.
Experimental Results
 Run time vs. number of SNPs
 Rheumatoid arthritis data set is used
 Randomly chosen 100, 200, 400, 800, 1600, 2300 SNPs
[Figure: run time (ksec) vs. # of SNPs (0 to 2500); fit y = 8.8x + 3658]
 Run time vs. number of patients
 Crohn’s disease data set is used
 No patient pre-clustering. Randomly chosen 30, 60, 90, 120, 144 patients from the data set
[Figure: run time (ksec) vs. # of patients (0 to 160); fit y = 0.65x^2 - 3x + 135]
Conclusions
 We proposed 0/1 non-linear MIP formulations to identify
disease markers.
 We consider patient clustering to identify most
appropriate marker sets
 The discretized network flow (DNF) method is used to
efficiently solve the MIP formulations.
 A chain structure is used for improving specificity
 Significant improvements compared to MDR and CSP
 Also much faster run times
 Can apply DNF to other computationally challenging
bioinfo problems since:
 DNF can efficiently & near-optimally solve polynomial and
Boolean MIPs
 DNF can also efficiently & near-optimally solve other discrete
optimization problems
Appendix: Generating Injection Flow
[Figure: injection-flow generation for cluster k. Shown: S, the Mi,jval nodes that mismatch NPz, arcs Ak and Āk coupled by a draining arc, the MM chain and M chain, and arcs to T; arc labels (cap, cost) include (1,C’), (2,C’), (1,0), (1,-inf).]
 First a complementary injection flow is generated on a complementary arc Āk, which is 1 if any mismatched marker for NPz is selected
 Ak and Āk are coupled by a draining arc.
 If there is flow on Āk: flow towards Ak is shunted to the sink
 If there is no flow on Āk: flow will be drained from Ak, and cause injection flow to the chain