Multivariate Approaches: Joint Modeling of Imaging & Genetic Data Giovanni Montana

advertisement
Multivariate Approaches: Joint Modeling of Imaging &
Genetic Data
Giovanni Montana
Statistics Section
Department of Mathematics
Imperial College
London, UK
10 June, 2012
1 / 86
1.
The Univariate Approach: A Brief Review
2.
Multivariate Models for Voxelwise GWAS
3.
Comparative Power Assessment
4.
ADNI Case Study I
5.
Multivariate Models for Voxelwise Pathways GWAS
6.
ADNI Case Study II
7.
Conclusions
2 / 86
The Univariate Approach: A Brief Review
1.
The Univariate Approach: A Brief Review
2.
Multivariate Models for Voxelwise GWAS
3.
Comparative Power Assessment
4.
ADNI Case Study I
5.
Multivariate Models for Voxelwise Pathways GWAS
6.
ADNI Case Study II
7.
Conclusions
3 / 86
The Univariate Approach: A Brief Review
Association Mapping with Unrelated Individuals
Subject i (i = 1, . . . , n)
(xi1 , xi2 , . . . , xip )
(yi1 , yi2 , . . . , yiq )
For instance, p = 600, 000 SNPs, and q = 2, 000, 000 vocels.
The goal is to identify markers highly predictive of all or phenotypes
4 / 86
The Univariate Approach: A Brief Review
Mass Univariate Linear Modelling (MULM)
Genotypes
. Fit all (p × q) linear
regression models
1
yj = βjk xk + ϵ
. Test all (p × q) null
hypotheses
x1
β1 = 0
x2
β2 = 0
yj
2
H0 : βjk = 0
. Correct for multiple
testing, e.g. control
FWER or FDR
4. Rank by p-values
Univariate
Phenotype
x3
β3
x4
β4 = 0
j = 1, 2, . . . , q
3
xp
βp
5 / 86
The Univariate Approach: A Brief Review
MULM: Properties
Genotypes
x1
. Linear genotype-phenotype
relationship
1
β1 = 0
. Ranks all possible
genotype-phenotype pairs
2
x2
x3
x4
β2 = 0
Univariate
Phenotype
yj
β3
j = 1, 2, . . . , q
. Ignores dependences
among genotypes
3
. Ignores dependences
among phenotypes
4
β4 = 0
. Massive multiple testing
problem which should
account for dependence
patterns
5
xp
βp
6 / 86
The Univariate Approach: A Brief Review
Example: APOE4 in Alzheimer’s Disease
Filippini et al 2009
APOE4 SNP: Homozygote of minor allele is 2, heterozygote is 1 and
homozygote of major allele is 0
n = 83 and q = 30k voxels (Gray Matter Volume in 15 ROIs)
Mean GMV reduced by 14% in the red area for homozygotes compared
to non-carriers
7 / 86
Multivariate Models for Voxelwise GWAS
1.
The Univariate Approach: A Brief Review
2.
Multivariate Models for Voxelwise GWAS
3.
Comparative Power Assessment
4.
ADNI Case Study I
5.
Multivariate Models for Voxelwise Pathways GWAS
6.
ADNI Case Study II
7.
Conclusions
8 / 86
Multivariate Models for Voxelwise GWAS
A Multivariate Regression Approach
. For each univariate
phenotype yj , fit the
multiple linear regression
model
1
yj =
p
∑
βk xk + ϵ
Genotypes
x1
β1 = 0
x2
β2 = 0
yj
k=1
. Solve the OLS problem
after imposing a penalty
2
Univariate
Phenotype
x3
β3
x4
β4 = 0
j = 1, 2, . . . , q
P(β) < c
where β = (β1 , . . . , βp )
. Detect genetic factors
influencing each one of the
q univariate phenotypes
3
xp
βp
9 / 86
Multivariate Models for Voxelwise GWAS
Selected Penalties
Lasso penalty:
β̂ = argmin
β
▶
▶
{ n
∑
(yi −
i=1
p
∑
k=1
2
xik βk ) + λ
p
∑
}
|βk |
k=1
The l1 penalty ensures that some coefficients will be exactly zero
The λ parameter controls the number of non-zero coefficients
Other penalties have been proposed to impose some structure while
performing variable selection, e.g.
▶
▶
Elastic net (l1 and l2 penalties)
Group lasso
10 / 86
Multivariate Models for Voxelwise GWAS
Lasso Regression: an Illustration
p = 100 genotypes, y linearly depends only on (x1 , x2 )
0.4
0.2
0.0
−0.2
Coefficients b
0.6
0.8
1 Phenotype
1
5
10
15
20
25
30
35
40
45
50
55
60
65
70
75
80
85
90
95
100
Number of selected Xs
11 / 86
Multivariate Models for Voxelwise GWAS
Fast Parameter Estimation
Often no closed-form estimators are available
When predictors are uncorrelated, use soft thresholding
{
}
. Find the OLS estimates β̂ ols = β ols , β ols , . . . , β ols
p
1
2
2. Cycle over the coefficients and apply a thresholding update:
)
(
λ
lasso
ols
ols
β̂k = sign(β̂k ) |β̂k | −
2 +
1
{
where
(a)+ =
a if a > 0
0 otherwise
With correlated predictors, use coordinate descent
▶
▶
Fast iterative algorithm also based on soft-thresholding
Can be derived for a large class of penalties
12 / 86
Multivariate Models for Voxelwise GWAS
Penalised Regression on Latent Factors
For univariate phenotypes
. High number of correlated
predictive markers
1
Genotypes
x1
x2
uh1
uh2
Phenotype
th
x3
. Assume the existence of
latent factors, i.e. hidden
factors that have high
predictive power
2
yj
h = 1, . . . , p
. A latent factor is a linear
combination of SNPs
3
. Estimate the contribution
of each SNPs on each
factor while enforcing that
only a subset of SNPs
have non-zero weights
4
uhp
xp
13 / 86
Multivariate Models for Voxelwise GWAS
Penalised Regression on Latent Factors
For multivariate phenotypes
Genotypes
x1
Phenotypes
y1
uh1
x2
y2
th
sh
vh2
h = 1, . . . , min(p, q)
uhp
yq
xp
14 / 86
Multivariate Models for Voxelwise GWAS
Reduced-Rank Regression
If C is the (p × q) matrix of regression coefficients, then Y = X C + E
When C has rank r < min(p, q), the model can be written as
Y = X BA + E
Each one of the r ranks captures a different causal effect
15 / 86
Multivariate Models for Voxelwise GWAS
Sparse Reduced-Rank Regression (sRRR)
Vounou et al. (2010, 2012)
Simultaneous genotype and phenotype selection is achieved by
imposing penalties on A and B
16 / 86
Multivariate Models for Voxelwise GWAS
Rank-1 sRRR
The rank-1 model is:
Y
=
X
b
a
+
E
(n × q)
(n × p) (p × 1) (1 × q)
(n × q)
The sparse coefficients b and a are found by solving
{ [
]
}
b̂, â = argmin Tr (Y − Xba) Γ (Y − Xba)′ + λa Pa (a) + λb Pa (b)
b,a
where Γ is a given q × q positive definite matrix, e.g. 1q
Lasso penalties can be used for both genotypes and phenotypes
Pa (a) =
q
∑
j=1
|aj |
Pb (b) =
p
∑
|bj |
j=1
but many other penalties are also possible to impose more structure
17 / 86
Multivariate Models for Voxelwise GWAS
Coordinate Descent Algorithm for Rank-1 sRRR
. Initialise
1
▶
▶
a0 such that a0 a′0 = 1
′
b0 such that b0 b0 = 1
. Iterate until convergence
2
▶
Update and normalise
(
)
λb
′
0′
b̂ = sign(X Ya ) |X Ya | −
2 +
′
▶
0′
Update and normalise
)
(
λa
′
′
â = sign(b0 X′ Y) |b0 X′ Y| −
2 +
▶
Set a0 = â and b0 = b̂
18 / 86
Multivariate Models for Voxelwise GWAS
SNP Ranking using k-fold Cross Validation
For a given parameter λ
1
.
.
.
.
samples
1
n/k
test
n/k
2
K
training
test
n/k
training
n/k
training
.................
training
n/k
test
n
Average Prediction Error
. Repeat for all λ in [λmin , λmax ]
2. Select λ∗ corresponding the the smaller average prediction error
1
19 / 86
Multivariate Models for Voxelwise GWAS
SNP Ranking using Data Resampling
Resampling Without Replacement
.
.
0.4
λ
1
random
random
set
random
set
random
set
random
set
random
random
set
set
training
set
SNP selection
probabilities
0.3
2
B
0.2
1
.
.
.
For a given parameter
0.1
samples
0.0
n
1
2
3
4
5
6
7
8
. Repeat for all λ in [λmin , λmax ]
2. Select the SNPs having selection probability greater than threshold
9
10
1
20 / 86
Multivariate Models for Voxelwise GWAS
Illustration: sRRR with Data Resampling
Selected SNPs from Rank 1
1.0
100 Phenotypes, 5 selected
0.8
p = 100 genotypes
0.6
First signal:
(y1 , y2 ) on (x1 , x2 )
0.2
0.4
Second signal:
(y3 , y4 ) on (x3 , x4 )
0.0
Selection Probabilities
q = 100 phenotypes
0
1
2
3
4
5
Number of selected Xs
21 / 86
Multivariate Models for Voxelwise GWAS
Illustration: sRRR with Data Resampling
Selected SNPs from Rank 2
0.6
0.4
0.2
0.0
Selection Probabilities
0.8
1.0
100 Phenotypes, 5 selected
0
1
2
3
4
5
Number of selected Xs
22 / 86
Multivariate Models for Voxelwise GWAS
Illustration: sRRR with Data Resampling
Selected SNPs from Rank 3
0.6
0.4
0.2
0.0
Selection Probabilities
0.8
1.0
100 Phenotypes, 5 selected
0
1
2
3
4
5
Number of selected Xs
23 / 86
Comparative Power Assessment
1.
The Univariate Approach: A Brief Review
2.
Multivariate Models for Voxelwise GWAS
3.
Comparative Power Assessment
4.
ADNI Case Study I
5.
Multivariate Models for Voxelwise Pathways GWAS
6.
ADNI Case Study II
7.
Conclusions
24 / 86
Comparative Power Assessment
Power Studies
. Simulation of genotypes in a human population
1
▶
▶
▶
▶
Forwards-in-time simulation with FREGENE (Hoggart et al, 2007)
Data is simulated realistically - evolutionary parameters are controlled
N = 10k individuals
P = 40k SNPs
. Repeated sampling from the population
2
▶
Control study design
⋆
⋆
▶
Simulate phenotypes
⋆
⋆
⋆
▶
Sample size n
Total number of design SNPs p
q = 111 ROIs (GSK Brain Atlas)
From each ROI, simulated a mean modulated GM value
Simulations calibrated on ADNI data
Induce genetic effects
⋆
⋆
⋆
Randomly select 10 causative SNPs
Induce a reduction in mean GM in 6 ROIs
Genetic effects explain 5% of phenotypic variance in the 6 ROIs
25 / 86
Comparative Power Assessment
Simulated Genotypes
Linkage disequilibrium patterns
SNPLDCoef
f
i
c
i
ent
s
26 / 86
Comparative Power Assessment
Simulated Phenotypes
ROI correlation matrix
ROICor
r
el
at
i
onCoef
f
i
c
i
ent
s
27 / 86
Comparative Power Assessment
Power to Detect Causative SNPs (n = 500)
28 / 86
Comparative Power Assessment
Power to Detect Causative SNPs (n = 1000)
29 / 86
Comparative Power Assessment
Relative Power in Large-scale GWA Studies
Ratio of SNP sensitivities (sRRR/MULM) as a function of the total number of SNPs
30 / 86
Comparative Power Assessment
Atlas-Guided sRRR
K non-overlapping voxel groups (ROIs) g1 , . . . , gK
Group k contain qk voxels
It is common to take ROI averages z1 , . . . , zK as phenotypes
However averaging may reduce the power to detect genetic effects
31 / 86
Comparative Power Assessment
Two Atlas-guided sRRR Models
Group and sparse group selection penalties
. A model to select all voxels
within one or more ROIs
(Group Selection)
1
Voxel Group Selection
. A model to select only subset
of voxels within one or more
ROI (Sparse Group Selection)
2
ROI k
Sparse Voxel Group Selection
32 / 86
Comparative Power Assessment
Penalties
{ [
]
}
b̂, â = argmin Tr (Y − Xba) Γ (Y − Xba)′ + λa Pa (a) + λb Pb (b)
b,a
. Group lasso: select all voxels in a ROI
1
Pa (a) =
K
∑
∥agk ∥2
k=1
. Sparse group lasso: select a subset of voxels within a ROI
2
Pa (a) = m
q
∑
j=1
|aj | + (1 − m)
K
∑
∥agk ∥2
k=1
33 / 86
Comparative Power Assessment
Power to Detect Causative SNPs
34 / 86
Comparative Power Assessment
Power to Detect Signal Voxels
35 / 86
ADNI Case Study I
1.
The Univariate Approach: A Brief Review
2.
Multivariate Models for Voxelwise GWAS
3.
Comparative Power Assessment
4.
ADNI Case Study I
5.
Multivariate Models for Voxelwise Pathways GWAS
6.
ADNI Case Study II
7.
Conclusions
36 / 86
ADNI Case Study I
ADNI: Genetic Data
Samples available for AD’s and MCI (Mild Cognitive Impairment)
patients and healthy controls (CN)
▶
▶
Progressive MCI (P-MCI): those who converted to AD
Stable MCI (S-MCI): those who did not convert to AD
n is the sample size broken down by class (nH and nD )
p total SNPs that survived quality control
Baseline and 24-month follow-up scans were available
Three studies were performed:
Comparison
AD vs CN
P-MCI vs CN
P-MCI vs S-MCI
n
254
260
221
nH
101
107
107
nD
153
153
114
p
322875
309730
304209
37 / 86
ADNI Case Study I
ADNI: Imaging Data
. Data preprocessing:
1
▶
▶
▶
▶
▶
Baseline and 24 month follow-up MR images available (Oct 2010)
Follow-up scans aligned with baseline scans using non-rigid registration
Jacobian determinants were extracted from the resulting deformation
fields and represent the expansion/contraction on a voxel basis
After extracting Jacobian maps for all subjects, they were transformed
to a template (using non-rigid registration) estimated for baseline scans
q = 1, 650, 857 voxel intensities (Jacobian determinants) representing
longitudinal changes corrected for age and sex
. Disease imaging signature extraction (voxel selection):
2
▶
▶
▶
For each comparison, we identified regions of highly discriminative voxels
Penalised linear discriminant analysis (pLDA) with data resampling
Predictive power of disease signature was assessed using SVMs
38 / 86
ADNI Case Study I
Linear Discriminant Analysis (LDA)
LDA finds a linear combination of the voxels intensities that best
discriminates between classes - it finds a projection vector w such that
Yw gives best linear discrimination
We have two classes: H (Healthy) and D (Diseased) - the mean
vectors for H and D, and the overall mean are, respectively
1 ∑
mH =
yi·
nH
i∈H
1 ∑
mD =
yi·
nD
i∈D
1∑
m=
yi·
n
n
i=1
The between-class scatter matrix is
ΣB = (mH − mD )′ (mH − mD )
The within-class scatter matrix is
∑
∑
(yi· − mD )′ (yi· − mD )
(yi· − mH )′ (yi· − mH ) +
ΣW =
i∈H
i∈D
39 / 86
ADNI Case Study I
LDA Solution
The optimal direction vector w solves
max
w
w ′ ΣB w
w ′ ΣW w
This is equivalent to
max w′ ΣB w
w
subject to
w ′ ΣW w = 1
This solution involves all the q voxels, but we want to filter
non-informative voxels out
40 / 86
ADNI Case Study I
Penalised LDA
We assume a diagonal within-class scatter matrix, SW , where
diag(SW ) = (s12 , . . . , sq2 ) and call SB the estimated between-group
scatter matrix
Impose w to be sparse by adding a penalty term, and carry out
constrained maximisation


q


∑
′
max w SB w − λ
sj |wj |
w 

j=1
subject to
w′ SW w = 1
Since the objective function is non-convace, standard convex
optimization methods cannot be used, but we use a
minorization-maximization algorithm
▶
▶
find a concave function that minorizes the objective function
then use convex maximisation
41 / 86
ADNI Case Study I
Parameter Tuning and Image Classification Results
Jasounova at al (2011)
We use a data resampling strategy combined with a non-linear Support
Vector Machine classifier to (a) select the optimal number of voxels,
(b) obtain a sparse image classification
vox is the number of selected voxels using sparse LDA
10-fold cross validated performance measures: accuracy (acc),
sensitivity (sen) and specificity (spe)
Experiment
AD vs CN
P-MCI vs CN
P-MCI vs S-MCI
vox
11394
12664
10593
acc
90.3
86.9
82.1
sen
87.5
81.2
81.5
spe
92.1
90.9
82.9
42 / 86
ADNI Case Study I
AD Imaging Signature
The selected voxels are in yellow (coronal, sagittal and axial view from left to right).
1.0
0.8
0.6
0.4
0.2
0.0
Many informative voxels cluster in the hippocampus and lateral ventricles
Also involved are the temporal lobe, amygdala and caudate nucleus
43 / 86
ADNI Case Study I
P-MCI Imaging Signature
The selected voxels are in yellow (coronal, sagittal and axial views from left to right).
1.0
0.8
0.6
0.4
0.2
0.0
44 / 86
ADNI Case Study I
P-MCI/S-MCI Imaging Signature
The selected voxels are in yellow (coronal, sagittal and axial views from left to right).
1.0
0.8
0.6
0.4
0.2
0.0
45 / 86
ADNI Case Study I
Sample Proximity using 2D Projections
MDS plots using the extracted imaging signatures
46 / 86
ADNI Case Study I
Voxel-wide GWA Analysis Flowchart
Vunou et al (2012)
(a) Multivariate phenotype selection
Penalised LDA with sub-sampling
Full-brain Phenotype Matrix
Ỹ
Disease Status Indicator Vector
z
Phenotype
Sub-matrix
Indicator
Sub-vector
Ỹ (b)
z(b)
b = 1, . . . , B
Cross-validated SVM
Classification
Disease Signature Matrix
Y
(b) Genetic association mapping
Sparse RRR with sub-sampling
Genotype Matrix
X
Genotype
Sub-matrix
Signature
Sub-matrix
X(b)
Y (b)
b = 1, . . . , B
SNP Selection
Probabilities & Ranking
47 / 86
ADNI Case Study I
AD: SNP Selection Probabilities
Figure: Ranks 1, 2 and 3 (from left to right).
48 / 86
ADNI Case Study I
AD: Top Ranked Genes
APOE-ϵ4 (∼ 1) - well know and replicated risk factor
TOMM40 (0.96) - located in close proximity to the APOE gene, it has
also been linked to AD in recent studies
BZW1 (0.8) - no prior implication, but expressed in brain, differentially
expressed in a microarray analysis on a mouse model related to a
neurodegenerative disease (amyotrophic lateral sclerosis)
PDZD2 (0.65) - interact with CST3, which is suspected to be
implicated
YES1 (0.5) - Three SNPs in the genes have high selection probability,
a possible link between this gene and AD suggested in the literature
49 / 86
ADNI Case Study I
P-MCI: SNP Selection Probabilities
Figure: Ranks 1, 2 and 3 (from left to right).
50 / 86
ADNI Case Study I
P-MCI: Top Ranked Genes
APOE-ϵ4 (∼ 1)
TOMM40 (0.59)
RBFOX1 (0.57) - associated to autism, bipolar disorder, mental
retardation and epilepsy
COX7A2L (0.53) - belongs in the AD KEGG pathway and physical
interactions between the key AD risk factor TOMM40 and COX7A2L
have been previously reported
51 / 86
ADNI Case Study I
P-MCI/S-MCI: SNP Selection Probabilities
Figure: Ranks 1, 2 and 3 (from left to right).
52 / 86
ADNI Case Study I
P-MCI/S-MCI: Top Ranked Genes
APOE-ϵ4
MGMT - using the Allen Brain Atlas, we confirmed that this gene is
expressed in the brain regions where the selected voxels mostly lie
Other previously unreported associations
53 / 86
Multivariate Models for Voxelwise Pathways GWAS
1.
The Univariate Approach: A Brief Review
2.
Multivariate Models for Voxelwise GWAS
3.
Comparative Power Assessment
4.
ADNI Case Study I
5.
Multivariate Models for Voxelwise Pathways GWAS
6.
ADNI Case Study II
7.
Conclusions
54 / 86
Multivariate Models for Voxelwise Pathways GWAS
From SNPs to Biological Pathways
Genes act together in
functionally related pathways
Pathways GWAS can reveal
aspects of a disease’s genetic
architecture that would
otherwise be missed when
considering variants individually
Increase power due to the
detection of coordinated small
signals within pathways
Easier biological interpretation
and comparisons across studies
55 / 86
Multivariate Models for Voxelwise Pathways GWAS
The SNPs to Genes to Pathway Mapping Process
Known genes are mapped to known pathways, e.g. KEGG
Many genes do not map to any known pathway (unfilled circles), some genes
may map to more than one pathway.
Genes that map to a pathway are in turn mapped to genotyped SNPs within
a specified distance. Many SNPs cannot be mapped to a pathway since they
do not map to a mapped gene (unfilled squares).
SNPs may map to more than one gene and some SNPs (orange squares) may
map to more than one pathway.
56 / 86
Multivariate Models for Voxelwise Pathways GWAS
Existing Methods for Pathways Selection
Existing methods are fundamentally univariate
▶
SNPs are independently scored
▶
The individual genetic effects are then combined over pathways
For instance, GenGen (Wang et al., 2007):
▶
▶
Rank all genes by assigning the value of the highest-scoring SNP within
500kb of each gene
Assess pathway significance by determining the degree to which
high-ranking genes are over-represented or enriched in a given gene set,
in comparison with genomic background
No methods exist for multivariate quantitative traits
57 / 86
Multivariate Models for Voxelwise Pathways GWAS
Pathways-based Sparse Regression Modelling
We approach the problem differently: we include all the known
pathways in a multivariate regression model so they can compete
against each other
The assumption is that, where causal SNPs are enriched in a pathway,
a regression model that selects all SNPs grouped into pathways will
have increased power
▶
▶
Group all available SNPs into L pathways G1 , . . . , GL
Adopt a group lasso penalty to force group selection,
∑
1
Pb (b) = ||y − Xb||22 + λ
wl ||bl ||2
2
L
l=1
to select only most predictive pathways,
b = {(0, . . . , 0), . . . , (0, . . . , bla , 0 . . . , blb , 0, . . . , 0), . . . , (0, . . . , 0)}
| {z }
| {z }
|
{z
}
G1
Gl
GL
58 / 86
Multivariate Models for Voxelwise Pathways GWAS
Challenges and Solutions
Silver and Montana (2011)
Challenges:
. Pathways often overlap, since many SNPs map to multiple pathways
2. Selection bias due to pathways heterogeneity in size, LD distribution, etc.
1
. Sheer scale of datasets, efficient estimation is a necessity
3
Solutions:
. Expanded design matrix by SNP duplication (non-orthogonal groups)
2. Adaptive pathway weights {w1 , . . . , wL } for bias correction
1
. A fast iterative estimation procedure: block-coordinate descent (BCD+)
algorithm for non-orthogonal groups
3
59 / 86
Multivariate Models for Voxelwise Pathways GWAS
SNPs Duplication in Overlapping Pathways
Three pathways G1 , G2 , G2 and grouped regression coefficients β1 , β2 , β3
60 / 86
Multivariate Models for Voxelwise Pathways GWAS
Pathways Group Lasso with Adaptive Weights
. In order to control for group size the common choice is to use a weight
√
wl = Sl
1
but other factors may bias the group selection process
2. In case of no association and no selection bias, a pathway G should be
l
selected according to a uniform distribution, that is with probability Πl
3. The empirical selection probability is called Π∗
l
4. We propose an adaptive strategy whereby the weights
w = {w1 , . . . , wL } are tuned so that the distance D between Π and Π∗
is minimised, where
D=
∑
l
Π∗l (w) log
Π∗l (w)
Πl
is taken to be the Kullback-Leibler (KL) divergence
61 / 86
Multivariate Models for Voxelwise Pathways GWAS
Fast Parameter Estimation Algorithms
. Block coordinate descent (BCD) has generally used for group lasso
with orthogonal groups
1
. We propose a BCD+ algorithm for non-orthogonal groups due to the
expanded design matrix
2
. The proposed BCD+ algorithm is particularly fast as it relies on a
number of techniques:
3
▶
Taylor approximation of the group lasso penalty
▶
Active sets strategy
▶
Efficient computation of block regression residuals
62 / 86
Multivariate Models for Voxelwise Pathways GWAS
Simulation Studies: ADNI Data, Chromosome 1
Genotypes: ADNI Chr1 33,850 SNPs Genes: GRCh37.p3, Chr1 2,382 genes SNP to gene mapping 20,399 SNPs mapped to 2,096 genes within 10kbp Pathways: 880 Pathways containing 880 dis2nct genes SNP to pathway mapping 8,102 SNPs mapped to 778 pathways remove pathways with < 10 mapped SNPs (130 pathways) remove pathways with iden2cal SNPs (97 pathways) P = 8,078 SNPs mapped to 551 pathways overlap expansion P* = 66,085 SNPs mapped to 551 pathways 63 / 86
Multivariate Models for Voxelwise Pathways GWAS
Number of Pathways per SNP
Frequency distribution of ADNI SNPs by number of pathways they map to
100
90
SNPs are mapped to genes
within 10kbp
8, 078 SNPs and 551
pathways
percentage of SNPs
80
70
60
50
40
30
20
10
0
10
20
30
40
50
60
number of pathways per SNP
64 / 86
Multivariate Models for Voxelwise Pathways GWAS
Genetic Effects
Randomly chose a causative pathway GC
Generate a set S of causal SNPs within GC
Form the set C, of causal pathways that contain all the SNPs in S
Simulate a univariate quantitative phenotype,
∑
y=
ζk xk + ϵ
k∈S
▶
▶
ζk is the allelic effect per minor allele due to causal SNP k
ϵ ∼ N (1, σϵ2 )
The effect size of each SNP k in S, that is
δk = E(ζk xk )/E(y )
65 / 86
Multivariate Models for Voxelwise Pathways GWAS
Simulation Scenarios
Given the real ADNI genotype data, phenotypes are simulated
We control three design factors:
▶
▶
▶
The number of causative SNPs, |S|
Where the causative SNPs are in the pathway
The effect size, δk (same for all causative SNPs)
scenario
(a)
(b)
(c)
(d)
(e)
(f)
|S|
10
3
3
10
3
3
δk
0.005
0.005
0.005
0.001
0.001
0.001
distribution
random from
random from
random from
random from
random from
random from
GC1
GC1
single gene in GC1
GC1
GC1
single gene in GC1
description
|S| large; δk
|S| small; δk
|S| small; δk
|S| large; δk
|S| small; δk
|S| small; δk
large; random distribn
large; random distribn
large; single gene
small; random distribn
small; random distribn
small; single gene
66 / 86
Multivariate Models for Voxelwise Pathways GWAS
BCD+ vs BCD: Computational Speed-ups
BCD: block coordinate descent for non-orthogonal groups
BCD+: improved BCD version with speed-ups
We report on estimation times (seconds)
Computations performed using multi-threading on a single machine
with 8 3.2 GHz processors and 64GB RAM.
sample size
371 (N/2)
743 (N)
P ∗ = 4k
BCD BCD+
7.93
0.17
16.9
0.27
P ∗ = 66k
BCD BCD+
421
1.35
511
2.5
P ∗ = 647k
BCD BCD+
5490
16
6430
30.0
67 / 86
Multivariate Models for Voxelwise Pathways GWAS
Adaptive Weights: Power Advantages
Adaptive weighting scheme vs. standard pathway size weighting
1
|S| = 10; δk = 0.005
0.8
SNPs randomly distributed
across causative pathway GC .
0.6
power
ROC curves illustrating power to
identify at least one causal
pathway in the top 100. Power
is average across 500
simulations.
0.4
0.2
GL − adapted weights
GL − standard weights
0
0
0.05
0.1
false positive rate
0.15
68 / 86
Multivariate Models for Voxelwise Pathways GWAS
Power Comparisons to GenGen
0.6
0.6
0.6
power
1
0.8
power
1
0.8
power
1
0.8
0.4
0.4
0.2
0.2
0.4
0.2
GL
GG
0
0
0.05
0.1
false positive rate
GL
GG
0
0
0.15
0.05
0.1
false positive rate
GL
GG
0
0
0.15
0.05
0.1
false positive rate
0.15
(a) |S| = 10; δk = 0.005; (b) |S| = 10; δk = 0.001; (c) |S| = 3; δk = 0.005;
random
random
random
1
0.8
0.6
0.6
0.6
power
power
1
0.8
power
1
0.8
0.4
0.4
0.2
0.2
0.4
0.2
GL
GG
0
0
0.05
0.1
false positive rate
0.15
GL
GG
0
0
0.05
0.1
false positive rate
0.15
GL
GG
0
0
0.05
0.1
false positive rate
0.15
(d) |S| = 3; δk = 0.001; (e) |S| = 3; δk = 0.005; (f) |S| = 3; δk = 0.001;
random
single gene
single gene
69 / 86
ADNI Case Study II
1.
The Univariate Approach: A Brief Review
2.
Multivariate Models for Voxelwise GWAS
3.
Comparative Power Assessment
4.
ADNI Case Study I
5.
Multivariate Models for Voxelwise Pathways GWAS
6.
ADNI Case Study II
7.
Conclusions
70 / 86
ADNI Case Study II
ADNI-1: Available Samples
Serial brain MRI scans were analyzed from 200 probable AD patients
and 232 healthy elderly controls (CN)
Longitudinal scans available at three time points
AD
CN
Total
Group
AD
CN
Screening
200
232
432
6Mo
165
214
379
12Mo
144
202
346
At screening:
age (years) N male
75.7±7.7
103
76.0±5.0
120
24Mo
111
178
289
N female
97
112
71 / 86
ADNI Case Study II
ADNI-1: Imaging Data Preprocessing and Voxel Filtering
. Individual Jacobian maps were created to estimate 3D patterns of
structural brain change over time - longitudinal maps of tissue change
were spatially normalized across subjects by nonlinearly aligning all
individual Jacobian maps to an average group template minimal
deformation target (MDT)
.2 For each one of the Q ∗ = 2, 153, 231 voxels, we obtain a single
1
real-value measurement that capture the temporal changes at that
voxel by fitting a linear regression model with time as covariate and use
estimated slope as the associated phenotype
. All voxels where the difference in the slopes in AD vs CN is not
significantly difference from zero are removed, while also controlling for
sex and age as covariates - the family-wise error rate is controlled by
using a Bonferroni correction, and Q = 148, 023 voxels are retained
3
72 / 86
ADNI Case Study II
AD Imaging Signature
Mean and Std Dev Maps of Slope Coefficients
Increased expansion of ventricular volumes is clear in all subjects, but is most
marked in AD patients, where ventricular volumes expand by an average 1.2% per
year (whiter regions in left hand subplot).
AD patients show the most variation in structural change over time
73 / 86
ADNI Case Study II
AD Imaging Signature: P-values Map
0
8
16
24
32
P-values (− log10 scale) obtained from voxelwise ANOVA models
The final set of Q = 148, 023 selected voxels with p-values exceeding a
Bonferroni-corrected threshold αB = 0.05/2153231, (− log10 αB = 7.6) are
highlighted in yellow.
74 / 86
ADNI Case Study II
Sample Proximities using 3D Projections
MDS plot illustrating the spread of imaging signatures across samples
1000
500
0
-500
AD
CN
-1000
-1500
1000
500
1000
0
-1000
-500
-1000
2000
0
-2000
-3000
75 / 86
ADNI Case Study II
Mapping SNPs to Pathways
76 / 86
ADNI Case Study II
Pathway Sizes and SNP Overlaps
Pathway size (SNPs)
100
Overlapping pathways per SNP
50000
frequency (SNPs)
frequency (pathways)
80
60
40
20
00
40000
30000
20000
10000
1000
2000 3000 4000
number of SNPs
5000
6000
0
5
10 15 20 25 30 35
number of overlapping pathways
40
77 / 86
ADNI Case Study II
Top 15 Pathways Ranked by the PsRRR Algorithm
Selection Frequencies obtained from 1000 subsamples
Rank
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
KEGG pathway name
π path
Size
Insulin signaling pathway
Vascular smooth muscle contraction
Melanogenesis
Focal adhesion
Gap junction
Huntingtons disease
Purine metabolism
Pyruvate metabolism
Propanoate metabolism
Amyotrophic lateral sclerosis als
Chemokine signaling pathway
Phosphatidylinositol signaling system
Citrate cycle tca cycle
Glycosphingolipid biosynthesis globo series
Alzheimers disease
0.524
0.456
0.331
0.232
0.180
0.155
0.154
0.153
0.152
0.151
0.145
0.138
0.137
0.135
0.127
1517
3236
1638
4009
2350
1980
2896
456
471
865
2769
2067
210
227
2500
78 / 86
ADNI Case Study II
Top 15 SNPs and Genes Ranked by sRRR
SNP and gene ranking performed on the highly-ranked pathways
Rank
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
SNP
rs4788426
rs11074601
rs263264
rs13189711
rs680545
rs4622543
rs9896483
rs1052610
APOϵ4
rs1254403
rs4730205
rs889130
rs6973616
rs9906543
rs2229611
SNP RANKING
π SNP
Mapped gene(s)
0.451
0.429
0.411
0.392
0.302
0.290
0.274
0.267
0.251
0.234
0.207
0.174
0.167
0.164
0.163
PRKCB
PRKCB
ADCY8
ADCY2
HK2
PRKCA
PRKCA
PIK3R3
TOMM40 APOE
MYLK
PIK3CG
COL5A3
GNAI1
ACACA
G6PC
Gene
GENE RANKING
π gene
# SNPs
PRKCB
ADCY8
ADCY2
HK2
PRKCA
PIK3R3
MYLK
PIK3CG
COL5A3
GNAI1
ACACA
G6PC
DGKA
CR1
TOMM40
0.451
0.411
0.392
0.302
0.290
0.267
0.234
0.207
0.174
0.167
0.164
0.163
0.160
0.154
0.152
73
69
106
28
99
9
24
9
14
22
23
6
3
21
6
79 / 86
ADNI Case Study II
Known AD Genes Included in the Study
12 out of 30 known genes in Braskie, Ringman, and Thompson, 2011
These genes: (a) map to a KEGG pathway and (b) have a genotyped
SNP within 10kbp.
Implicated gene
Mapped genes in study
TOMM40
ACE
EPHA4
CCR2
APOE
FAS
CHRNB2
EFNA5
LDLR
CR1
GRIN2B
IL8
TOMM40 APOE PVRL2
ACE
EPHA4
CCR2 CCR5
TOMM40 APOE PVRL2
FAS
ADAR CHRNB2
EFNA5
LDLR
CR1 CR2
GRIN2B
IL8
80 / 86
ADNI Case Study II
AD Genes Enrichment Score in Top Pathways
Distribution of AD gene
enrichment scores obtained
when permuting pathway
rankings 100, 000 times.
The AD gene enrichment
score has p-value p = 0.0051.
empirical enrichment score
14000
12000
frequency
The vertical black line
indicates the observed AD
gene enrichment score using
the true pathway rankings
obtained in the study.
16000
10000
8000
6000
4000
2000
0
600
800
1000 1200 1400 1600 1800 2000 2200 2400
AD gene enrichment score
81 / 86
ADNI Case Study II
Biological Relevance
High-ranking, AD endophenotype-associated pathways include those
describing insulin signalling, vascular smooth muscle contraction and
focal adhesion- all known to be implicated in AD biology
Other functions previously associated with AD biology among
high-ranking pathways include those related to focal adhesion, gap
junctions, chemokine signalling and phosphatidylinositol signalling
High ranking genes include a number previously linked in gene
expression studies to β-amyloid plaque formation in the AD brain
(PIK3R3; PIK3CG; PRKCA and PRKCB), and to AD related changes
in hippocampal gene expression (ADCY2, ACTN1, ACACA, GNAI1).
Other high ranking previously validated AD endophenotype-related
genes include CR1, TOMM40 and APOE.
82 / 86
Conclusions
1.
The Univariate Approach: A Brief Review
2.
Multivariate Models for Voxelwise GWAS
3.
Comparative Power Assessment
4.
ADNI Case Study I
5.
Multivariate Models for Voxelwise Pathways GWAS
6.
ADNI Case Study II
7.
Conclusions
83 / 86
Conclusions
Conclusions
Sparse reduced-rank regression model combined with a data
resampling scheme provides a strategy for SNP and phenotype
prioritisation and ranking
Different penalties induce different sparsity patterns and allow prior
knowledge (e.g. about gene-gene interactions or phenotypic structures)
to be easily included in the model
Extensive realistic simulation results show that sRRR is more powerful
than mass univariate linear modelling and other models for pathways
selection
Real studies on Alzheimer’s disease have confirmed known causal
variants and pathways implicated with the AD biology
Other studies on Multiple Sclerosis (not presented here) have also
confirmed that sRRR is a valid approach for neuroimaging genetics
84 / 86
Conclusions
References
1.
Vounou M., Nichols T. and Montana G. (2010) Discovering genetic associations
with high-dimensional neuroimaging phenotypes: a sparse reduced-rank regression
approach. NeuroImage.
2.
Vounou, M. al (2012) Sparse reduced-rank regression detects genetic associations
with voxel-wise longitudinal phenotypes in Alzheimers disease. NeuroImage
3.
Jasounova, E. et al (2012) Biomarker discovery for sparse classification of brain
images in Alzheimers disease. Annals of Computer Vision Association. To appear
4.
Silver, M. and Montana, G. (2012) Fast identification of biological pathways
associated with a quantitative trait using group lasso with overlaps. Statistical
Applications in Genetics and Molecular Biology.
5.
Silver et al (2012) Identification of gene pathways implicated in Alzheimer’s disease
using longitudinal imaging phenotypes with sparse regression. Preprint
85 / 86
Conclusions
Software Availability
Coming up soon ...
Open-source software will be released later this year
R and Python libraries for sRRR model fitting and data resampling
using CUDA for GPU computing
Python library for P-sRRR fitting and resampling using Parallel Python
Python Scripts for SNP-to-Pathways mapping
Email me: g.montana@ic.ac.uk
86 / 86
Download