
Imaging Genetics
Multivariate Approaches: Joint Modelling
Giovanni Montana
Statistics Section
Department of Mathematics
Imperial College
London, UK
Email: gmontana@imperial.ac.uk
6 June, 2010
Imaging genetics
Basic data structure: p genotypes and q phenotypes observed on n samples.
For each subject i, i = 1, . . . , n, we observe
    genotypes (x_i1, x_i2, . . . , x_ip)
    phenotypes (y_i1, y_i2, . . . , y_iq)
A classification of selected imaging genetics studies
- Small q, small p
  - Joyner et al. (2009): q = 4 brain size measures; p = 11 SNPs
- Small q, large p
  - Potkin et al. (2009): q = 1 mean BOLD signal; p = 317,503 SNPs
- Large q, small p
  - Filippini et al. (2009): q = 29,812 voxels; p = 1 SNP
- Large q, large p
  - Stein et al. (2010): q = 31,622 voxels; p = 448,293 SNPs
Mass Univariate Linear Models (MULM)
- A commonly used approach is to model one genotype and one phenotype at a time:
  1. Fit all univariate linear regression models
         y_j = β_jk x_k + ε,   j = 1, . . . , q,   k = 1, . . . , p
  2. Search for a subset of p′ significant genotypes with indices
         {k_1, k_2, . . . , k_p′} ⊂ {1, 2, . . . , p},   with p′ ≪ p,
     by testing all (p × q) null hypotheses of no association
         H_0 : β_jk = 0
  3. Correct for multiple testing – control experiment-wise FWER or FDR
  (a minimal sketch of this procedure follows below)
- Possible dependence patterns among genotypes and phenotypes are ignored at the modelling stage
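As a concrete illustration, here is a minimal MULM sketch in Python. The data are simulated stand-ins (dimensions, effect sizes and the Bonferroni threshold are illustrative assumptions, not part of the original slides):

```python
import numpy as np
from scipy import stats

# Simulated stand-ins for real data: X holds SNP dosages, Y imaging phenotypes.
rng = np.random.default_rng(0)
n, p, q = 200, 50, 10
X = rng.integers(0, 3, size=(n, p)).astype(float)  # minor-allele counts in {0, 1, 2}
Y = rng.normal(size=(n, q))

# 1. Fit one simple regression per (phenotype, genotype) pair and keep the p-value.
pvals = np.empty((q, p))
for j in range(q):
    for k in range(p):
        pvals[j, k] = stats.linregress(X[:, k], Y[:, j]).pvalue

# 2.-3. Test all p*q hypotheses H0: beta_jk = 0; the Bonferroni correction used
# here is one simple way to control the experiment-wise FWER.
significant = pvals < 0.05 / (p * q)
print(significant.sum(), "associations survive the correction")
```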
Multivariate predictive modelling
- Why model multiple genotypes and phenotypes?
  - A weak effect may be more apparent when other causal effects are already accounted for
  - A false signal may be weakened by inclusion in the model of a stronger signal from a true causal association
  - A weak effect may be more apparent if multiple phenotypes are affected
- The basic strategy is to build a linear regression model that includes all genotypes (predictors) and all phenotypes (responses) and then perform variable selection
- The models covered here are:
  - Penalised multiple linear regression (any p, q = 1)
  - Penalised sparse canonical correlation analysis (any p and q)
  - Penalised reduced-rank regression (any p and q)
Multiple genotypes and one phenotype
The multiple linear regression model with univariate response
- Fit the multiple linear regression model
      y = Σ_{k=1}^p β_k x_k + ε
  by solving
      β̂^ols = argmin_β { Σ_{i=1}^n (y_i − Σ_{k=1}^p x_ik β_k)² }
- Or, more compactly, minimise the error function
      RSS(β) = (y − Xβ)^T (y − Xβ)
- When n > p, the OLS solution is given by
      β̂^ols = (X^T X)^{−1} X^T y
- Which genotypes best predict y?
Penalised multivariate regression for genotype selection
- One-step approach: fit the multiple linear regression model while finding a subset of p′ important predictors with indices
      {j_1, j_2, . . . , j_p′} ⊂ {1, 2, . . . , p},   with p′ ≪ p,
  all having non-zero regression coefficients
- This can be achieved by fitting a penalised regression model
      β̂^pen = argmin_β { Σ_{i=1}^n (y_i − Σ_{k=1}^p x_ik β_k)² }   such that g(β) < t
- The function g(β) imposes a constraint on the size of β
- The complexity parameter t controls the trade-off between the OLS (unpenalised) solution and the penalised solution
Penalised multivariate regression for genotype selection
Illustration
[Diagram: genotypes x_1, x_2, x_3, . . . , x_p, each linked to the phenotype y through its regression coefficient β_1, . . . , β_p.]
Ridge regression
- Ridge regression finds β̂ subject to Σ_{k=1}^p β_k² < t
- The problem can be rewritten as
      β̂^ridge = argmin_β { Σ_{i=1}^n (y_i − Σ_{k=1}^p x_ik β_k)² + λ Σ_{k=1}^p β_k² }
  or more compactly
      RSS(β, λ) = (y − Xβ)^T (y − Xβ) + λ β^T β
- λ controls the amount of shrinkage
- Some properties are:
  - Closed-form solution β̂^ridge = (X^T X + λI)^{−1} X^T y (see the sketch below)
  - Useful when the data matrix X is singular and X^T X is not invertible
  - Bias-variance trade-off – better predictions
  - Grouping effect – correlated variables get similar coefficients
  - No variable selection
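A minimal sketch of the closed-form ridge solution, on simulated data with p > n (the dimensions and effect sizes are illustrative assumptions):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution (X^T X + lam*I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# With p > n the OLS problem is singular, but the ridge solution still exists:
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 50))
beta = np.zeros(50)
beta[:2] = [1.0, -0.5]
y = X @ beta + 0.1 * rng.normal(size=20)
print(ridge(X, y, lam=1.0)[:2])  # shrunken estimates of the two true effects
```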
Lasso regression
- Lasso regression finds β̂ subject to Σ_{k=1}^p |β_k| < t, or
      β̂^lasso = argmin_β { Σ_{i=1}^n (y_i − Σ_{k=1}^p x_ik β_k)² + λ Σ_{k=1}^p |β_k| }
- Performs both continuous shrinkage and variable selection
- λ controls the amount of sparsity
- For instance, with p = 2:
  [Figure: constraint regions for the Lasso and Ridge penalties with p = 2.]
Example: Lasso regression, regularisation path
p = 100 genotypes and q = 1 phenotype; only two strong predictors, 98 noise variables.
[Figure: regularisation path – coefficient estimates plotted against the number of selected predictors (1 to 100).]
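A path of this kind can be traced with scikit-learn's lasso_path on similar simulated data (the effect sizes and noise level below are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import lasso_path

# Same setting as the figure: 2 strong predictors, 98 noise variables.
rng = np.random.default_rng(2)
n, p = 100, 100
X = rng.normal(size=(n, p))
y = 1.0 * X[:, 0] + 0.8 * X[:, 1] + 0.5 * rng.normal(size=n)

# lasso_path returns the coefficients along a decreasing grid of penalties.
alphas, coefs, _ = lasso_path(X, y)
model_sizes = (coefs != 0).sum(axis=0)  # number of selected Xs at each penalty
print(model_sizes[:10])                 # the strong predictors enter first
```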
Elastic net regression
Convex combination of L1 and L2 penalties
- The elastic net regression solves
      β̂^elnet = argmin_β { Σ_{i=1}^n (y_i − Σ_{k=1}^p x_ik β_k)² + λ_1 Σ_{k=1}^p |β_k| + λ_2 Σ_{k=1}^p β_k² }
- It retains the benefits of both individual penalties (see the sketch below)
- Setting
      α = λ_2 / (λ_1 + λ_2),
  the penalty can be written, up to the overall factor λ_1 + λ_2, as
      (1 − α) ‖β‖_1 + α ‖β‖_2²
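A minimal elastic net fit with scikit-learn, which parameterises the penalty by an overall strength (alpha) and a mixing weight (l1_ratio) rather than by (λ_1, λ_2) directly; the data and parameter values are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 100))
y = X[:, 0] + 0.8 * X[:, 1] + 0.5 * rng.normal(size=100)

# l1_ratio plays the role of the mixing weight between the L1 and L2 penalties.
fit = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print((fit.coef_ != 0).sum(), "genotypes retained")
```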
Solving the penalised regression problem
Uncorrelated predictors
- For many penalty functions, the solution can be found efficiently by using component-wise soft-thresholding updates of the OLS estimate
- For a given λ:
  1. Find the OLS estimates β̂^ols = (β̂_1^ols, β̂_2^ols, . . . , β̂_p^ols)
  2. Cycle over the single coefficients and apply the thresholding update (see the sketch below)
     - Component-wise Lasso update:
           β̂_k^lasso = sign(β̂_k^ols) (|β̂_k^ols| − λ/2)_+
       where (a)_+ = a if a > 0, and 0 otherwise
     - Component-wise Elastic net update:
           β̂_k^elnet = 1/(1 + λ_2) · sign(β̂_k^ols) (|β̂_k^ols| − λ_1/2)_+
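The two updates above in code form, following the slide's λ/2 convention:

```python
import numpy as np

def soft_threshold(b_ols, lam):
    """Component-wise lasso update: sign(b) * (|b| - lam/2)_+ ."""
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam / 2.0, 0.0)

def elastic_net_update(b_ols, lam1, lam2):
    """Component-wise elastic net update: soft-thresholding plus extra shrinkage."""
    return soft_threshold(b_ols, lam1) / (1.0 + lam2)

print(soft_threshold(np.array([0.9, -0.3, 0.05]), lam=0.4))  # [ 0.7 -0.1  0. ]
```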
Solving the penalised regression problem
Correlated predictors – coordinate descent
- Update each component in turn while holding all the others fixed
- Given λ, cycle over k = 1, 2, . . . , p, 1, 2, . . . until convergence:
  1. Compute the partial residual
         r_ik = y_i − Σ_{j≠k} x_ij β_j
  2. Compute the OLS coefficient of these residuals on the kth predictor
         β̂_k^ols = (1/n) Σ_{i=1}^n x_ik r_ik
  3. Apply soft-thresholding, depending on the penalty, e.g.
         β̂_k^lasso = sign(β̂_k^ols) (|β̂_k^ols| − λ/2)_+
  (a minimal implementation follows below)
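A minimal coordinate-descent lasso, assuming standardised columns (the fixed number of sweeps stands in for a proper convergence check):

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for the lasso, following steps 1-3 above.
    Assumes the columns of X are standardised to mean 0 and variance 1."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for k in range(p):
            # 1. partial residual, excluding the kth predictor's contribution
            r = y - X @ beta + X[:, k] * beta[k]
            # 2. OLS coefficient of the partial residuals on the kth predictor
            b_ols = X[:, k] @ r / n
            # 3. soft-thresholding update
            beta[k] = np.sign(b_ols) * max(abs(b_ols) - lam / 2.0, 0.0)
    return beta
```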
Variable selection in practice
Selection of the sparsity parameter by cross-validation
- How do we choose the optimal sparsity parameter λ?
- A common procedure is leave-one-out cross-validation (LOOCV)
- For each value of λ:
  1. Leave one sample out for testing
  2. Use the remaining n − 1 samples for training – fit the model
  3. Compute the prediction error on the test sample
  4. Repeat for all n samples and take the average prediction error
- The optimal λ minimises the cross-validated prediction error (see the sketch below)
- Various search strategies can be used to explore the space Λ
- In practice, it does not always work well at detecting the true sparse solution
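One way to run this procedure is scikit-learn's LassoCV, which evaluates a grid of penalties; passing cv=LeaveOneOut() makes it exact LOOCV (the data below are an illustrative assumption):

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 30))
y = X[:, 0] - 0.5 * X[:, 1] + 0.3 * rng.normal(size=60)

# LassoCV tries a grid of penalty values; cv=LeaveOneOut() gives LOOCV.
fit = LassoCV(cv=LeaveOneOut()).fit(X, y)
print("optimal lambda:", fit.alpha_)
print("selected genotypes:", np.flatnonzero(fit.coef_))
```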
Variable selection in practice
The stability selection approach (Meinshausen and Bühlmann, 2009)
- Stability selection is an alternative approach which avoids searching for an optimal regularisation parameter
- The procedure works as follows (a sketch is given below):
  1. Extract B subsamples (e.g. of size n/2) from the training data set
  2. For each subsample, fit the sparse regression model
  3. Estimate the probability of each predictor being selected
  4. Select all those predictors whose selection probability is above a pre-determined threshold
- Under some assumptions, this procedure controls the expected number of false positives
- Unlike LOOCV, it does not depend heavily on the regularisation parameter λ
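A sketch of the subsampling loop. For brevity it fixes a single penalty value lam, whereas the full procedure typically aggregates selection probabilities over a range of λ values; B, threshold and lam are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso

def stability_selection(X, y, lam=0.1, B=100, threshold=0.8, seed=0):
    """Steps 1-4 above, for a single penalty value lam."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(B):
        idx = rng.choice(n, size=n // 2, replace=False)  # 1. subsample of size n/2
        fit = Lasso(alpha=lam).fit(X[idx], y[idx])       # 2. fit the sparse model
        counts += fit.coef_ != 0
    probs = counts / B                                   # 3. selection probabilities
    return np.flatnonzero(probs >= threshold), probs     # 4. thresholding
```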
Example: Lasso regression, stability path
p = 100 genotypes and q = 1 phenotype. The true model is y = β_1 x_1 + β_2 x_2 + ε.
[Figure: stability path – selection probabilities plotted against the number of selected predictors (1 to 5).]
Latent variable models for one phenotype
Simultaneous dimensionality reduction and variable selection
[Diagram: each latent variable t_j, j = 1, . . . , p, is a weighted combination of the genotypes x_1, . . . , x_p with weights u_j1, u_j2, . . . , u_jp; the latent variables predict the phenotype y.]
Each t_j is a latent variable with some optimal properties, e.g. maximal variance.
Modelling multiple genotypes and phenotypes
Multivariate multiple linear regression: Y = XC + E
- If n were greater than p, C could be estimated by least squares as
      Ĉ = (X^T X)^{−1} X^T Y
  and Ĉ would be of full rank R = min(p, q)
- No real gain – same solutions as fitting q separate regression models (see the check below)
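A quick numerical check of that last point, on simulated data (dimensions are illustrative assumptions):

```python
import numpy as np

# The matrix least-squares estimate of C coincides with fitting q separate
# univariate regressions, one phenotype column at a time.
rng = np.random.default_rng(4)
n, p, q = 100, 10, 5
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=(p, q)) + 0.1 * rng.normal(size=(n, q))

C_joint = np.linalg.lstsq(X, Y, rcond=None)[0]  # (X^T X)^{-1} X^T Y
C_cols = np.column_stack(
    [np.linalg.lstsq(X, Y[:, j], rcond=None)[0] for j in range(q)]
)
print(np.allclose(C_joint, C_cols))  # True
```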
Reduced rank regression (RRR)
- An alternative approach is to impose a rank condition on the regression coefficient matrix so that
      rank(C) = r < min(p, q)
- If C has rank r, it can be written as the product of a (p × r) matrix B and an (r × q) matrix A, both of full rank
- The RRR model is written
      Y = X B A + E
- For a fixed rank r, the matrices A and B are obtained by minimising the weighted least squares criterion
      M = Tr{ (Y − XBA) Γ (Y − XBA)^T }
  for a given (q × q) positive definite matrix Γ
Reduced rank regression (RRR)
Illustration
[Illustration figure not reproduced.]
Reduced rank regression
Solutions
- The optimal Â and B̂ are obtained as
      Â = H^T Γ^{−1/2}
      B̂ = (X^T X)^{−1} X^T Y Γ^{1/2} H
- H is the (q × r) matrix whose columns are the first r normalised eigenvectors associated with the r largest eigenvalues of the (q × q) matrix
      R = Γ^{1/2} Y^T X (X^T X)^{−1} X^T Y Γ^{1/2}
  (a sketch of these computations follows below)
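A direct translation of these formulas into numpy (Γ = I by default; the function name and defaults are choices made here, not part of the original slides):

```python
import numpy as np

def rrr(X, Y, r, Gamma=None):
    """Rank-r reduced-rank regression following the formulas above."""
    q = Y.shape[1]
    if Gamma is None:
        Gamma = np.eye(q)
    # Symmetric square root of Gamma, and its inverse, via eigendecomposition
    w, V = np.linalg.eigh(Gamma)
    G_half = V @ np.diag(np.sqrt(w)) @ V.T
    G_inv_half = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    C_ols = np.linalg.solve(X.T @ X, X.T @ Y)  # full-rank OLS estimate
    R = G_half @ Y.T @ X @ C_ols @ G_half      # the (q x q) matrix above
    vals, vecs = np.linalg.eigh(R)             # eigenvalues in ascending order
    H = vecs[:, ::-1][:, :r]                   # top-r eigenvectors, (q x r)
    A = H.T @ G_inv_half                       # (r x q)
    B = C_ols @ G_half @ H                     # (p x r)
    return B, A
```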
Sparse RRR
Vounou et al. (2010)
Latent variable models for multiple phenotypes
Find latent variable pairs (t_j, s_j) satisfying some optimal properties
[Diagram: each genotype latent variable t_j combines x_1, . . . , x_p with weights u_j1, . . . , u_jp, and each phenotype latent variable s_j combines y_1, . . . , y_q with weights v_j1, . . . , v_jq, for j = 1, . . . , min(p, q).]
Canonical Correlation Analysis (CCA)
- Extract canonical variates (t_j, s_j), j = 1, . . . , r, with r ≤ min(p, q)
- The first pair of vectors u_1 and v_1 maximises
      ρ_1 = cor(Xu_1, Yv_1) = cor(t_1, s_1)
  and is found by solving
      (u_1, v_1) = argmax_{‖u‖=1, ‖v‖=1}  u^T S_xy v / √(u^T S_xx u · v^T S_yy v)
  where S_xx, S_yy and S_xy are sample covariance matrices
- By construction:
  - canonical correlations are ordered, with ρ_1 ≥ ρ_2 ≥ . . . ≥ ρ_r
  - cov(t_i, t_j) = 0 and cov(s_i, s_j) = 0 for all i ≠ j
  - cov(t_i, s_j) = 0 for all i ≠ j
- When n < min(p, q) some regularisation is needed
- CCA is a special case of RRR
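A small CCA example with scikit-learn, whose CCA estimates the canonical variates iteratively rather than via the eigen-solution; the shared-latent-signal data are an illustrative assumption:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Two views sharing one latent signal.
rng = np.random.default_rng(5)
n = 200
t = rng.normal(size=n)
X = np.column_stack([t + rng.normal(size=n) for _ in range(4)])
Y = np.column_stack([t + rng.normal(size=n) for _ in range(3)])

# The transformed scores are the canonical variate pairs (t_j, s_j).
scores_x, scores_y = CCA(n_components=2).fit(X, Y).transform(X, Y)
rho1 = np.corrcoef(scores_x[:, 0], scores_y[:, 0])[0, 1]
print("first canonical correlation:", round(rho1, 2))
```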
Example: sparse RRR (1/3)
p = 100, q = 100; (y_1, y_2) depend on (x_1, x_2) (red) and (y_3, y_4) depend on (x_3, x_4) (green).
Genotypes selected from the 1st rank:
[Figure: stability path – selection probabilities against the number of selected predictors; 100 phenotypes, 5 selected.]
Example: sparse RRR (2/3)
Genotypes selected from the 2nd rank:
[Figure: stability path – selection probabilities against the number of selected predictors; 100 phenotypes, 5 selected.]
Example: sparse RRR (3/3)
Genotypes selected from the 3rd rank:
[Figure: stability path – selection probabilities against the number of selected predictors; 100 phenotypes, 5 selected.]
Association studies using penalised regression
- Hoggart et al. (2008)
  - Propose a penalised likelihood approach – equivalent to the Lasso
  - Use a stochastic search maximisation algorithm – not as efficient as coordinate descent
  - Propose an approximation for the type-I error
- Wu et al. (2009)
  - Propose a sparse logistic regression approach for case-control studies
  - Use coordinate descent to compute the sparse solution
  - Include two-way and higher-order interactions after marginal effects have been detected
- Vounou et al. (2010)
  - Use sparse regression with multiple phenotypes (sparse RRR)
  - Tailored for imaging genetics studies
  - Perform simulation studies to assess statistical power
Statistical power comparison
A Monte Carlo simulation framework (Vounou et al., 2010)
- Generate an entire population P of 10k individuals
  - Use a forwards-in-time simulation approach (FREGENE)
  - Reproduce features observed in real human populations
  - Genotypes coded as minor-allele SNP dosage
- Generate B Monte Carlo data sets of sample size n each (a simplified sketch follows below):
  1. Randomly sample n genotypes x from the population P
  2. Simulate the n phenotypes y from a multivariate normal distribution calibrated on real data (ADNI database)
  3. Induce an association according to an additive genetic model
- Settings: p between 1,000 and 40,000; 10 predictive SNPs with small marginal effects; q = 111 with 6 true responses
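A heavily simplified stand-in for one Monte Carlo data set: FREGENE genotypes and the ADNI-calibrated covariance are replaced here by random surrogates (the MAF range, the equicorrelation structure and the effect size 0.1 are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, q = 500, 1000, 111
maf = rng.uniform(0.05, 0.5, size=p)
X = rng.binomial(2, maf, size=(n, p)).astype(float)  # minor-allele SNP dosages
Sigma = 0.3 + 0.7 * np.eye(q)                        # crude ROI correlation surrogate
Y = rng.multivariate_normal(np.zeros(q), Sigma, size=n)

# Additive genetic model: 10 causal SNPs with small effects on 6 true responses.
causal = rng.choice(p, size=10, replace=False)
Y[:, :6] += 0.1 * X[:, causal].sum(axis=1, keepdims=True)
```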
Genotype simulation
Linkage disequilibrium patterns
[Figure: SNP LD coefficients.]
Phenotype simulation
q = 111 ROIs obtained from the GSK CIC Brain Atlas using ADNI images
[Figure: ROI correlation coefficients.]
SNP sensitivity with n = 500
[Figure not reproduced.]
SNP sensitivity with n = 1000
[Figure not reproduced.]
Large p
Ratio of SNP sensitivities (sRRR/MULM) as a function of the total number of SNPs
[Figure not reproduced.]
References I
Filippini, N., Rao, A., Wetten, S., et al. (2009). Anatomically-distinct genetic associations of APOE epsilon4 allele load with regional cortical atrophy in Alzheimer's disease. NeuroImage, 44(3):724–8.
Hoggart, C., Whittaker, J., De Iorio, M., and Balding, D. (2008). Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genetics, 4(7).
Joyner, A. H., Roddey, J. C., Bloss, C. S., et al. (2009). A common MECP2 haplotype associates with reduced cortical surface area in humans in two independent populations. PNAS, 106(36):15475–15480.
Meinshausen, N. and Bühlmann, P. (2009). Stability selection. Annals of Statistics.
Potkin, S. G., Turner, J. A., Guffanti, G., et al. (2009). A genome-wide association study of schizophrenia using brain activation as a quantitative phenotype. Schizophrenia Bulletin, 35(1):96–108.
Shen, L., Kim, S., Risacher, S. L., Nho, K., et al. (2010). Whole genome association study of brain-wide imaging phenotypes for identifying quantitative trait loci in MCI and AD: A study of the ADNI cohort. NeuroImage, pages 1–13.
References II
Stein, J. L., Hua, X., Lee, S., et al. (2010). Voxelwise genome-wide association study (vGWAS). NeuroImage.
Vounou, M., Nichols, T., and Montana, G. (2010). Discovering genetic associations with high-dimensional neuroimaging phenotypes: a sparse reduced-rank regression approach. NeuroImage (under revision).
Witten, D., Tibshirani, R., and Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515.
Wu, T., Chen, Y., Hastie, T., Sobel, E., and Lange, K. (2009). Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics, 25(6):714.