Modelling Genetic Variation in the Brain Thomas Nichols, PhD Department of Statistics,

Modelling Genetic Variation
in the Brain
Thomas Nichols, PhD
Department of Statistics,
Warwick Manufacturing Group
University of Warwick
joint with
Becky Inkster
Institute of Psychiatry
King’s College London
(GSK3β & WNT
pathway VBM)
Maria Vounou,
Giovanni Montana
Statistics Section, Dept.
of Mathematics
Imperial College
(Sparse Reduced Rank
Regression)
Outline
•  Background
–  Structural brain imaging & VBM
–  Genetics
–  “Imaging Genetics”
•  Candidate SNP VBM
•  Multivariate SNP analyses
Neuroimaging Background:
Structural Brain Image Data
•  Morphometry
–  Quantification of shape/volume of
brain structures
•  Traditional Morphometric Analysis
–  Laborious hand-tracing of structures
–  Accurate, but imperfect inter-rater
reliability
•  Voxel Based Morphometry
–  Automated morphometry method
Voxel Based Morphometry
Original MRI → Segment → Warp to Atlas Space → Modulate → Smooth
[Figure: T1-weighted MRI and segmented Gray Matter, both in subject space]
Voxel Based Morphometry
Original MRI → Segment → Warp to Atlas Space → Modulate → Smooth
Modulation
Gives units of subject GM volume in atlas space
Allows analysis in common space while retaining individual differences
[Figure: T1-weighted MRI, Gray Matter, and Modulated Gray Matter, all in atlas space]
Voxel Based Morphometry
Original MRI → Segment → Warp to Atlas Space → Modulate → Smooth
Smoothing
•  Accounts for imperfect registration of individuals to atlas
–  Even identical twins have different cortical foldings
–  Exact match impossible
•  Discards fine spatial details in exchange for reduced noise
–  Generally searching for moderate scale differences
Done!
•  3D image is n=1
–  A single (100,000-dimensional) phenotypic measurement on 1 individual
[Figure: Modulated Gray Matter and Smoothed, Modulated GM, both in atlas space]
Genetics Background
•  Genotype
–  The genetic constitution of an organism or cell
–  46 chromosomes in humans
–  23 pairs of homologous chromosomes
•  One from each parent
•  Gene
–  A series of basepairs (DNA bits) which code
for a trait
–  Four different possible bases, the nucleotides
•  Adenine, Thymine, Cytosine, & Guanine
Genetics Background
•  Single Nucleotide Polymorphisms (SNP)
–  Locations where single base-pair differences have been found in the population
•  SNP Example
–  If some of the population has sequence…
AATGTGATAGCTT
–  And if remaining has…
AATGTGACAGCTT
–  We have found a SNP!
SNP
•  SNP data
–  Homologous chromosomes
–  For each SNP, for each individual: 0, 1 or 2 count
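The 0, 1 or 2 coding can be sketched as follows. This is a minimal illustration that assumes each genotype arrives as a pair of bases (one per homologous chromosome) and that the rarer base in the sample is the one counted; the function name and toy sample are hypothetical.

```python
from collections import Counter

def minor_allele_counts(genotypes):
    """Code each genotype (a pair of bases) as 0, 1 or 2 copies of the minor allele."""
    allele_freq = Counter(base for pair in genotypes for base in pair)
    # Assumption: the least frequent base in this sample is the minor allele.
    minor = min(allele_freq, key=allele_freq.get)
    return minor, [pair.count(minor) for pair in genotypes]

# Toy sample at the SNP in the example above (T vs. C at the differing position).
sample = [("T", "T"), ("T", "C"), ("C", "C"), ("T", "T"), ("T", "C")]
minor, counts = minor_allele_counts(sample)
print(minor, counts)  # C [0, 1, 2, 0, 1]
```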
Genetics Background
•  Millions of SNPs
•  Thanks to correlation (linkage disequilibrium), only need ≈500k to “tag” all variation
[Figure: three bar charts by chromosome (1 to 22, X, Y): Number of basepairs per Chromosome (3,079,843,747 Base Pairs in total), Number of SNPs per Chromosome (20,296,765 SNPs in total), Number of Genes per Chromosome (32,185 Genes in total)]
Genetics Background
•  SNPs vs. genes
–  Each gene often has several variants
–  1 or more (but not many) SNPs typically
needed to identify a gene
–  SNPs may not lie directly on coding portion of
gene
•  Due to linkage disequilibrium (correlation), close is
good enough
•  Non-coding, regulatory region may be causal
[Figure: SNPs and Exons by location on chromosome]
Imaging Genetics
•  Motivation
–  Brain structure heritable
–  Objective, reproducible phenotype
•  Important in psychiatry
–  Current best measures are coarse, with weak reproducibility
»  e.g. HAM-D (depression), MMSE (cognition, AD)
–  Sensitive
•  Brain anatomy/function closer to disease process than other measures
–  Use to corroborate other findings
•  E.g. Large WGA finds modest significance; use brain imaging to build confidence in finding

Brain Phenotype              h²
Whole brain volume           0.78
Total gray matter volume     0.88
Total white matter volume    0.85
Glahn, Thompson, Blangero. Hum Brain Mapp 28:488-501, 2007

[Figure: Thickness of Cortical GM (r2)]
[Figure: Heritability of GM Thickness (h² & corrected P-value)]
Thompson et al, Nature Neuro, 4(12):1253-1258, 2001
Thompson & Toga, Annals of Medicine 34(7-8):523-36, 2002
Imaging Genetics Menu
Imaging: Candidate ROI, Many ROI, Voxelwise
Genetics: Candidate SNP, Candidate Gene, Genome-wide SNP, Genome-wide Gene
•  [Filippini et al. 2009] 29,812 voxels, 1 SNP
•  [Joyner et al. 2009] 4 ROIs, 11 SNPs
•  [Potkin et al. 2009] 1 BOLD ROI, 317,503 SNPs
•  [Stein et al. 2010] 31,622 voxels, 448,293 SNPs
•  [Hibar et al. 2011] 31,622 voxels, 18,044 SNPs
(Jason Stein/Andy Saykin/Bertrand Thirion)
Outline
•  Background
–  Structural brain imaging & VBM
–  Genetics
–  “Imaging Genetics”
•  Candidate SNP VBM
•  Multivariate SNP analyses
GSK3β Background
•  High heritability of depression (Kendler et al.
2006; Sullivan et al., 2000).
•  Meta-analytical evidence from MRI studies for a
role of hippocampal integrity in depression
(Campbell et al., 2004).
•  There is strong genetic regulation of neurodevelopment (reviewed by Wilson and Rubenstein, 2000; O’Leary et al., 2002).
•  The Wnt signaling pathway is one network of proteins that plays a role in embryogenesis
•  GSK3β plays a key role in Wnt pathway
[Figure: Wnt Signaling Pathways; regulates the development of the hippocampus]
GSK HiTDIP Study
•  Major Depressive Disorder (MDD) Association Study
–  “High Throughput Human Disease Specific Targets”
–  7,000 SNPs covering 2,000 genes with tractable targets
–  1000 cases, 1000 controls
•  Imaging Subset
–  200 cases, 200 controls (of 1000 & 1000) scanned with
anatomical MRI protocol
–  Optimized VBM with SPM5’s segmentation tool
–  324 images passed QC
•  366 subjects’ data delivered
•  42 subjects set aside (clinical exclusion, pathologies or failed segmentation)
•  Glycogen synthase kinase 3β (GSK3β)
–  Plays key role in WNT pathway, influential in development
Modelling Candidate SNPs
•  Mass Univariate Modelling
–  Fit same univariate linear model at each voxel
•  Quantitative Trait Multiple Regression
–  Linear model fit at each voxel
•  Regressors
–  Genetic
–  Group (Case/Control)
–  Demographic / nuisance variables
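A mass-univariate fit of this kind can be sketched with NumPy. This is only an illustration on simulated data: the regressor set (intercept, SNP count, group, centred age), the planted effect in voxel 0 and all variable names are assumptions, not the study's actual design matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_voxels = 60, 500

snp = rng.integers(0, 3, size=n_subjects)     # 0/1/2 allele counts (genetic regressor)
group = rng.integers(0, 2, size=n_subjects)   # case/control indicator
age = rng.normal(40, 10, size=n_subjects)     # one nuisance variable

# Same design matrix for every voxel: intercept, SNP, group, centred age.
X = np.column_stack([np.ones(n_subjects), snp, group, age - age.mean()])

# Simulated gray matter data, one column per voxel; SNP effect planted in voxel 0.
Y = rng.normal(size=(n_subjects, n_voxels))
Y[:, 0] += 2.0 * snp

# One least-squares call fits the same linear model at every voxel.
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
resid = Y - X @ beta
dof = n_subjects - np.linalg.matrix_rank(X)
sigma2 = (resid ** 2).sum(axis=0) / dof

# t statistic for the SNP regressor at each voxel.
c = np.array([0.0, 1.0, 0.0, 0.0])
var_c = c @ np.linalg.inv(X.T @ X) @ c
t_map = (c @ beta) / np.sqrt(sigma2 * var_c)
print(t_map.shape, int(np.argmax(np.abs(t_map))))
```

The result is a statistic image (here a vector of 500 values) that is then passed to the inference step.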
SNP Models for Gray Matter Data
•  Dominant
•  Recessive
•  Additive
•  Genotypic
[Figure: one panel per model, plotting Gray Matter Volume (Y) against SNP Count (Xj = 0, 1, 2)]
Mass Univariate Modelling
Genetic Effects
•  Concerns about leverage/influence
–  100’s not 1000’s of subjects
–  Rare SNP can make a few
subjects very influential
•  An ever-greater problem as sample size shrinks
•  100 subjects + 10% MAF → 1 subject with rare homozygous genotype expected!
[Figure: Gray Matter Volume (Y) against Allele Count (Xj = 0, 1, 2)]
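The "1 subject expected" figure follows if genotype frequencies are assumed to be in Hardy-Weinberg equilibrium, so that the rare homozygote has frequency MAF². A short check (the function name is illustrative):

```python
def expected_genotype_counts(n_subjects, maf):
    """Expected numbers of subjects with 0, 1 and 2 minor alleles.

    Assumes Hardy-Weinberg equilibrium: frequencies q^2, 2pq and p^2.
    """
    p, q = maf, 1.0 - maf
    return [n_subjects * q * q, n_subjects * 2 * p * q, n_subjects * p * p]

counts = expected_genotype_counts(100, 0.10)
print([round(c, 2) for c in counts])  # [81.0, 18.0, 1.0]
```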
Mass Univariate Modelling
Genetic Effects
•  Ad hoc solution
–  If expected rare genotype frequency <10%, merge genotypes
•  If MAF > 0.31 (≈√0.1)
–  2 DF Genotypic model
•  Additive + Nonadditive Parameterization
–  Additive [ -1 0 +1 ]
–  Nonadditive [ -1/2 +1 -1/2 ] (orthogonalize w.r.t. additive regressor * )
•  If MAF < 0.31
–  Use dominant/recessive model
tested
not tested
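One way to read the ad hoc rule as code is sketched below. It assumes the 0/1/2 values count the minor allele, that "merging genotypes" means pooling rare homozygotes with heterozygotes (carrier vs. non-carrier), and that orthogonalization is done after mean-centring both regressors; these are assumptions of this sketch rather than details given on the slide.

```python
import math

MAF_THRESHOLD = math.sqrt(0.1)  # about 0.316: below this, MAF**2 < 10%

ADDITIVE = {0: -1.0, 1: 0.0, 2: 1.0}
NONADDITIVE = {0: -0.5, 1: 1.0, 2: -0.5}

def _center(x):
    mean = sum(x) / len(x)
    return [v - mean for v in x]

def snp_regressors(counts):
    """Return named regressors for one SNP given minor-allele counts (0/1/2)."""
    maf = sum(counts) / (2 * len(counts))
    if maf < MAF_THRESHOLD:
        # Too few rare homozygotes expected: merge them with heterozygotes.
        return {"carrier": [1.0 if c > 0 else 0.0 for c in counts]}
    add = _center([ADDITIVE[c] for c in counts])
    non = _center([NONADDITIVE[c] for c in counts])
    # Remove from the nonadditive regressor its projection on the additive one.
    denom = sum(a * a for a in add)
    slope = sum(a * b for a, b in zip(add, non)) / denom if denom > 0 else 0.0
    non_orth = [b - slope * a for a, b in zip(add, non)]
    return {"additive": add, "nonadditive": non_orth}

rare = snp_regressors([0] * 8 + [1] * 2)             # MAF = 0.10
common = snp_regressors([0, 0, 1, 1, 1, 2, 2, 0])    # MAF = 0.4375
print(sorted(rare), sorted(common))  # ['carrier'] ['additive', 'nonadditive']
```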
Mass Univariate Modelling
Nuisance Effects
•  Age & Gender
–  Substantial normal variation in GM w/ Age
•  Total Gray matter
–  Accounts for differences in head size
–  Discounts global changes to find localized changes
•  Scanner (Pre/Post Upgrade)
–  Upgrade 2/3-through study altered image contrast
•  Medication (Yes/No, for cases only)
–  Neurotrophic effects reported for some Rx
Model Diagnosis for Imaging
•  Why bother?
–  Largish n, continuous data, Central Limit Theorem should carry us
–  Type I Error generally OK due to robustness of t-test/ANOVA-like models
•  Sensitivity!
–  Decreased sensitivity due to inflated error variance σ²
–  Suboptimal sensitivity due to non-normality
•  How!?
–  100,000 voxels, 400 subjects
–  100,000 QQ plots to look at all 40 million data points?
[Figure: failed GM segmentation due to data formatting error]
[Figure: warping artefacts seen in modulated GM]
Model Diagnosis for Imaging
•  Model summaries
–  Images of diagnostic stats
•  Scan summaries
–  Vectors of ad hoc measures
•  Dynamic graphical tool
–  Explore many summaries simultaneously
–  Easily jump from summary image to plots, from plots to residual images
•  End Result
–  Swiftly localize and understand problems
Luo & Nichols NeuroImage 19:1014–1032, 2003

Model Summaries
Statistic         Assesses          Null Distn
Cook-Weisberg     Var(εi) = σ²      Chi-Squared
Shapiro-Wilk      ε ~ Normal        (tabulated)
Outlier Count     Artifacts         Binomial
Std. Deviation    Artifacts

Scan Summaries
Summary                                          Interpretation
Global intensity                                 Whole-brain signals or artifacts
Outlier Count                                    Artifacts
Any preprocessing parameters, e.g. head size     Suggests cause of artifacts
Experimental predictors                          For investigating mismodelled signal in residuals
http://go.warwick.ac.uk/tenichols/software
Model Diagnosis w/ SPMd
[Figure: SPMd panels: Model Summaries, Scan Summaries, Model Detail, Scan Detail]
Outline / Motivation
•  Data
–  Intro to Voxel Based Morphometry data
•  Model
–  Quantitative trait regression w/ Mass Univariate Model
•  Diagnosis
–  100,000 Q-Q plots anyone?
•  Inference
–  Cluster size under nonstationarity
–  Candidate screening procedure
•  Results
–  GSK3β in MDD
•  Future Directions
Inference On Images:
Voxel-wise vs. Cluster-wise
•  Voxel-wise
–  Reject H0, point-by-point, by statistic magnitude
•  Cluster-wise
–  Define contiguous blobs with arbitrary threshold u_clus
–  Reject H0 for each cluster larger than kα
[Figure: statistic image plotted over space with threshold u_clus; a cluster smaller than kα is not significant, a cluster larger than kα is significant]
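The cluster-wise rule can be illustrated in one dimension. In this toy sketch the statistic values, the cluster-forming threshold `u_clus` and the critical size `k_alpha` are all made up; in practice kα would come from random field theory or permutation, and clusters are defined in 3D.

```python
def find_clusters(stat, u_clus):
    """Return (start, size) for each contiguous run where stat exceeds u_clus."""
    clusters, start = [], None
    for i, value in enumerate(stat):
        if value > u_clus and start is None:
            start = i                              # a new cluster begins
        elif value <= u_clus and start is not None:
            clusters.append((start, i - start))    # the cluster ends
            start = None
    if start is not None:                          # cluster touching the edge
        clusters.append((start, len(stat) - start))
    return clusters

stat = [0.2, 2.6, 2.9, 0.4, 3.1, 3.5, 2.8, 3.0, 0.1, 2.7]
u_clus, k_alpha = 2.5, 3                           # hypothetical values
clusters = find_clusters(stat, u_clus)
significant = [c for c in clusters if c[1] > k_alpha]
print(clusters)     # [(1, 2), (4, 4), (9, 1)]
print(significant)  # [(4, 4)]
```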
Cluster Inference & Stationarity
•  Cluster-wise preferred over voxel-wise
–  Generally more sensitive
Friston et al, NeuroImage 4:223-235, 1996
–  Spatially-extended signals typical
•  Problem w/ VBM
–  Standard cluster methods assume stationarity, constant smoothness
–  Assuming stationarity, false positive clusters will be found in extra-smooth regions
–  VBM noise very non-stationary
•  Nonstationary cluster inference
–  Must un-warp nonstationarity
–  Reported but not implemented
•  Hayasaka et al, NeuroImage 22:676–687, 2004
–  Now available as SPM toolbox
•  http://fmri.wfubmc.edu/cms/software#NS
[Figure: VBM image of FWHM noise smoothness]
[Figure: nonstationary noise… …warped to stationarity]
Inference in Imaging Genetics:
Creeping Multiple Testing Problem
•  Even just with candidate analyses,
Can end up searching over…
–  Genes
–  SNPs within a gene
–  Space (voxels or clusters)
–  Different contrasts on GLM
•  Main effect? By clinical subgroup? Interactions?
•  Can quickly lose confidence in results
–  E.g. 0.005 FWE-corrected is great…
…Unless it’s the 25th statistic image you’ve seen
Inference in Imaging Genetics
Multiple Testing Strategy
•  Define strict primary outcome
–  For given gene, use single SNP
•  Best (large) association study significance, otherwise
•  Best nonsynonymous exonic available, otherwise
•  Best 5 intronic available
–  For each SNP, only consider main effect of gene
•  If fitting gene x group interaction, test for average effect
–  Any association is more likely than a disease-specific association
–  Even if disease-specific association, opposing sign of effect unlikely w/ VBM
–  1-number summary per gene
•  Minimum nonstationary cluster FWE-corrected P-value for association (1 DF F-stat)
–  Bonferroni correction for number of genes
•  Primary outcomes then have strong FWE control
–  Over brain, over genes
–  (1-α)100% confidence of no false positives anywhere
•  Secondary outcomes
–  Interactions, sub-group results
–  Use same FWE-inferences, but mark as post-hoc
Results: Model Diagnosis
Outlier Detection with Shapiro-Wilk
[Figure: image of -log10 P Shapiro-Wilk, showing two outliers, alongside Mean Smoothed Mod. GM]
Results: Model Diagnosis
Characterising Outliers with Standardized Residual Images
[Figure: standardized residual images for Subject 193, Subject 194 (outlier) and Subject 195; Subject 194 raw T1 shows severe enlargement of inferior horn of lateral ventricle]
Note: Compare standardized residuals to +/-6.128 (Bonferroni for 324 images, each with 173,823 voxels, at each a 2-sided test)
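The ±6.128 cut-off can be approximately reproduced under two assumptions that the slide does not state: a family-wise level of α = 0.05 and a standard Normal reference distribution for the standardized residuals.

```python
from statistics import NormalDist

n_images, n_voxels = 324, 173_823
alpha = 0.05                                 # assumed family-wise level

n_tests = n_images * n_voxels                # one test per voxel per image
p_one_sided = alpha / (2 * n_tests)          # Bonferroni, split over two tails
threshold = -NormalDist().inv_cdf(p_one_sided)
print(n_tests, round(threshold, 2))          # threshold lands near the 6.128 quoted above
```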
Results: Outlier Exploration
•  Inferior Horn of Lateral Ventricle
–  In most of us, this is a pencil-lead-thick fluid-filled space
–  In this subject it was pencil-thick
•  Clinical collaborator verified it as abnormal & subject was removed
[Figure: Subject 194 (outlier) alongside a randomly selected control]
GSK3β and Structural Differences
2 SNPs in strong linkage disequilibrium showed significant associations with GM differences in MDD patients:
rs6438552
rs12630592
[Figure: brain regions where SNP clusters show colocalization (R/L orientation marked)]
GSK3β-Gray Matter association in bilateral superior temporal gyrus (STG) and right hippocampus
Inkster, B., Nichols, T. E., Saemann, P. G., Auer, D. P., Holsboer, F., Muglia, P., & Matthews, P. M. (2009).
Association of GSK3β Polymorphisms With Brain Structural Changes in Major Depressive Disorder.
Archives of General Psychiatry, 66(7), 721-728.
AA genotype group associated with
decreased GM concentration in right STG
P = 0.0004
(corrected for whole brain search
and multiple SNP testing)
rs6438552 is a putative functional SNP
i.e. it regulates the selection of splice acceptor sites in vitro.
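Wnt Signaling Pathways
[Figure: pathway diagram; regulates the development of the hippocampus. Genes labelled: WNT3A, FZD3, KRM1, DVL2, CTNNB1, AXIN2, TCF4, LEF1, SMAD1, PPARGC1A, EMX2, ZEB2]
WNT pathway genes
[Figure: brain maps (R marks right side) for ZEB2, FZD3, DVL2, AXIN2, GSK3β, SMAD1, PPARGC1A, EMX2]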
Inkster, B., Nichols, T. E., Saemann, P. G., Auer, D. P., Holsboer, F., Muglia, P., & Matthews, P. M. (2010).
Pathway-based approaches to imaging genetics association studies: WNT signaling, GSK3beta substrates and major depression.
NeuroImage, 53(3), 908-917.
Outline
•  Background
–  Structural brain imaging & VBM
–  Genetics
–  “Imaging Genetics”
•  Candidate SNP VBM
•  Multivariate SNP analyses
Possible Mass-Univariate Analyses
•  Voxel/Region QTL
–  Whole genome association
–  Must have right ROI
–  500,000 SNPs: ≈ 10^6 tests
•  Candidate SNP
–  Full image result
–  Must have right SNP
–  100,000 voxels: ≈ 10^5 tests
•  Full cross analysis
–  Massive multiple testing problem!
–  500,000 SNPs × 100,000 voxels: ≈ 10^10 tests!
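To make the scale of these three options concrete, the sketch below counts the tests and the per-test threshold a plain Bonferroni correction would require. The level α = 0.05 is an assumption, and Bonferroni ignores the strong dependence among voxels and among SNPs, so it is a conservative yardstick only.

```python
n_snps, n_voxels = 500_000, 100_000
alpha = 0.05                                  # assumed family-wise level

analyses = {
    "voxel/region QTL (1 ROI, all SNPs)": n_snps,
    "candidate SNP (1 SNP, all voxels)": n_voxels,
    "full cross (all SNPs x all voxels)": n_snps * n_voxels,
}
thresholds = {name: alpha / n_tests for name, n_tests in analyses.items()}

for name, n_tests in analyses.items():
    print(f"{name}: {n_tests:.0e} tests, Bonferroni p < {thresholds[name]:.0e}")
```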
Multivariate Regression
Y = X C + E
Y  Images (N × NV)
X  Genotypes (N × NG)
C  Regression Coefficients (NG × NV)
E  Error (N × NV)
•  Silly…
–  If N > NG, fit equivalent to NV univariate models fit independently
–  Much redundancy in C
•  rank{C} ≤ min(NV, NG) ≪ NV ∙ NG
N # subjects
NV # voxels/ROIs
NG # genes/SNPs
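The equivalence noted under "Silly…" can be checked numerically: with N > NG, the joint least-squares estimate of C has the same columns as NV separate univariate-response fits. The data below are simulated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, NG, NV = 50, 5, 8
X = rng.integers(0, 3, size=(N, NG)).astype(float)   # genotypes, N x NG
Y = rng.normal(size=(N, NV))                          # images, N x NV

# Joint multivariate fit: all NV responses at once.
C_joint, *_ = np.linalg.lstsq(X, Y, rcond=None)

# NV independent fits, one response column at a time.
C_separate = np.column_stack(
    [np.linalg.lstsq(X, Y[:, v], rcond=None)[0] for v in range(NV)]
)
print(C_joint.shape, np.allclose(C_joint, C_separate))  # (5, 8) True
```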
Reduced Rank Regression
Y = X B A + E
Y  Images (N × NV)
X  Genotypes (N × NG)
B  Genotype Coefficients (NG × r)
A  Image Coefficients (r × NV)
E  Error (N × NV)
•  Fix rank r
•  Approximate C ≈ BA
–  B & A each rank r
N # subjects
NV # voxels/ROIs
NG # genes/SNPs
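For the unweighted least-squares case, a rank-r fit can be obtained from the ordinary least-squares solution and an SVD of its fitted values. The sketch below uses that textbook construction on simulated data; it is an illustration, not the implementation behind the results in this talk.

```python
import numpy as np

def reduced_rank_regression(X, Y, r):
    """Rank-r least-squares fit of Y ~ X B A (identity weighting).

    Returns B (NG x r) and A (r x NV) so that C is approximated by B @ A.
    """
    C_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
    # Leading right singular vectors of the fitted values span the best
    # r-dimensional response subspace.
    _, _, Vt = np.linalg.svd(X @ C_ols, full_matrices=False)
    A = Vt[:r, :]          # r x NV, orthonormal rows
    B = C_ols @ A.T        # NG x r
    return B, A

rng = np.random.default_rng(2)
N, NG, NV, r = 100, 6, 10, 2
X = rng.normal(size=(N, NG))
C_true = rng.normal(size=(NG, r)) @ rng.normal(size=(r, NV))   # rank-2 truth
Y = X @ C_true + 0.1 * rng.normal(size=(N, NV))

B, A = reduced_rank_regression(X, Y, r)
print(B.shape, A.shape)  # (6, 2) (2, 10)
```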
Sparse Reduced Rank Regression
Y = X B A + E
Y  Images (N × NV)
X  Genotypes (N × NG)
B  Sparse Genotype Coefficients (NG × r)
A  Sparse Image Coefficients (r × NV)
E  Error (N × NV)
•  Fix rank r
•  Approximate C ≈ BA
–  B & A each rank r
•  Enforce sparsity
Vounou, M., Nichols, T. E., & Montana, G. (2010). Discovering genetic associations with high-dimensional neuroimaging phenotypes: A sparse reduced-rank regression approach. NeuroImage, 53(3), 1147-59.
N # subjects
NV # voxels/ROIs
NG # genes/SNPs
Sparse Reduced Rank Regression - Estimation
•  RRR
–  Y = X B A + E
–  For fixed rank r, find A & B that minimize
M = tr { (Y−XBA) Γ (Y−XBA)’ }
for some NV × NV matrix Γ, e.g. Γ = I
•  SRRR
–  For rank 1, find a & b that minimize
M = tr { (Y−Xba’) Γ (Y−Xba’)’ } + λa||a||1 + λb||b||1
–  Then subtract Xba’ from the data, and repeat
–  Need to specify final rank r, λa & λb
•  Can set λa & λb in terms of #|a|>0 & #|b|>0
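A rough sketch of one rank-1 sparse step follows. It is a generic soft-thresholded alternating scheme, written under the simplifying assumptions Γ = I and X’X ≈ I (columns of X centred and scaled to unit norm), with a and b renormalized at each step; it should not be taken as the exact algorithm of the cited paper. All names and the simulated data are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink each entry of z towards zero by t, setting small entries to 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_rank1(X, Y, lam_a, lam_b, n_iter=50):
    """Sparse rank-1 fit Y ~ d * X b a' by alternating soft-thresholding."""
    M = X.T @ Y                                        # NG x NV cross-product
    a = np.linalg.svd(M, full_matrices=False)[2][0]    # unit-norm starting value
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        b = soft_threshold(M @ a, lam_b / 2)
        if not b.any():
            return np.zeros_like(a), b, 0.0            # penalty removed everything
        b /= np.linalg.norm(b)
        a_new = soft_threshold(M.T @ b, lam_a / 2)
        if not a_new.any():
            return a_new, np.zeros_like(b), 0.0
        a = a_new / np.linalg.norm(a_new)
    return a, b, float(b @ M @ a)                      # d scales the rank-1 term

rng = np.random.default_rng(3)
N, NG, NV = 2000, 30, 20
X = rng.normal(size=(N, NG))
X -= X.mean(axis=0)
X /= np.linalg.norm(X, axis=0)                         # unit-norm columns
b_true = np.zeros(NG); b_true[:3] = 10.0               # 3 active SNPs
a_true = np.zeros(NV); a_true[:4] = 1.0                # 4 affected ROIs
Y = X @ np.outer(b_true, a_true) + rng.normal(size=(N, NV))

a_hat, b_hat, d_hat = sparse_rank1(X, Y, lam_a=12.0, lam_b=12.0)
print(np.flatnonzero(b_hat), np.flatnonzero(a_hat))
# Deflate before extracting the next rank: Y - d_hat * X @ np.outer(b_hat, a_hat)
```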
Simulation: Phenotype & SNPs
•  Simulated MRI data
–  ADNI T1 images through SPM5 VBM pipeline
–  NV = 111 ROIs, placed on VBM data from 189 MCI
ADNI subjects
•  GSK CIC Atlas, based on Harvard-Oxford atlas
–  Estimate covariance Σ after adjusting for age &
gender
–  Simulate ROI data (for arbitrary N) with covariance Σ
•  Evaluate with realistic genetic population w/
FREGENE
–  Simulates sequence-level data in large population
–  Provides 10K individuals, 20Mb chromosome (~180K
SNPs)
•  Chadeau-Hyam, et al. BMC Bioinformatics, 9:364, 2009
Simulation: Phenotype & SNPs
•  FREGENE SNP simulation
–  Population of 10,000 evolved over 200,000 generations
–  20Mb simulated
–  37,748 SNPs with MAF>0.05
–  Select k=10 causative SNPs
•  From all possible having MAF=0.2
•  Used to induce phenotypic effect
–  But then dropped from consideration
•  Represents realistic setting, where causative SNP is not seen, but effect captured through local LD
–  From population of 10,000, repeatedly sample cohorts of size N
•  Simulated association in MRI data
–  Add genetic effect to Frontal and Temporal ROIs with causative SNPs
•  γ = 0.06, 0.08, or 0.1 reduction in mean GM in affected ROI
•  Calibrated to Filippini et al. (2009)
–  10% reduction in GM in ApoE ε4/ε4 subjects relative to subjects with no ε4 alleles
[Figure: founding population in Africa; Out of Africa (OoA) split & bottleneck; Asian & European split; expansions]
Chadeau-Hyam, et al. BMC Bioinformatics, 9:364, 2009
FREGENE: Evolutionary model of world population
•  Linkage disequilibrium (LD)
–  SNPs not independent
–  Highly structured, heterogeneous dependence
•  Population sub-structure
–  Ethnic differences & migration patterns induce systematic variation
•  Multivariate analysis
–  Want realistic multivariate structure in our simulations
The Wellcome Trust Case Control Consortium, Nature 447, 661-678, 2007.
Why try so hard?
Why not rand{0,1,2}^500,000 ?
Realistic Phenotype
•  All pairwise GM
correlations
among NV = 111
ROIs
Realistic Genotypes
•  Correlation of first
1000 simulated
SNPs
Simulation Setting:
Horse shoes & Imaging Genetics
•  “True positive” with
missing causative
SNP
–  Declare true positive
if LD coefficient close
enough
•  LD-linked SNPs
–  Of 1990 SNPs
–  51 linked (r>0.8) to one or more of the 10 causative SNPs
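The "close enough" rule can be sketched as a correlation check. This assumes the LD coefficient is the Pearson correlation r between 0/1/2 genotype vectors, with the r > 0.8 cut-off from the slide; the toy genotypes and function names are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length genotype vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def is_true_positive(selected_snp, causative_snps, r_min=0.8):
    """A selected SNP counts as a true positive if it is LD-linked to any causative SNP."""
    return any(pearson_r(selected_snp, c) > r_min for c in causative_snps)

causative = [0, 1, 2, 0, 1, 0, 2, 1, 0, 0]
linked = [0, 1, 2, 0, 1, 0, 2, 1, 0, 1]      # differs in one subject only
unlinked = [1, 0, 0, 2, 0, 1, 0, 0, 2, 1]

print(round(pearson_r(causative, linked), 3))   # 0.924
print(is_true_positive(linked, [causative]))    # True
print(is_true_positive(unlinked, [causative]))  # False
```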
SRRR Simulation Results
•  Power to detect 1 or more SNPs (NG=1990)
•  For ranks r = 1,2,3 dominates Mass Uni.
–  Better for higher r
SRRR Simulation Results
•  Power to detect 1 or more SNPs (NG=1990)
•  For ranks r = 1,2,3 dominates Mass Uni.
–  Better for higher r; here r = 3, high eff. size.
SRRR Simulation Results
•  Power to detect 1 or more ROIs
•  Less difference
–  Power can be manipulated by varying λ by
rank
SRRR: Multivariate vs. MassUnivariate
•  Does this NG=1990
result generalize?
•  For up to 40k SNPs
–  r = 3, med. effect size, N=1000
–  Power 2-5 times greater
–  Absolute power still tiny
SRRR Simulation Results
•  Power to detect 1 or more SNPs (NG=1990)
•  For ranks r = 1,2,3 dominates Mass Uni.
–  Better for higher r; here r = 3
Sparse Reduced Rank Regression
for SNP – MRI Association
•  Detailed simulation of imaging & genetic correlation structure
–  Suggests multivariate approach will outperform mass-univariate
–  Power tiny, in any event
•  Much work to do
–  Haven’t addressed how to optimize phenotype
–  Haven’t tried to estimate penalty parameters
λa, λb or r
•  Currently investigating stability selection
–  See #316 Le Floch et al
Conclusions
•  VBM
–  Powerful, automated anatomical analysis
–  Need careful raw data, preprocessing & model QC
•  Imaging Genetics
–  Mash-up of two large-data, massive multiple testing problems
•  Candidate SNP VBM
–  Given a SNP, just like a traditional imaging analysis
–  Multiple SNPs possible too, but need combining
methods
•  Multivariate Sparse Reduced Rank Regression
–  Promising, but little power unless have 1,000’s of
subjects