Modelling Genetic Variation in the Brain
Thomas Nichols, PhD
Department of Statistics, Warwick Manufacturing Group, University of Warwick
joint with
Becky Inkster, Institute of Psychiatry, King’s College London (GSK3β & WNT pathway VBM)
Maria Vounou, Giovanni Montana, Statistics Section, Dept. of Mathematics, Imperial College (Sparse Reduced Rank Regression)

Outline
• Background
  – Structural brain imaging & VBM
  – Genetics
  – “Imaging Genetics”
• Candidate SNP VBM
• Multivariate SNP analyses

Neuroimaging Background: Structural Brain Image Data
• Morphometry
  – Quantification of shape/volume of brain structures
• Traditional Morphometric Analysis
  – Laborious hand-tracing of structures
  – Accurate, but imperfect inter-rater reliability
• Voxel Based Morphometry
  – Automated morphometry method

Voxel Based Morphometry
Original MRI → Segment → Warp to Atlas Space → Modulate → Smooth
[Images, subject space: T1-weighted MRI; Gray Matter]

Voxel Based Morphometry: Modulation
• Gives units of subject GM volume in atlas space
• Allows analysis in common space while retaining individual differences
[Images, atlas space: T1-weighted MRI; Gray Matter; Modulated Gray Matter]

Voxel Based Morphometry: Smoothing
• Accounts for imperfect registration of individuals to atlas
  – Even identical twins have different cortical foldings
  – Exact match impossible
• Discards fine spatial details in exchange for reduced noise
  – Generally searching for moderate scale differences

Done!
• 3D image is n=1
  – A single (100,000-dimensional) phenotypic measurement on 1 individual
[Images, atlas space: Modulated Gray Matter; Smoothed, Modulated GM]

Genetics Background
• Genotype
  – The genetic constitution of an organism or cell
  – 46 chromosomes in humans
  – 23 pairs of homologous chromosomes
    • One of each pair from each parent
• Gene
  – A series of base pairs (DNA bits) which code for a trait
  – Four different possible bases, the nucleotides
    • Adenine, Thymine, Cytosine, & Guanine

Genetics Background
• Single Nucleotide Polymorphisms (SNP)
  – Locations where single base-pair differences have been found in the population
• SNP Example
  – If some of the population has sequence… AATGTGATAGCTT
  – And if the remainder has… AATGTGACAGCTT
  – We have found a SNP!
• SNP data
  – Homologous chromosomes
  – For each SNP, for each individual: 0, 1 or 2 count

Genetics Background
• Millions of SNPs
• Thanks to correlation (linkage disequilibrium), only need ≈500k to “tag” all variation
[Figure: three bar charts by chromosome (1–22, X, Y): number of base pairs per chromosome (total 3,079,843,747 base pairs); number of SNPs per chromosome (total 20,296,765 SNPs); number of genes per chromosome (total 32,185 genes)]

Genetics Background
• SNPs vs.
genes
  – Each gene often has several variants
  – 1 or more (but not many) SNPs typically needed to identify a gene
  – SNPs may not lie directly on coding portion of gene
    • Due to linkage disequilibrium (correlation), close is good enough
    • Non-coding, regulatory region may be causal
[Figure: SNPs and exons by location on chromosome]

Imaging Genetics
• Motivation
  – Brain structure heritable
  – Objective, reproducible phenotype
    • Important in psychiatry
      – Current best measures are coarse, with weak reproducibility
        » e.g. HAM-D (depression), MMSE (cognition, AD)
  – Sensitive
    • Brain anatomy/function closer to disease process than other measures
  – Use to corroborate other findings
    • E.g. large WGA finds modest significance: use brain imaging to build confidence in finding

  Brain Phenotype              | h2
  Whole brain volume           | 0.78
  Total gray matter volume     | 0.88
  Total white matter volume    | 0.85
  Glahn, Thompson, Blangero. Hum Brain Mapp 28:488-501, 2007

[Figure: Thickness of Cortical GM (r2). Thompson et al, Nature Neuro, 4(12):1253-1258, 2001]
[Figure: Heritability of GM Thickness (h2 & corrected P-value). Thompson & Toga, Annals of Medicine 34(7-8):523-36, 2002]

Imaging Genetics Menu
• Imaging: Candidate ROI, Many ROI, Voxelwise
• Genetics: Candidate SNP, Candidate Gene, Genome-wide SNP, Genome-wide Gene
• Examples
  – [Filippini et al. 2009] 29,812 voxels, 1 SNP
  – [Joyner et al. 2009] 4 ROIs, 11 SNPs
  – [Potkin et al. 2009] 1 BOLD ROI, 317,503 SNPs
  – [Stein et al. 2010] 31,622 voxels, 448,293 SNPs
  – [Hibar et al. 2011] 31,622 voxels, 18,044 SNPs
(Jason Stein/Andy Saykin/Bertrand Thirion)

GSK3β Background
• High heritability of depression (Kendler et al., 2006; Sullivan et al., 2000).
• Meta-analytical evidence from MRI studies for a role of hippocampal integrity in depression (Campbell et al., 2004).
• There is strong genetic regulation of neurodevelopment (reviewed by Wilson and Rubenstein, 2000; O’Leary et al., 2002).
• The Wnt signaling pathway is one network of proteins that plays a role in embryogenesis
• GSK3β plays a key role in Wnt pathway
[Figure: Wnt Signaling Pathways; regulates the development of the hippocampus]

GSK HiTDIP Study
• Major Depressive Disorder (MDD) Association Study
  – “High Throughput Human Disease Specific Targets”
  – 7,000 SNPs covering 2,000 genes with tractable targets
  – 1000 cases, 1000 controls
• Imaging Subset
  – 200 cases, 200 controls (of 1000 & 1000) scanned with anatomical MRI protocol
  – Optimized VBM with SPM5’s segmentation tool
  – 324 images passed QC
    • 366 subjects’ data delivered
    • 42 subjects set aside (clinical exclusion, pathologies or failed segmentation)
• Glycogen synthase kinase 3β (GSK3β)
  – Plays key role in WNT pathway, influential in development

Modelling Candidate SNPs
• Mass Univariate Modelling
  – Fit same univariate linear model at each voxel
• Quantitative Trait Multiple Regression
  – Linear model fit at each voxel
• Regressors
  – Genetic
  – Group (Case/Control)
  – Demographic / nuisance variables

SNP Models for Gray Matter Data
• Dominant
• Recessive
• Additive
• Genotypic
[Figure: four plots, one per model, of Gray Matter Volume (Y) against SNP Count (Xj = 0, 1, 2)]

Mass Univariate Modelling: Genetic Effects
• Concerns about leverage/influence
  – 100’s not 1000’s of subjects
  – Rare SNP can make a few subjects very influential
    • An ever-greater problem as sample size shrinks
  – 100 subjects + 10% MAF → 1 subject with rare genotype expected! (100 × 0.1² = 1 rare homozygote under Hardy-Weinberg equilibrium)
[Figure: Gray Matter Volume (Y) against Allele Count (Xj = 0, 1, 2)]

Mass Univariate Modelling: Genetic Effects
• Ad hoc solution
  – If expected rare genotype frequency <10%, merge genotypes
• If MAF > 0.31 (≈ √0.1, i.e. expected rare homozygote frequency MAF² > 10%)
  – 2DF Genotypic model
    • Additive + Nonadditive Parameterization
      – Additive [ -1 0 +1 ]
      – Nonadditive [ -1/2 +1 -1/2 ] (orthogonalize w.r.t.
additive regressor)
• If MAF < 0.31
  – Use dominant/recessive model
[Figure legend: tested / not tested]

Mass Univariate Modelling: Nuisance Effects
• Age & Gender
  – Substantial normal variation in GM w/ Age
• Total Gray matter
  – Accounts for differences in head size
  – Discounts global changes to find localized changes
• Scanner (Pre/Post Upgrade)
  – Upgrade 2/3-through study altered image contrast
• Medication (Yes/No, for cases only)
  – Neurotrophic effects reported for some Rx

Model Diagnosis for Imaging
• Why bother?
  – Largish n, continuous data, Central Limit Theorem should carry us
  – Type I Error generally OK due to robustness of t-test/ANOVA-like models
• Sensitivity!
  – Decreased sensitivity due to inflated error variance σ²
  – Suboptimal sensitivity due to non-normality
• How!?
  – 100,000 voxels, 400 subjects
  – 100,000 QQ plots to look at all 40 million data points?
[Images: Failed GM segmentation due to data formatting error; Warping artefacts seen in modulated GM]

Model Diagnosis for Imaging
• Model summaries
  – Images of diagnostic stats
• Scan summaries
  – Vectors of ad hoc measures
• Dynamic graphical tool
  – Explore many summaries simultaneously
  – Easily jump from summary image to plots, from plots to residual images
• End Result
  – Swiftly localize and understand problems
Luo & Nichols NeuroImage 19:1014–1032, 2003

Model Summaries
  Statistic      | Assesses     | Null Distn
  Cook-Weisberg  | Var(εi) = σ² | Chi-Squared
  Shapiro-Wilk   | ε ~ Normal   | (tabulated)
  Outlier Count  | Artifacts    | Binomial
  Std. Deviation | Artifacts    |

Scan Summaries
  Summary | Interpretation
  Global intensity | Whole-brain signals or artifacts
  Outlier Count | Artifacts
  Any preprocessing parameters e.g.
head size | Suggests cause of artifacts
  Experimental predictors | For investigating mismodelled signal in residuals
http://go.warwick.ac.uk/tenichols/software

Model Diagnosis w/ SPMd
[Screenshot panels: Model Summaries; Scan Summaries; Model Detail; Scan Detail]

Outline / Motivation
• Data
  – Intro to Voxel Based Morphometry data
• Model
  – Quantitative trait regression w/ Mass Univariate Model
• Diagnosis
  – 100,000 Q-Q plots anyone?
• Inference
  – Cluster size under nonstationarity
  – Candidate screening procedure
• Results
  – GSK3β in MDD
• Future Directions

Inference On Images: Voxel-wise vs. Cluster-wise
• Voxel-wise
  – Reject H0, point-by-point, by statistic magnitude
• Cluster-wise
  – Define contiguous blobs with arbitrary threshold u_clus
  – Reject H0 for each cluster larger than k_α
[Figure: statistic image plotted over space and thresholded at u_clus; a cluster smaller than k_α is not significant, a cluster larger than k_α is significant]

Cluster Inference & Stationarity
• Cluster-wise preferred over voxel-wise
  – Generally more sensitive (Friston et al, NeuroImage 4:223-235, 1996)
  – Spatially-extended signals typical
• Problem w/ VBM
  – Standard cluster methods assume stationarity, i.e. constant smoothness
  – If stationarity is wrongly assumed, false positive clusters will be found in extra-smooth regions
  – VBM noise very non-stationary
• Nonstationary cluster inference
  – Must un-warp nonstationarity
  – Reported but not implemented
    • Hayasaka et al, NeuroImage 22:676–687, 2004
  – Now available as SPM toolbox
    • http://fmri.wfubmc.edu/cms/software#NS
[Figures: VBM: Image of FWHM Noise Smoothness; Nonstationary noise… …warped to stationarity]

Inference in Imaging Genetics: Creeping Multiple Testing Problem
• Even just with candidate analyses, can end up searching over…
  – Genes
  – SNPs within a gene
  – Space (voxels or clusters)
  – Different contrasts on GLM
    • Main effect? By clinical subgroup? Interactions?
• Can quickly lose confidence in results
  – E.g.
0.005 FWE-corrected is great…
    …Unless it’s the 25th statistic image you’ve seen

Inference in Imaging Genetics: Multiple Testing Strategy
• Define strict primary outcome
  – For given gene, use single SNP
    • Best (large) association study significance, otw
    • Best nonsynonymous exonic available, otw
    • Best 5 intronic available
  – For each SNP, only consider main effect of gene
    • If fitting gene x group interaction, test for average effect
      – Any association is more likely than a disease-specific association
      – Even if disease-specific association, opposing sign of effect unlikely w/ VBM
  – 1-number summary per gene
    • Minimum nonstationary cluster FWE-corrected P-value for association (1 DF F-stat)
  – Bonferroni correction for number of genes
• Primary outcomes then have strong FWE control
  – Over brain, over genes
  – (1-α)100% confidence of no false positives anywhere
• Secondary outcomes
  – Interactions, sub-group results
  – Use same FWE-inferences, but mark as post-hoc

Results: Model Diagnosis
Outlier Detection with Shapiro-Wilk
[Images: -log10 P Shapiro-Wilk (two outliers); Mean Smoothed Mod. GM]

Results: Model Diagnosis
Characterising Outliers with Standardized Residual Images
[Images: standardized residuals for Subject 193, Subject 194 (outlier), Subject 195; Subject 194 raw T1]
• Note: Compare standardized residuals to +/-6.128 (Bonferroni for 324 images, each with 173,823 voxels, at each a 2-sided test)
• Severe enlargement of inferior horn of lateral ventricle

Results: Outlier Exploration
Inferior Horn of Lateral Ventricle
• In most of us, this is a pencil-lead-thick fluid-filled space
• In this subject it was pencil-thick
• Clinical collaborator verified it as abnormal & subject was removed
[Images: Subject 194 (outlier); randomly selected control]

GSK3β and Structural Differences
• 2 SNPs in strong linkage disequilibrium showed significant associations with GM differences in MDD patients: rs6438552, rs12630592
[Image: Brain regions where SNP clusters show colocalization.]
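
The ±6.128 comparison value quoted for the standardized residual images can be reproduced with a short calculation. This is a sketch under two assumptions not stated on the slide: a family-wise level of α = 0.05, and a standard Normal reference distribution for the standardized residuals.

```python
from statistics import NormalDist

n_images = 324        # images that passed QC (from the slide)
n_voxels = 173_823    # voxels per image (from the slide)
alpha = 0.05          # ASSUMPTION: family-wise error level, not stated on the slide

# Bonferroni: one two-sided test per voxel per image.
n_tests = n_images * n_voxels
p_per_tail = alpha / (2 * n_tests)

# ASSUMPTION: standardized residuals are treated as standard Normal.
threshold = -NormalDist().inv_cdf(p_per_tail)

print(f"{n_tests:,} tests, compare |standardized residual| to {threshold:.3f}")
```

That the result agrees with the slide's value to three decimals suggests, but does not prove, that α = 0.05 was the level used.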
GSK3β-Gray Matter association in bilateral superior temporal gyrus (STG) and right hippocampus
• AA genotype group associated with decreased GM concentration in right STG
  – P = 0.0004 (corrected for whole brain search and multiple SNP testing)
• rs6438552 is a putative functional SNP, i.e. it regulates the selection of splice acceptor sites in vitro.
Inkster, B., Nichols, T. E., Saemann, P. G., Auer, D. P., Holsboer, F., Muglia, P., & Matthews, P. M. (2009). Association of GSK3β Polymorphisms With Brain Structural Changes in Major Depressive Disorder. Archives of General Psychiatry, 66(7), 721-728.

Wnt Signaling Pathways
[Pathway diagram: WNT3A, FZD3, KRM1, DVL2, CTNNB1, AXIN2, TCF4, LEF1, SMAD1, PPARGC1A, EMX2, ZEB2; regulates the development of the hippocampus]

WNT pathway genes
[Brain maps: ZEB2, FZD3, DVL2, AXIN2, GSK3β, SMAD1, PPARGC1A, EMX2]
Inkster, B., Nichols, T. E., Saemann, P. G., Auer, D. P., Holsboer, F., Muglia, P., & Matthews, P. M. (2010). Pathway-based approaches to imaging genetics association studies: WNT signaling, GSK3beta substrates and major depression. NeuroImage, 53(3), 908-917.

• Voxel/Region QTL
  – Whole genome association
  – Must have right ROI
  – 500,000 SNPs ≈ 10^6 tests
• Candidate SNP
  – Full image result
  – Must have right SNP
  – 100,000 voxels
• Full cross analysis
  – 500,000 SNPs × 100,000 voxels ≈ 10^10 tests!
  – Massive multiple testing problem!
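
The test counts for the three analysis types follow from simple multiplication. The sketch below tabulates them together with the per-test level a plain Bonferroni correction would demand; the α = 0.05 family-wise level and the use of Bonferroni here are illustrative assumptions, and the slide's powers of ten are order-of-magnitude roundings of these products.

```python
import math

n_snps = 500_000      # tag SNPs (from the slide)
n_voxels = 100_000    # voxels per image (from the slide)
alpha = 0.05          # ASSUMPTION: illustrative family-wise level

# Number of univariate tests implied by each analysis type.
analyses = {
    "voxel/region QTL": n_snps * 1,      # every SNP, one ROI
    "candidate SNP": 1 * n_voxels,       # one SNP, every voxel
    "full cross": n_snps * n_voxels,     # every SNP at every voxel
}

for name, n_tests in analyses.items():
    exponent = math.log10(n_tests)
    print(f"{name:18s} {n_tests:>15,d} tests (10^{exponent:.1f}), "
          f"Bonferroni per-test level {alpha / n_tests:.1e}")
```

For the full cross analysis the per-test level is about 1e-12, which is the sense in which the multiple testing problem is massive.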
[Slide title: Possible Mass-Univariate Analyses. Candidate SNP analysis over 100,000 voxels ≈ 10^5 tests]

Multivariate Regression
  Y = X C + E
    Y: Images (N × NV)
    X: Genotypes (N × NG)
    C: Regression Coefficients (NG × NV)
    E: Error (N × NV)
  N: # subjects; NV: # voxels/ROIs; NG: # genes/SNPs
• Silly…
  – If N > NG, fit equivalent to NV univariate models fit independently
  – Much redundancy in C
    • rank{C} ≤ min(NV, NG) ≪ NV ∙ NG

Reduced Rank Regression
  Y = X B A + E
    B: Genotype Coefficients (NG × r)
    A: Image Coefficients (r × NV)
• Fix rank r
• Approximate C ≈ BA
  – B & A each rank r

Sparse Reduced Rank Regression
  Y = X B A + E
    B: Sparse Genotype Coefficients (NG × r)
    A: Sparse Image Coefficients (r × NV)
• Fix rank r
• Approximate C ≈ BA
  – B & A each rank r
• Enforce sparsity
Vounou, M., Nichols, T. E., & Montana, G. (2010). Discovering genetic associations with high-dimensional neuroimaging phenotypes: A sparse reduced-rank regression approach. NeuroImage, 53(3), 1147-59.

Sparse Reduced Rank Regression - Estimation
• RRR
  – Y = X B A + E
  – For fixed rank r, find A & B that minimize
      M = tr { (Y−XBA) Γ (Y−XBA)’ }
    for some NV × NV matrix Γ, e.g. Γ = I
• SRRR
  – For rank 1, find a & b that minimize
      M = tr { (Y−Xba’) Γ (Y−Xba’)’ } + λa||a||1 + λb||b||1
  – Then subtract Xba’ from the data, and repeat
  – Need to specify final rank r, λa & λb
    • Can set λa & λb in terms of #|a|>0 & #|b|>0

Simulation: Phenotype & SNPs
• Simulated MRI data
  – ADNI T1 images through SPM5 VBM pipeline
  – NV = 111 ROIs, placed on VBM data from 189 MCI ADNI subjects
    • GSK CIC Atlas, based on Harvard-Oxford atlas
  – Estimate covariance Σ after adjusting for age & gender
  – Simulate ROI data (for arbitrary N) with covariance Σ
• Evaluate with realistic genetic population w/ FREGENE
  – Simulates sequence-level data in large population
  – Provides 10K individuals, 20Mb chromosome (~180K SNPs)
    • Chadeau-Hyam, et al.
BMC Bioinformatics, 9:364, 2009

Simulation: Phenotype & SNPs
• FREGENE SNP simulation
  – Population of 10,000 evolved over 200,000 generations
  – 20Mb simulated
  – 37,748 SNPs with MAF>0.05
  – Select k=10 causative SNPs
    • From all possible having MAF=0.2
    • Used to induce phenotypic effect
      – But then dropped from consideration
    • Represents realistic setting, where causative SNP is not seen, but effect captured through local LD
  – From population of 10,000, repeatedly sample cohorts of size N
• Simulated association in MRI data
  – Add genetic effect to Frontal and Temporal ROIs with causative SNPs
    • γ = 0.06, 0.08, or 0.1 reduction in mean GM in affected ROI
    • Calibrated to Filippini et al. (2009)
      – 10% reduction in GM in ApoE ε4/ε4 subjects relative to subjects with no ε4 alleles

FREGENE: Evolutionary model of world population
[Diagram labels: Founding population in Africa; Out of Africa (OoA) split & bottleneck; Asian & European split; Expansion (three times). Chadeau-Hyam, et al. BMC Bioinformatics, 9:364, 2009]

Why try so hard? Why not rand{0,1,2}^500,000 ?
• Linkage disequilibrium (LD)
  – SNPs not independent
  – Highly structured, heterogeneous dependence
• Population sub-structure
  – Ethnic differences & migration patterns induce systematic variation
• Multivariate analysis
  – Want realistic multivariate structure in our simulations
The Wellcome Trust Case Control Consortium, Nature 447, 661-678, 2007.

Realistic Phenotype
• All pairwise GM correlations among NV = 111 ROIs

Realistic Genotypes
• Correlation of first 1000 simulated SNPs

Simulation Setting: Horse shoes & Imaging Genetics
• “True positive” with missing causative SNP
  – Declare true positive if LD coefficient close enough
• LD-linked SNPs
  – Of 1990 SNPs
  – 51 linked (r>0.8) to one or more of the 10 causative SNPs

SRRR Simulation Results
• Power to detect 1 or more SNPs (NG=1990)
• For ranks r = 1,2,3 dominates Mass Uni.
  – Better for higher r

SRRR Simulation Results
• Power to detect 1 or more SNPs (NG=1990)
• For ranks r = 1,2,3 dominates Mass Uni.
  – Better for higher r; here r = 3, high eff. size.

SRRR Simulation Results
• Power to detect 1 or more ROIs
• Less difference
  – Power can be manipulated by varying λ by rank

SRRR: Multivariate vs. Mass-Univariate
• Does this NG=1990 result generalize?
• For up to 40k SNPs
  – r = 3, med. effect size, N=1000
  – Power 2-5 times greater
  – Absolute power still tiny

SRRR Simulation Results
• Power to detect 1 or more SNPs (NG=1990)
• For ranks r = 1,2,3 dominates Mass Uni.
  – Better for higher r; here r = 3

Sparse Reduced Rank Regression for SNP – MRI Association
• Detailed simulation of imaging & genetic correlation structure
  – Suggests multivariate approach will outperform mass-univariate
  – Power tiny, in any event
• Much work to do
  – Haven’t addressed how to optimize phenotype
  – Haven’t tried to estimate penalty parameters λa, λb or r
    • Currently investigating stability selection
      – See #316 Le Floch et al

Conclusions
• VBM
  – Powerful, automated anatomical analysis
  – Need careful raw data, preprocessing & model QC
• Imaging Genetics
  – Mash-up of two large-data, massive multiple testing problems
• Candidate SNP VBM
  – Given a SNP, just like a traditional imaging analysis
  – Multiple SNPs possible too, but need combining methods
• Multivariate Sparse Reduced Rank Regression
  – Promising, but little power unless have 1,000’s of subjects
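
The rank-one SRRR step from the Estimation slide (find sparse a and b, then subtract the fitted term and repeat) can be illustrated with a small NumPy sketch. It is not the estimator of Vounou et al.: it is an alternating soft-thresholding heuristic in the style of penalized matrix decompositions, and it assumes Γ = I, unit-norm columns of X with X'X treated as approximately the identity, and a and b rescaled to unit length at each step. The function names, the toy data and the penalty values are all hypothetical.

```python
import numpy as np


def soft_threshold(v, t):
    """Elementwise soft-thresholding: shrink towards zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def unit(v):
    """Rescale v to unit Euclidean length; leave an all-zero vector alone."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else v


def sparse_rank_one(X, Y, lam_a, lam_b, n_iter=50, tol=1e-10):
    """Heuristic sparse rank-one fit Y ~ d * X b a' (Gamma = I, X'X ~ I assumed)."""
    K = X.T @ Y                      # NG x NV cross-product matrix
    a = np.linalg.svd(K)[2][0]       # start at the leading right singular vector
    b = np.zeros(K.shape[0])
    for _ in range(n_iter):
        b_new = unit(soft_threshold(K @ a, lam_b / 2))        # genotype weights
        a_new = unit(soft_threshold(K.T @ b_new, lam_a / 2))  # image weights
        change = np.linalg.norm(a_new - a) + np.linalg.norm(b_new - b)
        a, b = a_new, b_new
        if change < tol:
            break
    d = b @ K @ a                    # scale of the rank-one term
    return a, b, d


# Toy data (hypothetical): SNPs 0 and 1 shift the mean of ROIs 0, 1 and 2.
rng = np.random.default_rng(0)
N, NG, NV = 300, 20, 8
G = rng.binomial(2, 0.3, size=(N, NG)).astype(float)  # 0/1/2 allele counts
Y = rng.normal(size=(N, NV))
Y[:, :3] += (G[:, 0] + G[:, 1])[:, None]

X = G - G.mean(axis=0)
X /= np.linalg.norm(X, axis=0)       # centred, unit-norm genotype columns
Y = Y - Y.mean(axis=0)

a, b, d = sparse_rank_one(X, Y, lam_a=10.0, lam_b=20.0)
print("selected SNPs:", np.flatnonzero(b))
print("selected ROIs:", np.flatnonzero(a))

# Deflation for the next rank: subtract the fitted rank-one term.
Y_deflated = Y - d * np.outer(X @ b, a)
```

Larger λa and λb select fewer ROIs and SNPs, which is why the slide suggests choosing them by the desired number of non-zero entries in a and b.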