Modelling Genetic Variation in the Brain

Thomas Nichols, PhD
Department of Statistics, Warwick Manufacturing Group
University of Warwick

joint with
Becky Inkster, Institute of Psychiatry, King’s College London (GSK3β & WNT pathway VBM)
Maria Vounou, Giovanni Montana, Statistics Section, Dept. of Mathematics, Imperial College (Sparse Reduced Rank Regression)

Outline
•  Background
   –  Structural brain imaging & VBM
   –  Genetics
   –  “Imaging Genetics”
•  Candidate SNP VBM
•  Multivariate SNP analyses

Neuroimaging Background: Structural Brain Image Data
•  Morphometry
   –  Quantification of shape/volume of brain structures
•  Traditional Morphometric Analysis
   –  Laborious hand-tracing of structures
   –  Accurate, but imperfect inter-rater reliability
•  Voxel Based Morphometry
   –  Automated morphometry method

Voxel Based Morphometry
Original MRI → Segment → Warp to Atlas Space → Modulate → Smooth
[Figure panels, subject space: T1-weighted MRI; Gray Matter]

Voxel Based Morphometry: Modulation
Original MRI → Segment → Warp to Atlas Space → Modulate → Smooth
•  Gives units of subject GM volume in atlas space
•  Allows analysis in common space while retaining individual differences
[Figure panels, atlas space: T1-weighted MRI; Gray Matter; Modulated Gray Matter]

Voxel Based Morphometry: Smoothing
Original MRI → Segment → Warp to Atlas Space → Modulate → Smooth
•  Accounts for imperfect registration of individuals to atlas
   –  Even identical twins have different cortical foldings
   –  Exact match impossible
•  Discards fine spatial details in exchange for reduced noise
   –  Generally searching for moderate scale differences
•  Done!
•  3D image is n=1
   –  A single (100,000-dimensional) phenotypic measurement on 1 individual
[Figure panels, atlas space: Modulated Gray Matter; Smoothed, Modulated GM]

Genetics Background
•  Genotype
   –  The genetic constitution of an organism or cell
   –  46 chromosomes in humans
   –  23 pairs of homologous chromosomes
      •  One from each parent
•  Gene
   –  A series of basepairs (DNA bits) which code for a trait
   –  Four different possible basepairs, the nucleotides
      •  Adenine, Thymine, Cytosine, & Guanine

Genetics Background
•  Single Nucleotide Polymorphisms (SNP)
   –  Locations where single base-pair differences have been found in the population
•  SNP Example
   –  If some of the population has sequence… AATGTGATAGCTT
   –  And if remaining has… AATGTGACAGCTT
   –  We have found a SNP!
•  SNP data
   –  Homologous chromosomes
   –  For each SNP, for each individual: 0, 1 or 2 count

Genetics Background
•  Millions of SNPs
•  Thanks to correlation (linkage disequilibrium), only need ≈500k to “tag” all variation
[Figure: three bar charts by chromosome (1 to 22, X, Y): number of SNPs per chromosome (20,296,765 SNPs in total); number of genes per chromosome (32,185 genes in total); number of basepairs per chromosome (3,079,843,747 base pairs in total)]

Genetics Background
•  SNPs vs. genes
   –  Each gene often has several variants
   –  1 or more (but not many) SNPs typically needed to identify a gene
   –  SNPs may not lie directly on coding portion of gene
      •  Due to linkage disequilibrium (correlation), close is good enough
      •  Non-coding, regulatory region may be causal
[Figure: SNPs and exons by location on chromosome]

Imaging Genetics
•  Motivation
   –  Brain structure heritable
   –  Objective, reproducible phenotype
      •  Important in psychiatry
         –  Current best measures are coarse, with weak reproducibility
            »  e.g. HAM-D (depression), MMSE (cognition, AD)
   –  Sensitive
      •  Brain anatomy/function closer to disease process than other measures
   –  Use to corroborate other findings
      •  E.g. large WGA finds modest significance; use brain imaging to build confidence in finding

   Brain Phenotype              h²
   Whole brain volume           0.78
   Total gray matter volume     0.88
   Total white matter volume    0.85
   Glahn, Thompson, Blangero. Hum Brain Mapp 28:488-501, 2007

[Figure: Thickness of Cortical GM (r²)]
Thompson et al, Nature Neuro, 4(12):1253-1258, 2001
[Figure: Heritability of GM Thickness (h² & corrected P-value)]
Thompson & Toga, Annals of Medicine 34(7-8):523-36, 2002

Imaging Genetics Menu
•  Imaging: Candidate ROI; Many ROI; Voxelwise
•  Genetics: Candidate SNP; Candidate Gene; Genome-wide SNP; Genome-wide Gene
•  Example studies
   –  [Filippini et al. 2009] 29,812 voxels, 1 SNP
   –  [Joyner et al. 2009] 4 ROIs, 11 SNPs
   –  [Potkin et al. 2009] 1 BOLD ROI, 317,503 SNPs
   –  [Stein et al. 2010] 31,622 voxels, 448,293 SNPs
   –  [Hibar et al. 2011] 31,622 voxels, 18,044 SNPs
(Jason Stein/Andy Saykin/Bertrand Thirion)

Outline
•  Background
   –  Structural brain imaging & VBM
   –  Genetics
   –  “Imaging Genetics”
•  Candidate SNP VBM
•  Multivariate SNP analyses

GSK3β Background
•  High heritability of depression (Kendler et al., 2006; Sullivan et al., 2000).
•  Meta-analytical evidence from MRI studies for a role of hippocampal integrity in depression (Campbell et al., 2004).
•  There is strong genetic regulation of neurodevelopment (reviewed by Wilson and Rubenstein, 2000; O’Leary et al., 2002).
•  The Wnt signaling pathway is one network of proteins that plays a role in embryogenesis
•  GSK3β plays a key role in Wnt pathway
[Figure: Wnt Signaling Pathways; regulates the development of the hippocampus]

GSK HiTDIP Study
•  Major Depressive Disorder (MDD) Association Study
   –  “High Throughput Human Disease Specific Targets”
   –  7,000 SNPs covering 2,000 genes with tractable targets
   –  1000 cases, 1000 controls
•  Imaging Subset
   –  200 cases, 200 controls (of 1000 & 1000) scanned with anatomical MRI protocol
   –  Optimized VBM with SPM5’s segmentation tool
   –  324 images passed QC
      •  366 subjects’ data delivered
      •  42 subjects set aside (clinical exclusion, pathologies or failed segmentation)
•  Glycogen synthase kinase 3β (GSK3β)
   –  Plays key role in WNT pathway, influential in development

Modelling Candidate SNPs
•  Mass Univariate Modelling
   –  Fit same univariate linear model at each voxel
•  Quantitative Trait Multiple Regression
   –  Linear model fit at each voxel
•  Regressors
   –  Genetic
   –  Group (Case/Control)
   –  Demographic / nuisance variables

SNP Models for Gray Matter Data
•  Dominant
•  Recessive
•  Additive
•  Genotypic
[Figure: four panels, one per model, each plotting Gray Matter Volume (Y) against SNP count Xj (0, 1, 2)]

Mass Univariate Modelling: Genetic Effects
•  Concerns about leverage/influence
   –  100’s not 1000’s of subjects
   –  Rare SNP can make a few subjects very influential
      •  An ever-greater problem as sample size shrinks
      •  100 subjects + 10% MAF → 1 subject with rare genotype expected!
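The SNP codings and the leverage arithmetic on these slides can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: it assumes the SNP count is the number of minor alleles, that genotype frequencies follow Hardy-Weinberg equilibrium, and that the genotypic model is coded with two indicator columns (one of several equivalent 2-DF parameterizations). The function names are hypothetical.

```python
def code_snp(count, model):
    """Map a minor-allele count (0, 1 or 2) to regressor value(s) for one SNP model."""
    if count not in (0, 1, 2):
        raise ValueError("allele count must be 0, 1 or 2")
    if model == "additive":      # linear in the number of minor alleles
        return [float(count)]
    if model == "dominant":      # one copy is enough for the full effect
        return [1.0 if count >= 1 else 0.0]
    if model == "recessive":     # effect only with two copies
        return [1.0 if count == 2 else 0.0]
    if model == "genotypic":     # 2 DF: separate indicators for 1 and 2 copies
        return [1.0 if count == 1 else 0.0, 1.0 if count == 2 else 0.0]
    raise ValueError(f"unknown model: {model}")


def expected_genotype_counts(n_subjects, maf):
    """Expected subjects with 0, 1, 2 minor alleles, assuming Hardy-Weinberg equilibrium."""
    p, q = maf, 1.0 - maf
    return (n_subjects * q * q, n_subjects * 2 * p * q, n_subjects * p * p)


# The slide's example: 100 subjects and a 10% minor allele frequency.
counts = expected_genotype_counts(100, 0.10)
print([round(c, 2) for c in counts])
```

Under these assumptions the last entry is the expected number of rare homozygotes, which is what makes a single subject so influential in a sample of this size.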
[Figure: Gray Matter Volume (Y) against allele count Xj (0, 1, 2)]

Mass Univariate Modelling: Genetic Effects
•  Ad hoc solution
   –  If expected rare genotype frequency <10%, merge genotypes
•  If MAF > 0.31 (≈ √0.1)
   –  2DF Genotypic model
      •  Additive + Nonadditive Parameterization
         –  Additive [ -1 0 +1 ]
         –  Nonadditive [ -1/2 +1 -1/2 ] (orthogonalize w.r.t. additive regressor *)
•  If MAF < 0.31
   –  Use dominant/recessive model
[Figure legend: tested; not tested]

Mass Univariate Modelling: Nuisance Effects
•  Age & Gender
   –  Substantial normal variation in GM w/ Age
•  Total Gray matter
   –  Accounts for differences in head size
   –  Discounts global changes to find localized changes
•  Scanner (Pre/Post Upgrade)
   –  Upgrade 2/3-through study altered image contrast
•  Medication (Yes/No, for cases only)
   –  Neurotrophic effects reported for some Rx

Model Diagnosis for Imaging
•  Why bother?
   –  Largish n, continuous data, Central Limit Theorem should carry us
   –  Type I Error generally OK due to robustness of t-test/ANOVA-like models
•  Sensitivity!
   –  Decreased sensitivity due to inflated error variance σ²
   –  Suboptimal sensitivity due to non-normality
•  How!?
   –  100,000 voxels, 400 subjects
   –  100,000 QQ plots to look at all 40 million data points?
[Figure: Failed GM segmentation due to data formatting error]
[Figure: Warping artefacts seen in modulated GM]

Model Diagnosis for Imaging
•  Model summaries
   –  Images of diagnostic stats
•  Scan summaries
   –  Vectors of ad hoc measures
•  Dynamic graphical tool
   –  Explore many summaries simultaneously
   –  Easily jump from summary image to plots, from plots to residual images
•  End Result
   –  Swiftly localize and understand problems
Luo & Nichols NeuroImage 19:1014–1032, 2003

Model Summaries
   Statistic         Assesses         Null Distn
   Cook-Weisberg     Var(εi) = σ²     Chi-Squared
   Shapiro-Wilk      ε ~ Normal       (tabulated)
   Outlier Count     Artifacts        Binomial
   Std. Deviation    Artifacts

Scan Summaries
   Summary                                         Interpretation
   Global intensity                                Whole-brain signals or artifacts
   Outlier Count                                   Artifacts
   Any preprocessing parameters, e.g. head size    Suggests cause of artifacts
   Experimental predictors                         For investigating mismodelled signal in residuals

http://go.warwick.ac.uk/tenichols/software

Model Diagnosis w/ SPMd
[Figure panels: Model Summaries; Scan Summaries; Model Detail; Scan Detail]

Outline / Motivation
•  Data
   –  Intro to Voxel Based Morphometry data
•  Model
   –  Quantitative trait regression w/ Mass Univariate Model
•  Diagnosis
   –  100,000 Q-Q plots anyone?
•  Inference
   –  Cluster size under nonstationarity
   –  Candidate screening procedure
•  Results
   –  GSK3β in MDD
•  Future Directions

Inference On Images: Voxel-wise vs. Cluster-wise
•  Voxel-wise
   –  Reject H0, point-by-point, by statistic magnitude
•  Cluster-wise
   –  Define contiguous blobs with arbitrary threshold u_clus
   –  Reject H0 for each cluster larger than k_α
[Figure: statistic image over space, thresholded at u_clus; a cluster smaller than k_α is not significant, a cluster larger than k_α is significant]

Cluster Inference & Stationarity
•  Cluster-wise preferred over voxel-wise
   –  Generally more sensitive (Friston et al, NeuroImage 4:223-235, 1996)
   –  Spatially-extended signals typical
•  Problem w/ VBM
   –  Standard cluster methods assume stationarity, constant smoothness
   –  Assuming stationarity, false positive clusters will be found in extra-smooth regions
   –  VBM noise very non-stationary
•  Nonstationary cluster inference
   –  Must un-warp nonstationarity
   –  Reported but not implemented
      •  Hayasaka et al, NeuroImage 22:676–687, 2004
   –  Now available as SPM toolbox
      •  http://fmri.wfubmc.edu/cms/software#NS
[Figure: VBM image of FWHM noise smoothness]
[Figure: Nonstationary noise… …warped to stationarity]

Inference in Imaging Genetics: Creeping Multiple Testing Problem
•  Even just with candidate analyses, can end up searching over…
   –  Genes
   –  SNPs within a gene
   –  Space (voxels or clusters)
   –  Different contrasts on GLM
      •  Main effect? By clinical subgroup? Interactions?
•  Can quickly lose confidence in results
   –  E.g. 0.005 FWE-corrected is great… unless it’s the 25th statistic image you’ve seen

Inference in Imaging Genetics: Multiple Testing Strategy
•  Define strict primary outcome
   –  For given gene, use single SNP
      •  Best (large) association study significance, otherwise
      •  Best nonsynonymous exonic available, otherwise
      •  Best 5’ intronic available
   –  For each SNP, only consider main effect of gene
      •  If fitting gene x group interaction, test for average effect
         –  Any association is more likely than a disease-specific association
         –  Even if disease-specific association, opposing sign of effect unlikely w/ VBM
   –  1-number summary per gene
      •  Minimum nonstationary cluster FWE-corrected P-value for association (1 DF F-stat)
   –  Bonferroni correction for number of genes
•  Primary outcomes then have strong FWE control
   –  Over brain, over genes
   –  (1-α)100% confidence of no false positives anywhere
•  Secondary outcomes
   –  Interactions, sub-group results
   –  Use same FWE-inferences, but mark as post-hoc

Results: Model Diagnosis. Outlier Detection with Shapiro-Wilk
[Figure panels: -log10 P Shapiro-Wilk, with two outliers marked; Mean Smoothed Mod. GM]

Results: Model Diagnosis. Characterising Outliers with Standardized Residual Images
[Figure panels: standardized residual images for Subject 193, Subject 194 (outlier) and Subject 195; Subject 194 raw T1]
•  Note: Compare standardized residuals to +/-6.128 (Bonferroni for 324 images, each with 173,823 voxels, at each a 2-sided test)
•  Severe enlargement of inferior horn of lateral ventricle

Results: Outlier Exploration
[Figure panels: Subject 194 (outlier); randomly selected control]
•  Inferior Horn of Lateral Ventricle
   –  In most of us, this is a pencil-lead-thick fluid-filled space
   –  In this subject it was pencil-thick
   –  Clinical collaborator verified it as abnormal & subject was removed

GSK3β and Structural Differences
•  2 SNPs in strong linkage disequilibrium showed significant associations with GM differences in MDD patients: rs6438552, rs12630592
•  Brain regions where SNP clusters show colocalization.
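As a worked check of the +/-6.128 threshold quoted for the standardized residual images, the sketch below applies a Bonferroni correction over every voxel of every image for a two-sided test. Two things here are assumptions rather than statements from the slides: the family-wise level is taken to be 0.05, and standardized residuals are treated as standard Normal under the null. The function name is hypothetical.

```python
from statistics import NormalDist


def bonferroni_z_threshold(n_images, n_voxels, alpha=0.05):
    """Two-sided Normal threshold after Bonferroni over n_images * n_voxels tests.

    alpha = 0.05 is an assumed family-wise level; the slides do not state it.
    """
    n_tests = n_images * n_voxels
    tail = alpha / n_tests / 2.0           # per-test, per-tail probability
    # Use the lower tail and flip the sign to avoid computing 1 - (tiny number).
    return -NormalDist().inv_cdf(tail)


z = bonferroni_z_threshold(324, 173_823)
print(f"compare |standardized residual| to {z:.3f}")
```

With the slide's 324 images and 173,823 voxels this gives a value close to the quoted 6.128, which suggests the slide used the same kind of calculation.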
[Figure: GSK3β-Gray Matter association in bilateral superior temporal gyrus (STG) and right hippocampus]
•  AA genotype group associated with decreased GM concentration in right STG
   –  P = 0.0004 (corrected for whole brain search and multiple SNP testing)
•  rs6438552 is a putative functional SNP, i.e. it regulates the selection of splice acceptor sites in vitro.
Inkster, B., Nichols, T. E., Saemann, P. G., Auer, D. P., Holsboer, F., Muglia, P., & Matthews, P. M. (2009). Association of GSK3β Polymorphisms With Brain Structural Changes in Major Depressive Disorder. Archives of General Psychiatry, 66(7), 721-728.

Wnt Signaling Pathways
[Figure: Wnt signaling pathway, which regulates the development of the hippocampus. Genes shown: WNT3A, FZD3, KRM1, DVL2, CTNNB1, AXIN2, TCF4, LEF1, SMAD1, PPARgC1a, EMX2, ZEB2]

WNT pathway genes
[Figure panels: ZEB2, FZD3, DVL2, AXIN2, GSK3β, SMAD1, PPARGCA1, EMX2]
Inkster, B., Nichols, T. E., Saemann, P. G., Auer, D. P., Holsboer, F., Muglia, P., & Matthews, P. M. (2010). Pathway-based approaches to imaging genetics association studies: WNT signaling, GSK3beta substrates and major depression. NeuroImage, 53(3), 908-917.

Outline
•  Background
   –  Structural brain imaging & VBM
   –  Genetics
   –  “Imaging Genetics”
•  Candidate SNP VBM
•  Multivariate SNP analyses

•  Voxel/Region QTL
   –  Whole genome association
   –  Must have right ROI
•  Candidate SNP
   –  Full image result
   –  Must have right SNP
•  Full cross analysis
   –  Massive multiple testing problem!
[Figure labels: 100,000 voxels; 500,000 SNPs ≈ 10^6 tests; 500,000 SNPs × 100,000 voxels ≈ 10^10 tests!]
[Figure: Possible Mass-Univariate Analyses; 100,000 voxels ≈ 10^5 tests]

Multivariate Regression
   Y = X C + E
   Y: Images, N × NV
   X: Genotypes, N × NG
   C: Regression Coefficients, NG × NV
   E: Error, N × NV
   N = # subjects, NV = # voxels/ROIs, NG = # genes/SNPs
•  Silly…
   –  If N > NG, fit equivalent to NV univariate models fit independently
   –  Much redundancy in C
      •  rank{C} ≤ min(NV, NG) ≪ NV ∙ NG

Reduced Rank Regression
   Y = X B A + E
   Y: Images, N × NV
   X: Genotypes, N × NG
   B: Genotype Coefficients, NG × r
   A: Image Coefficients, r × NV
   E: Error, N × NV
   N = # subjects, NV = # voxels/ROIs, NG = # genes/SNPs
•  Fix rank r
•  Approximate C ≈ BA, with B & A each rank r

Sparse Reduced Rank Regression
   Y = X B A + E
   Y: Images, N × NV
   X: Genotypes, N × NG
   B: Sparse Genotype Coefficients, NG × r
   A: Sparse Image Coefficients, r × NV
   E: Error, N × NV
   N = # subjects, NV = # voxels/ROIs, NG = # genes/SNPs
•  Fix rank r
•  Approximate C ≈ BA, with B & A each rank r
•  Enforce sparsity
Vounou, M., Nichols, T. E., & Montana, G. (2010). Discovering genetic associations with high-dimensional neuroimaging phenotypes: A sparse reduced-rank regression approach. NeuroImage, 53(3), 1147-59.

Sparse Reduced Rank Regression - Estimation
•  RRR
   –  Y = X B A + E
   –  For fixed rank r, find A & B that minimize M = tr { (Y−XBA) Γ (Y−XBA)’ } for some NV × NV matrix Γ, e.g. Γ = I
•  SRRR
   –  For rank 1, find a & b that minimize M = tr { (Y−Xba’) Γ (Y−Xba’)’ } + λa||a||1 + λb||b||1
   –  Then subtract Xba’ from the data, and repeat
   –  Need to specify final rank r, λa & λb
      •  Can set λa & λb in terms of #|a|>0 & #|b|>0

Simulation: Phenotype & SNPs
•  Simulated MRI data
   –  ADNI T1 images through SPM5 VBM pipeline
   –  NV = 111 ROIs, placed on VBM data from 189 MCI ADNI subjects
      •  GSK CIC Atlas, based on Harvard-Oxford atlas
   –  Estimate covariance Σ after adjusting for age & gender
   –  Simulate ROI data (for arbitrary N) with covariance Σ
•  Evaluate with realistic genetic population w/ FREGENE
   –  Simulates sequence-level data in large population
   –  Provides 10K individuals, 20Mb chromosome (~180K SNPs)
      •  Chadeau-Hyam, et al. BMC Bioinformatics, 9:364, 2009

Simulation: Phenotype & SNPs
•  FREGENE SNP simulation
   –  Population of 10,000 evolved over 200,000 generations
   –  20Mb simulated
   –  37,748 SNPs with MAF>0.05
   –  Select k=10 causative SNPs
      •  From all possible having MAF=0.2
      •  Used to induce phenotypic effect
         –  But then dropped from consideration
      •  Represents realistic setting, where causative SNP is not seen, but effect captured through local LD
   –  From population of 10,000, repeatedly sample cohorts of size N
•  Simulated association in MRI data
   –  Add genetic effect to Frontal and Temporal ROIs with causative SNPs
      •  γ = 0.06, 0.08, or 0.1 reduction in mean GM in affected ROI
      •  Calibrated to Filippini et al. (2009)
         –  10% reduction in GM in ApoE ε4/ε4 subjects relative to subjects with no ε4 alleles

FREGENE: Evolutionary model of world population
[Figure: Founding population in Africa; expansion; Out of Africa (OoA) split & bottleneck; Asian & European split]
Chadeau-Hyam, et al. BMC Bioinformatics, 9:364, 2009

Why try so hard? Why not rand{0,1,2}^500,000 ?
•  Linkage disequilibrium (LD)
   –  SNPs not independent
   –  Highly structured, heterogeneous dependence
•  Population sub-structure
   –  Ethnic differences & migration patterns induce systematic variation
•  Multivariate analysis
   –  Want realistic multivariate structure in our simulations
The Wellcome Trust Case Control Consortium, Nature 447, 661-678, 2007.

Realistic Phenotype
•  All pairwise GM correlations among NV = 111 ROIs

Realistic Genotypes
•  Correlation of first 1000 simulated SNPs

Simulation Setting: Horse shoes & Imaging Genetics
•  “True positive” with missing causative SNP
   –  Declare true positive if LD coefficient close enough
•  LD-linked SNPs
   –  Of 1990 SNPs
   –  51 linked (r>0.8) to one or more of the 10 causative SNPs

SRRR Simulation Results
•  Power to detect 1 or more SNPs (NG=1990)
•  For ranks r = 1,2,3 dominates Mass Uni.
   –  Better for higher r

SRRR Simulation Results
•  Power to detect 1 or more SNPs (NG=1990)
•  For ranks r = 1,2,3 dominates Mass Uni.
   –  Better for higher r; here r = 3, high eff. size.

SRRR Simulation Results
•  Power to detect 1 or more ROIs
•  Less difference
   –  Power can be manipulated by varying λ by rank

SRRR: Multivariate vs. Mass-Univariate
•  Does this NG=1990 result generalize?
•  For up to 40k SNPs
   –  r = 3, med. effect size, N=1000
   –  Power 2-5× greater
   –  Absolute power still tiny

SRRR Simulation Results
•  Power to detect 1 or more SNPs (NG=1990)
•  For ranks r = 1,2,3 dominates Mass Uni.
   –  Better for higher r; here r = 3

Sparse Reduced Rank Regression for SNP – MRI Association
•  Detailed simulation of imaging & genetic correlation structure
   –  Suggests multivariate approach will outperform mass-univariate
   –  Power tiny, in any event
•  Much work to do
   –  Haven’t addressed how to optimize phenotype
   –  Haven’t tried to estimate penalty parameters λa, λb or r
      •  Currently investigating stability selection
         –  See #316 Le Floch et al

Conclusions
•  VBM
   –  Powerful, automated anatomical analysis
   –  Need careful raw data, preprocessing & model QC
•  Imaging Genetics
   –  Mash-up of two large-data, massive multiple testing problems
•  Candidate SNP VBM
   –  Given a SNP, just like a traditional imaging analysis
   –  Multiple SNPs possible too, but need combining methods
•  Multivariate Sparse Reduced Rank Regression
   –  Promising, but little power unless you have 1,000’s of subjects
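The rank-1 SRRR step from the estimation slide can be illustrated with a small pure-Python sketch. It is not the authors' implementation: it assumes Γ = I and additionally treats X’X as the identity, under which minimizing tr{(Y−Xba’)(Y−Xba’)’} + λa||a||1 + λb||b||1 over one vector with the other held fixed reduces to soft-thresholding. All function and variable names are hypothetical, and the toy data are made up for the example.

```python
import math


def soft_threshold(v, t):
    """Shrink each entry toward zero by t; entries with |x| <= t become 0."""
    return [math.copysign(max(abs(x) - t, 0.0), x) for x in v]


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def sq_norm(v):
    return sum(x * x for x in v)


def rank1_srrr(X, Y, lam_a, lam_b, n_iter=200):
    """Rank-1 sparse fit Y ~ X b a' by alternating soft-thresholded updates.

    Assumptions (illustration only): Gamma = I and X'X treated as identity.
    Returns (a, b): image weights (length NV) and SNP weights (length NG).
    """
    Yt = transpose(Y)
    M = [matvec(Yt, x_col) for x_col in transpose(X)]   # M = X'Y, NG x NV
    Mt = transpose(M)
    n_g, n_v = len(M), len(Mt)
    a = [1.0 / math.sqrt(n_v)] * n_v                    # arbitrary unit-norm start
    b = [0.0] * n_g
    for _ in range(n_iter):
        na = sq_norm(a)
        b = [x / na for x in soft_threshold(matvec(M, a), lam_b / 2.0)]
        nb = sq_norm(b)
        if nb == 0.0:                                   # penalty removed every SNP
            return [0.0] * n_v, [0.0] * n_g
        a = [x / nb for x in soft_threshold(matvec(Mt, b), lam_a / 2.0)]
        if sq_norm(a) == 0.0:                           # penalty removed every ROI
            return [0.0] * n_v, [0.0] * n_g
    return a, b


# Toy data: 4 subjects, 4 "SNPs" (X is the identity, so X'X = I holds exactly),
# 3 "ROIs"; only SNP 0 and ROI 0 carry a sizeable effect.
X_demo = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
b_true = [3.0, 0.0, 0.1, 0.0]
a_true = [1.0, 0.02, 0.0]
Y_demo = [[bi * aj for aj in a_true] for bi in b_true]

a_hat, b_hat = rank1_srrr(X_demo, Y_demo, lam_a=0.5, lam_b=0.5)
print("nonzero SNP weights:", [g for g, w in enumerate(b_hat) if w != 0.0])
print("nonzero ROI weights:", [v for v, w in enumerate(a_hat) if w != 0.0])
```

For higher ranks, the slide's recipe would subtract Xba’ from Y and run the same step again on the residual. Real SNP data are correlated through LD, so the identity approximation for X’X is a simplification that should be checked before relying on a scheme like this.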
