Univariate Approaches: Multiple Testing & Voxelwise Whole Genome Association
Jason L. Stein
Laboratory of Neuro Imaging, University of California, Los Angeles
steinja@ucla.edu
June 26, 2011

Brain structure is highly heritable
• Must be specific genetic variants explaining the high heritability (most of which are unknown)
(Kremen et al., 2010)

Two reasons to use genetic association on imaging data
(1) Interested in finding the genetic variants that influence the brain structures/functions of interest
(2) Interested in the genetic variants that influence disease state and believe that brain traits are quantitative traits closer to the genetics (greater penetrance)
(adapted from Andy Saykin)

Imaging Genetics Menu
Imaging (Imager): Candidate ROI, Many ROI, Voxelwise
Genetics (Geneticist): Candidate SNP, Candidate Gene, Genome-wide SNP, Genome-wide Gene

Characterizing the Effect of a Known Variant: rs11136000 (CLU)
Genome-wide association identifies variant within the CLU gene in ~4000 Alzheimer’s patients and ~8000 controls – but what does it do?
(Harold et al., 2009)
The Alzheimer’s associated variant broadly affects white matter integrity in a young cohort – may create early predisposition for disease
(Braskie et al., 2009)

Advantages/Disadvantages
Advantages:
• Candidate SNPs allow you to test a specific biological hypothesis
• Strong hypothesis drives clearly interpretable results
• Multiple comparisons burden is reduced (one SNP – many voxels)
• Quick way to provide functional relevance to unbiased genome-wide search results
Disadvantages:
• It is highly likely that we don’t know the genetic underpinnings of a trait like brain structure so in general don’t know the right SNP to pick
• In order to be widely accepted, the variant needs to have strong prior evidence (genome-wide significant in a meta-analysis or have clear function)
• Unable to search the genome, only characterize the effect of a known variant
• Low prior probability of any candidate to have effects on brain structure

Choosing candidate genes is generally on the basis of limited information and therefore excludes the vast majority of genes expressed in the central nervous system (Glatt & Freimer, 2002)
[Figure: Expressed vs. Not Expressed. Percentage of Genes Expressed in Human Cortex (Gene Chip) (Myers et al., 2007); Percentage of Genes Expressed in Mouse Brain (ISH) (Lein et al., 2007)]

We generally don’t know the theoretical genetic underpinnings of a phenotype (Freimer & Sabatti, 2004)

Imaging Genetics Menu
Imaging: Candidate ROI, Many ROI, Voxelwise
Genetics:
• Candidate SNP: CLU & DTI; ZNF804A & functional connectivity
• Candidate Gene
• Genome-wide SNP
• Genome-wide Gene

Gene-based association test
1. Find all the markers in a gene and their correlations
2. Conduct PCA to find the number of components which explain 95% of variance in gene
3. Principal Component Regression: Fit Reduced Model, then Fit Full Model (phenotype on PCs of SNPs)
4. A partial F-test is used to test the joint effect of the SNP PCs statistically controlling for the effects in the reduced model.
5. Get one P-value per gene

Examples of gene-based tests on unitary imaging traits
• GRIN2B association to temporal lobe volume (Hibar et al., 2011)
• SORL1 association to hippocampal volume (Arias-Vasquez et al., in press)

Advantages/Disadvantages
Advantages:
• Candidate genes allow you to test a specific biological hypothesis and group by the functional biological unit
• Reducing multiple comparisons by using only one gene-based test from multiple SNPs
• Allelic heterogeneity taken into account
• Quick way to provide functional relevance to gene associations to disease
Disadvantages:
• Similar problems about choosing the right candidate with strong enough prior evidence
• Unable to search the genome, only characterize the effect of a known variant
• Could be driven by only one SNP so need post-hoc tests to narrow to specific genic region

Imaging Genetics Menu
Imaging: Candidate ROI, Many ROI, Voxelwise
Genetics:
• Candidate SNP: CLU & DTI; ZNF804A & functional connectivity
• Candidate Gene: SORL1 & hippocampal volume
• Genome-wide SNP
• Genome-wide Gene

Voxelwise vs. ROI approach
Dependent on geometry of the signal
• Signal overlaps with ROI definition: ROI more powerful
• Signal does not overlap with ROI definition: Voxelwise more powerful
In search for genetic effects on brain structure – we generally are not clear where they are
(Desikan et al., 2006)

Multiple Testing Problem: 1.8 x 10^10 tests!
~30,000 voxels in the brain ×
~600,000 genetic markers (SNPs)

Multiple Comparisons: GWAS
[Figure: percent volume change by genotype (A/A, A/C, C/C) at one SNP; association tested at 600,000 SNPs by position along genome]
[Figure: P-value histograms. Null P-values: Uniform Distribution. Minimum of independent null P-values: Beta(1,600000) distribution]

Multiple Comparisons Example
Error: Not accounting for multiple comparisons
• Null P-values: Uniform Distribution. 600,000 draws from a uniform distribution (like GWAS on one phenotype).
• Minimum P-value gives very low P-values (1.7057 x 10^-6, 1.1026 x 10^-6)
• Significant! Wow, I have such low P-values! But all of this is randomness (simulated from null distributions)

Accounting for multiple comparisons assuming independence: Beta Distribution
• Beta(1,600000) distribution: models the multiple comparisons by picking the minimum P-value after 600,000 draws from a uniform distribution.
• Adjustment through CDF of Beta(1,600000) gives corrected P-values

Raw P-value | Corrected P-value
1.7057 x 10^-6 | 0.646
1.1026 x 10^-6 | 0.484

Multiple Comparisons: Correlation of Genetic Markers
Linkage disequilibrium (LD; correlation between genetic markers) means that not all tests are independent.
simpleM: a method to determine the effective number of tests conducted, Meff, where Meff ≤ M
1. Create correlation matrix
2. Calculate eigenvalues through PCA
3. Number of principal components which jointly explain 99.5% of variance = Meff
• Similar to permutation derived values
• Correct P-values through a Beta(1, Meff) distribution
(Gao et al., 2008; Gao et al., 2010)

Multiple Comparisons: Correction Across Voxels Through False Discovery Rate (FDR)
[Figure: Signal + Noise; Control of False Discovery Rate at 10%. Percentage of Activated Pixels that are False Positives in each panel: 6.7%, 10.4%, 14.9%, 9.3%, 16.2%, 13.8%, 14.0%, 10.5%, 12.2%, 8.7%]
(Tom Nichols website: http://www.sph.umich.edu/~nichols/FDR/; Genovese et al., 2002)

Histogram visualization
• FDR significant – overrepresentation of low P-values
• Null P-values – no violations of assumptions
• Violation of assumptions – bimodal histogram
• Violation of assumptions – discrete P-value distribution
(Pounds, 2006; Dabney & Storey, 2006)

vGWAS (Stein et al., 2010)

ADNI Dataset
Subjects
• 740 Caucasian subjects to avoid population stratification
• Diagnosis: 173 Alzheimer’s disease patients; 361 Mild Cognitive Impairment; 206 healthy elderly
• Demographics: 75.52 +/- 6.82 years; 438 male
Genetics
• Illumina 610-Quad BeadChip
• Exclusions: genotype call rate < 95%; deviation from Hardy-Weinberg equilibrium P < 5.7 x 10^-7; minor allele frequency < 0.10
• 448,293 SNPs in analysis
Imaging Phenotype
• Tensor Based Morphometry
• Each voxel encodes volume change relative to a study-specific template
• 31,622 voxels in the brain when downsampled to 4x4x4 mm^3 voxels
(Stein et al., 2010)

Computationally Intensive: GWAS on each voxel
• Genome-wide association on each phenotype takes ~9 minutes / phenotype.
• 31,622 voxels * 9 minutes = 198 days of computation!
• Across 300 nodes total computation time is 27h.
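The multiple-comparison corrections described in the preceding slides (Beta(1, M) adjustment of the minimum P-value, Meff from the eigenvalues of the SNP correlation matrix, FDR across voxels) can be sketched roughly as follows. This is a minimal illustration assuming NumPy; the function names are hypothetical and are not taken from simpleM or from the vGWAS code. The Meff step follows the eigenvalue description on the slide rather than the published simpleM implementation, and the FDR step uses the Benjamini-Hochberg step-up procedure as one common choice.

```python
import numpy as np


def beta_min_p_correction(p_min, n_tests):
    """CDF of Beta(1, n_tests) at p_min (hypothetical helper name).

    This is the probability that the smallest of n_tests independent
    uniform P-values is <= p_min, i.e. 1 - (1 - p_min) ** n_tests.
    """
    p_min = np.asarray(p_min, dtype=float)
    # log1p/expm1 keep the result accurate when p_min is very small.
    return -np.expm1(n_tests * np.log1p(-p_min))


def effective_tests(genotypes, var_explained=0.995):
    """Meff as described on the slide (assumes no constant SNP columns).

    genotypes: subjects x SNPs array of allele counts.
    Returns the number of principal components of the SNP correlation
    matrix that jointly explain `var_explained` of the variance.
    """
    corr = np.corrcoef(genotypes, rowvar=False)   # SNP x SNP correlation (LD)
    eigvals = np.linalg.eigvalsh(corr)[::-1]      # eigenvalues, largest first
    eigvals = np.clip(eigvals, 0.0, None)         # remove tiny negative round-off
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(cumulative, var_explained) + 1)


def benjamini_hochberg(p_values, q=0.10):
    """Boolean mask of P-values declared significant at FDR level q."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m
    significant = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])         # largest rank meeting its threshold
        significant[order[:k + 1]] = True
    return significant


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Simulated null data: 50 "voxels" x 1,000 independent "SNP" P-values.
    null_p = rng.uniform(size=(50, 1000))
    min_p = null_p.min(axis=1)                      # minimum P-value at each voxel
    corrected = beta_min_p_correction(min_p, 1000)  # adjust for the SNP search
    n_hits = benjamini_hochberg(corrected, q=0.10).sum()
    print(n_hits, "voxels pass FDR at 10%")
```

With the second raw value from the example slide, `beta_min_p_correction(1.1026e-6, 600000)` comes out near 0.484. Passing Meff instead of M as `n_tests` gives the less conservative correction that accounts for LD.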
http://pipeline.loni.ucla.edu/

[Figure: Raw minimum P-value at each voxel] (Stein et al., 2010)

Most associated SNPs
(Maj, Het, Min = number of subjects in genotype groups)
Chr | Base Pair | SNP | MAF | Maj | Het | Min | Volume (mm^3) | Minimum P-value | Mean P-value | Gene or EST (±50 kb)
6q16.2 | 99778735 | rs2132683 | 0.3257 | 340 | 318 | 82 | 4224 | 2.56 x 10^-10 | 1.01 x 10^-6 |
6q15 | 91474473 | rs713155 | 0.3966 | 274 | 345 | 121 | 7296 | 3.11 x 10^-10 | 5.08 x 10^-7 |
1p35.1 | 34020651 | rs476463 | 0.1203 | 567 | 168 | 5 | 1472 | 3.18 x 10^-10 | 1.27 x 10^-6 | CSMD2
7q31.32 | 121989829 | rs2429582 | 0.3417 | 319 | 331 | 86 | 2496 | 4.23 x 10^-10 | 6.46 x 10^-7 | CADPS2
3p21.31 | 46314816 | rs9990343 | 0.4811 | 197 | 374 | 169 | 2048 | 5.34 x 10^-10 | 4.41 x 10^-7 |
11q23.3 | 115803577 | rs490592 | 0.2149 | 450 | 255 | 29 | 14528 | 1.39 x 10^-9 | 1.32 x 10^-6 |
20q13.12 | 43557937 | rs11696501 | 0.1935 | 480 | 232 | 27 | 768 | 1.41 x 10^-9 | 8.54 x 10^-7 | WFDC2, SPINT3
3p12.1 | 84563758 | rs10511089 | 0.1095 | 589 | 140 | 11 | 1664 | 1.79 x 10^-9 | 6.57 x 10^-7 |
8q23.1 | 108858992 | rs4534106 | 0.3007 | 367 | 301 | 72 | 1984 | 2.27 x 10^-9 | 1.00 x 10^-6 |
6q12 | 67705937 | rs11970254 | 0.3464 | 293 | 358 | 72 | 1024 | 2.29 x 10^-9 | 6.21 x 10^-7 |
9p13.1 | 38030095 | rs7025303 | 0.4061 | 263 | 347 | 124 | 768 | 2.30 x 10^-9 | 1.21 x 10^-6 | SHB
1p36.13 | 19441559 | rs710865 | 0.3824 | 277 | 354 | 103 | 256 | 2.65 x 10^-9 | 1.10 x 10^-7 | KIAA0090, MRTO4, AKR7L
9p13.1 | 38031142 | rs7873102 | 0.3821 | 283 | 340 | 109 | 832 | 2.96 x 10^-9 | 6.42 x 10^-7 | SHB
20p12.1 | 12822585 | rs2073233 | 0.4291 | 234 | 369 | 131 | 2560 | 3.17 x 10^-9 | 1.42 x 10^-6 | BC036700
2q37.3 | 242151629 | rs12479254 | 0.4049 | 255 | 341 | 119 | 1408 | 3.88 x 10^-9 | 5.70 x 10^-7 | BOK, THAP4
16p12.1 | 24439219 | rs11643520 | 0.1160 | 574 | 146 | 11 | 1920 | 4.39 x 10^-9 | 6.06 x 10^-7 | RBBP6
5p12 | 44222425 | rs4296809 | 0.1448 | 539 | 177 | 17 | 12736 | 4.41 x 10^-9 | 8.75 x 10^-7 | BG334794
13q32.2 | 97764318 | rs688872 | 0.3804 | 283 | 345 | 106 | 3392 | 4.68 x 10^-9 | 1.06 x 10^-6 | FARP1
14q22.1 | 51080549 | rs7140150 | 0.4566 | 219 | 353 | 160 | 1856 | 5.78 x 10^-9 | 4.77 x 10^-7 | FRMD6
6p12.3 | 49596867 | rs9473582 | 0.3973 | 274 | 339 | 121 | 4416 | 5.98 x 10^-9 | 8.25 x 10^-7 | GLYATL3
[The Gene or EST column also lists BG436399; its row is not recoverable here.]
• CSMD2: Highest expression in the brain, oligodendroglioma suppressor. Associated with ADHD and addiction.
• CADPS2: Regulates synaptic and large dense core vesicle priming in neurons, associations to autism
(Stein et al., 2010)

[Figure: Most associated voxels for most associated SNPs] (Stein et al., 2010)

Meff Estimation
Meff << M
(Stein et al., 2010)

How well do results fit distributions?
[Figure: Raw P-value distribution; Corrected P-value distribution]
(Stein et al., 2010)

Significance Testing through FDR and pFDR
q-value = 0.25 for SNP rs2132683
(Stein et al., 2010)

Advantages/Disadvantages
Advantages:
• Able to jointly search the genome and imaging space to answer the question “where in the genome and where in the brain”
• An unbiased approach to discovery
• Has some data reduction
Disadvantages:
• A strong association of one voxel to one SNP is hard to interpret, we’re more interested in how a SNP affects many parts of the brain
• Computationally intensive process (several days of processing) with a huge number of statistical tests
• Selecting only the minimum P-value means that we lose a lot of information.

Imaging Genetics Menu
Imaging: Candidate ROI, Many ROI, Voxelwise
Genetics:
• Candidate SNP: CLU & DTI; ZNF804A & functional connectivity
• Candidate Gene: SORL1 & hippocampal volume
• Genome-wide SNP: vGWAS
• Genome-wide Gene

vGeneWAS
(1) Use Tensor Based Morphometry based volume differences as phenotype at each voxel
(2) GeneWAS at each voxel, select minimum P-value
(3) Meff calculation through permutation and then estimating beta parameter of the Beta distribution
(4) CDF of Beta(1,Meff) transformation
(5) FDR
(Hibar et al., 2011)

Power of vGeneWAS
vGeneWAS is more powerful than vGWAS in certain circumstances (Hibar et al., 2011)

Advantages/Disadvantages
Advantages:
• Able to jointly search the genome and imaging space to answer the question “where in the genome and where in the brain”
• An unbiased approach to discovery, grouping by functional unit
• Has a small amount of data reduction because we group by gene, and is more powerful than vGWAS, depending on the effect
Disadvantages:
• A strong association of one voxel to one gene is hard to interpret, we’re more interested in how a gene affects many parts of the brain
• Computationally intensive process (several days of processing) with a huge number of statistical tests
• Selecting only the minimum P-value means that we lose a lot of information about other genes

Imaging Genetics Menu
Imaging: Candidate ROI, Many ROI, Voxelwise
Genetics:
• Candidate SNP: CLU & DTI; ZNF804A & functional connectivity
• Candidate Gene: SORL1 & hippocampal volume
• Genome-wide SNP: vGWAS
• Genome-wide Gene: vGeneWAS

Replication through collaboration
http://enigma.loni.ucla.edu

Useful web resources
• UCSC genome browser: http://genome.ucsc.edu/cgi-bin/hgGateway Genome visualization magic.
• Hapmap: http://hapmap.ncbi.nlm.nih.gov/ Allele frequencies in multiple populations.
• Allen Brain Atlas: http://www.brain-map.org/ See where a gene is expressed.
• Entrez Gene: http://www.ncbi.nlm.nih.gov/gene/ See the gene ontology (what it does).
• dbSNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=snp The database of every documented genetic variation.
• Plink: http://pngu.mgh.harvard.edu/~purcell/plink/ Incredibly useful tool for genome-wide analysis, organization, etc. Excellent documentation.
• dbGaP: http://www.ncbi.nlm.nih.gov/gap/ Database of genotypes and phenotypes.

Acknowledgements
• LONI (UCLA): Paul Thompson (Advisor), Derrek P. Hibar, Neda Jahanshad, Christina Boyle, Xue Hua, Meredith Braskie
• ADNI Genetics Core (Indiana University): Andrew Saykin, Li Shen, Tatiana Foroud, Nathan Pankratz
• Eskin Lab (UCLA): Jae Hoon Sul, Hyun Min Kang, Eleazar Eskin
• QTwin (Australia): Sarah Medland, Margie Wright, Katie McMahon, Nick Martin, Greig de Zubicaray
• NeuroImaging Training Program Training Grant NIH/NIDA 1-T90-DA022768:02
• ARCS Scholar
• UCLA Affiliates Scholarship
• Dr. Ursula Mandel Scholarship
• Pre-doctoral NRSA 1F31MH087061