Univariate Approaches: Multiple Testing & Voxelwise Whole Genome Association

Jason L. Stein
Laboratory of Neuro Imaging
University of California, Los Angeles
steinja@ucla.edu
June 26, 2011
Brain structure is highly heritable
There must be specific genetic variants
explaining the high heritability
(most of which are unknown)
(Kremen et al., 2010)
Two reasons to use genetic
association on imaging data
(1)
Interested in finding the genetic variants that
influence the brain structures/functions of interest
(2)
Interested in the genetic variants that influence disease state and believe that
brain traits are quantitative traits closer to the genetics (greater penetrance)
(adapted from Andy Saykin)
Imaging Genetics Menu
Imaging
Genetics
Candidate SNP
Candidate Gene
Genome-wide SNP
Genome-wide Gene
Candidate ROI
Many ROI
Voxelwise
Characterizing the Effect of a
Known Variant
rs11136000 (CLU)
Genome-wide association identifies variant
within the CLU gene in ~4000 Alzheimer’s
patients and ~8000 controls –
but what does it do?
(Harold et al., 2009)
The Alzheimer’s associated variant
broadly affects white matter integrity
in a young cohort – may create early
predisposition for disease
(Braskie et al., 2009)
Advantages/Disadvantages
Advantages
• Candidate SNPs allow you to test a specific biological hypothesis
• Strong hypothesis drives clearly interpretable results
• Multiple comparisons burden is reduced (one SNP – many voxels)
• Quick way to provide functional relevance to unbiased genome-wide search results
Disadvantages
• It is highly likely that we don’t know the genetic underpinnings of a trait like brain structure, so in general we don’t know the right SNP to pick
• In order to be widely accepted, the variant needs to have strong prior evidence (genome-wide significant in a meta-analysis or have clear function)
• Unable to search the genome, only characterize the effect of a known variant
• Low prior probability of any candidate to have effects on brain structure
Choosing candidate genes is generally on the basis of limited
information and therefore excludes the vast majority of genes
expressed in the central nervous system (Glatt & Freimer, 2002)
[Figure: Percentage of genes expressed (Expressed vs. Not Expressed) in human cortex (Gene Chip; Myers et al., 2007) and in mouse brain (ISH; Lein et al., 2007)]
We generally don’t know the theoretical
genetic underpinnings of a phenotype
(Freimer & Sabatti, 2004)
Imaging Genetics Menu
Imaging
Candidate ROI
Many ROI
Voxelwise
Genetics
Candidate SNP
CLU & DTI
ZNF804A &
functional
connectivity
Candidate Gene
Genome-wide SNP
Genome-wide Gene
Gene-based association test
Principal Component Regression
1. Find all the markers in a gene and their correlations
2. Conduct PCA to find the number of components which explain 95% of variance in gene
3. Fit Reduced Model
4. Fit Full Model (phenotype regressed on the PCs of SNPs)
5. A partial F-test is used to test the joint effect of the SNP PCs, statistically controlling for the effects in the reduced model
6. Get one P-value per gene
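A minimal sketch of the principal component regression test described on this slide, assuming NumPy and SciPy are available. The function name `gene_based_pvalue`, its arguments, and the simulated genotypes are hypothetical and only illustrate the idea; the reduced model here is an intercept plus optional covariates, which is an assumption since the slide does not list the reduced-model terms.

```python
import numpy as np
from scipy import stats


def gene_based_pvalue(phenotype, genotypes, covariates=None, var_explained=0.95):
    """Partial F-test on SNP principal components for one gene (sketch).

    phenotype: (n,) trait values.
    genotypes: (n, m) allele counts for the SNPs in the gene
               (assumes every SNP is polymorphic in the sample).
    covariates: optional (n, c) reduced-model terms.
    Returns (p_value, number_of_components_kept).
    """
    y = np.asarray(phenotype, dtype=float)
    G = np.asarray(genotypes, dtype=float)
    n = y.shape[0]

    # Standardise SNPs so the PCA reflects their correlation structure.
    Z = (G - G.mean(axis=0)) / G.std(axis=0)
    # PCA via SVD; squared singular values are proportional to variance.
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    frac = np.cumsum(s**2) / np.sum(s**2)
    k = int(min(np.searchsorted(frac, var_explained) + 1, s.size))
    pcs = U[:, :k] * s[:k]

    # Reduced model: intercept plus any covariates.
    X_red = np.ones((n, 1))
    if covariates is not None:
        X_red = np.column_stack([X_red, np.asarray(covariates, dtype=float)])
    # Full model: reduced-model terms plus the retained SNP PCs.
    X_full = np.column_stack([X_red, pcs])

    def rss(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return float(resid @ resid)

    rss_red, rss_full = rss(X_red), rss(X_full)
    df_num = k
    df_den = n - X_full.shape[1]
    f_stat = ((rss_red - rss_full) / df_num) / (rss_full / df_den)
    return float(stats.f.sf(f_stat, df_num, df_den)), k


# Toy data: 300 subjects, 8 simulated SNPs (not real genotypes).
rng = np.random.default_rng(0)
snps = rng.binomial(2, 0.3, size=(300, 8)).astype(float)
noise = rng.normal(size=300)
p_null, k_null = gene_based_pvalue(noise, snps)               # no genetic effect
p_alt, k_alt = gene_based_pvalue(snps[:, 0] + noise, snps)    # first SNP has an effect
print(p_null, p_alt, k_alt)
```

Each gene yields a single P-value, so the number of tests scales with the number of genes rather than the number of SNPs.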
Examples of gene-based tests
on unitary imaging traits
GRIN2B association to temporal
lobe volume
(Hibar et al., 2011)
SORL1 association to hippocampal
volume
(Arias-Vasquez et al., in press)
Advantages/Disadvantages
Advantages
• Candidate genes allow you to test a specific biological hypothesis and group by the functional biological unit
• Reducing multiple comparisons by using only one gene-based test from multiple SNPs
• Allelic heterogeneity taken into account
• Quick way to provide functional relevance to gene associations to disease
Disadvantages
• Similar problems about choosing the right candidate with strong enough prior evidence
• Unable to search the genome, only characterize the effect of a known variant
• Could be driven by only one SNP so need post-hoc tests to narrow to specific genic region
Imaging Genetics Menu
Imaging
Candidate ROI
Many ROI
Voxelwise
Genetics
CLU & DTI
Candidate SNP
ZNF804A &
functional
connectivity
Candidate Gene
Genome-wide SNP
Genome-wide Gene
SORL1 &
hippocampal
volume
Voxelwise vs. ROI approach
Dependent on geometry of the signal
= signal
Signal overlaps with ROI definition
ROI more powerful
Signal does not overlap with ROI definition
Voxelwise more powerful
In search for genetic effects on brain structure –
we generally are not clear where they are
(Desikan et al., 2006)
Multiple Testing Problem
~30,000 voxels in the brain x ~600,000 genetic markers (SNPs) = 1.8 x 10^10 tests!
Multiple Comparisons: GWAS
[Figure: GWAS across 600,000 SNPs, plotted by position along genome; inset for one SNP showing percent volume change by genotype (A/A, A/C, C/C)]
[Figure: P-value histograms. Null P-values: Uniform distribution. Independent null P-values: Beta(1,600000) distribution]
Multiple Comparisons Example
Error: Not accounting for multiple comparisons
Null P-values: Uniform Distribution
600,000 draws from a uniform distribution
(like GWAS on one phenotype). Taking the minimum P-value gives very low P-values
(1.7057 x 10^-6, 1.1026 x 10^-6)
Significant!
Wow, I have such low P-values! But all of this is
randomness (simulated from null distributions)
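The null simulation on this slide can be sketched as follows, assuming NumPy. The seed is arbitrary, so the smallest P-value printed will differ from the two values quoted on the slide.

```python
import numpy as np

rng = np.random.default_rng(2011)  # arbitrary seed, for reproducibility only
n_tests = 600_000                  # one null test per SNP

# Under the null hypothesis each P-value is Uniform(0, 1).
null_p = rng.uniform(size=n_tests)

min_p = float(null_p.min())
n_nominal = int((null_p < 0.05).sum())

print(f"smallest null P-value: {min_p:.3e}")
print(f"tests with P < 0.05:   {n_nominal} of {n_tests}")
```

Even with no true signal, roughly 5% of the tests fall below 0.05 and the minimum is tiny, which is why the minimum has to be judged against its own null distribution.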
Accounting for multiple comparisons assuming
independence: Beta Distribution
Beta(1,600000) distribution
Models the multiple comparisons by picking the
minimum P-value after 600,000 draws from a
uniform distribution.
Adjustment through CDF of Beta(1,600000)
gives corrected P-values
Raw P-value | Corrected P-value
1.7057 x 10^-6 | 0.646
1.1026 x 10^-6 | 0.484
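The correction through the CDF of Beta(1, M) has the closed form 1 - (1 - p)^M. The sketch below uses only the Python standard library; the function name `beta_min_p_correction` is hypothetical.

```python
import math


def beta_min_p_correction(p_raw, n_tests):
    """CDF of Beta(1, n_tests) at p_raw, i.e. 1 - (1 - p_raw)**n_tests.

    Assumes 0 <= p_raw < 1 and independent tests.
    """
    # log1p/expm1 keep precision when p_raw is tiny and n_tests is large.
    return -math.expm1(n_tests * math.log1p(-p_raw))


corrected = [beta_min_p_correction(p, 600_000) for p in (1.7057e-6, 1.1026e-6)]
for p_corr in corrected:
    print(round(p_corr, 3))
```

Under this closed form the second raw value maps to about 0.484, as tabulated, and the first evaluates to about 0.641. Neither is significant after correction.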
Multiple Comparisons:
Correlation of Genetic Markers
Linkage disequilibrium (LD; correlation
between genetic markers) means that the
tests are not all independent.
simpleM: a method to determine the effective number of tests conducted, Meff, where Meff ≤ M
1. Create correlation matrix
2. Calculate eigenvalues through PCA
3. Number of principal components which jointly explain 99.5% of variance = Meff
Correct P-values through a Beta(1, Meff) distribution
Similar to permutation derived values
(Gao et al., 2008; Gao et al., 2010)
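The three simpleM steps can be sketched on a single genotype matrix, assuming NumPy. The function name `effective_number_of_tests` and the toy genotypes are hypothetical; how the published method handles genome-scale data is not shown on the slide and is not reproduced here.

```python
import numpy as np


def effective_number_of_tests(genotypes, var_explained=0.995):
    """Number of principal components explaining var_explained of the variance."""
    G = np.asarray(genotypes, dtype=float)
    # Step 1: SNP-by-SNP correlation matrix (columns are SNPs).
    corr = np.corrcoef(G, rowvar=False)
    # Step 2: eigenvalues, sorted from largest to smallest.
    eigvals = np.linalg.eigvalsh(corr)[::-1]
    eigvals = np.clip(eigvals, 0.0, None)  # remove tiny negative round-off
    # Step 3: smallest number of components reaching the variance target.
    frac = np.cumsum(eigvals) / eigvals.sum()
    return int(min(np.searchsorted(frac, var_explained) + 1, eigvals.size))


rng = np.random.default_rng(0)
independent = rng.binomial(2, 0.3, size=(500, 5))
# Each SNP copied four times: 20 markers in perfect LD within blocks of 4.
genotypes = np.repeat(independent, 4, axis=1)
m_eff = effective_number_of_tests(genotypes)
print(genotypes.shape[1], m_eff)
```

In this toy case M = 20 markers carry only five independent signals, so Meff is much smaller than M.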
Multiple Comparisons: Correction Across
Voxels Through False Discovery Rate (FDR)
[Figure: ten realizations of Signal + Noise with control of False Discovery Rate at 10%. Percentage of activated pixels that are false positives: 6.7%, 10.4%, 14.9%, 9.3%, 16.2%, 13.8%, 14.0%, 10.5%, 12.2%, 8.7%]
(Tom Nichols website: http://www.sph.umich.edu/~nichols/FDR/; Genovese et al., 2002)
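A minimal sketch of FDR control, assuming NumPy and using the Benjamini-Hochberg step-up procedure. The function name `benjamini_hochberg` and the example P-values are illustrative only.

```python
import numpy as np


def benjamini_hochberg(p_values, q=0.10):
    """Boolean mask of tests rejected while controlling the FDR at level q."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Compare the i-th smallest P-value with q * i / m.
    thresholds = q * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if passed.any():
        # Reject everything up to the largest rank that passes.
        k = int(np.max(np.nonzero(passed)[0]))
        rejected[order[: k + 1]] = True
    return rejected


example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(example_p, q=0.05)
print(int(rejected.sum()), "of", len(example_p), "tests rejected")
```

Applied across voxels, the input would be one corrected P-value per voxel.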
Histogram visualization
FDR significant – overrepresentation of low P-values
Null P-values – no violations of assumptions
Violation of assumptions – bimodal histogram
Violation of assumptions – discrete P-value distribution
(Pounds, 2006; Dabney & Storey, 2006)
vGWAS
(Stein et al., 2010)
ADNI Dataset
Subjects
740 Caucasian subjects to avoid population stratification
Diagnosis:
• 173 Alzheimer's disease patients
• 361 Mild Cognitive Impairment
• 206 healthy elderly
Demographics:
• 75.52 +/- 6.82 years
• 438 male
Genetics
Illumina 610-Quad BeadChip
Exclusions:
• genotype call rate < 95%
• deviation from Hardy-Weinberg equilibrium P < 5.7 x 10^-7
• minor allele frequency < 0.10
448,293 SNPs in analysis
Imaging Phenotype
Tensor Based Morphometry
Each voxel encodes volume change relative to a study-specific template
31,622 voxels in the brain when downsampled to 4x4x4 mm^3 voxels
(Stein et al., 2010)
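The three SNP exclusion criteria can be sketched as a per-SNP filter, assuming NumPy. The function name `passes_qc`, the 0/1/2 genotype coding with NaN for missing calls, and the toy genotype vectors are hypothetical. The slide does not say which Hardy-Weinberg test was used; a 1-df chi-square goodness-of-fit test stands in for it here.

```python
import math

import numpy as np


def passes_qc(genotype, min_call_rate=0.95, min_maf=0.10, hwe_p_min=5.7e-7):
    """True if one SNP survives the call-rate, MAF and HWE filters (sketch).

    genotype: allele counts coded 0/1/2, with NaN for missing calls.
    """
    g = np.asarray(genotype, dtype=float)
    called = g[~np.isnan(g)]
    if called.size / g.size < min_call_rate:
        return False

    freq = called.mean() / 2.0            # frequency of the counted allele
    maf = min(freq, 1.0 - freq)
    if maf < min_maf:
        return False

    # Chi-square goodness of fit to Hardy-Weinberg proportions (assumption:
    # stand-in for whichever HWE test was actually applied).
    n = called.size
    observed = np.array([(called == c).sum() for c in (0, 1, 2)], dtype=float)
    expected = n * np.array([(1 - freq) ** 2, 2 * freq * (1 - freq), freq**2])
    chi2 = float(((observed - expected) ** 2 / expected).sum())
    hwe_p = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square, 1 df
    return hwe_p >= hwe_p_min


in_hwe = np.repeat([0.0, 1.0, 2.0], [490, 420, 90])   # exact HWE at frequency 0.3
all_het = np.ones(1000)                                # gross HWE violation
rare = np.repeat([0.0, 1.0], [980, 20])                # MAF = 0.01
low_call = in_hwe.copy()
low_call[:100] = np.nan                                # call rate = 90%

print(passes_qc(in_hwe), passes_qc(all_het), passes_qc(rare), passes_qc(low_call))
```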
Computationally Intensive:
GWAS on each voxel
Genome-wide association on each phenotype takes ~9 minutes / phenotype.
31,622 voxels * 9 minutes = 198 days of computation!
Across 300 nodes total computation time is 27h.
http://pipeline.loni.ucla.edu/
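The computation-time figures can be checked with a few lines of arithmetic. The "ideal" parallel time below assumes a perfect split of the work across nodes with no overhead, which is an assumption and not something stated on the slide.

```python
n_voxels = 31_622
minutes_per_gwas = 9
n_nodes = 300

serial_minutes = n_voxels * minutes_per_gwas
serial_days = serial_minutes / (60 * 24)
# Assumes the work divides evenly across nodes with no scheduling overhead.
ideal_parallel_hours = serial_minutes / 60 / n_nodes

print(f"serial: {serial_days:.0f} days")
print(f"ideal split over {n_nodes} nodes: {ideal_parallel_hours:.1f} h")
```

The serial estimate reproduces the 198 days quoted above. The ideal split comes to under 16 hours, so the reported 27 h presumably includes queueing and input/output overhead.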
Raw minimum P-value at each voxel
(Stein et al., 2010)
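Selecting the minimum P-value at each voxel can be sketched as below, assuming NumPy and a voxels-by-SNPs array of P-values. The function name `min_p_per_voxel` and the toy matrix are hypothetical; at full scale the matrix would be too large to hold in memory and would be processed one voxel at a time.

```python
import numpy as np


def min_p_per_voxel(p_values):
    """Return (minimum P-value, index of the winning SNP) for every voxel."""
    p = np.asarray(p_values, dtype=float)
    best_snp = p.argmin(axis=1)
    min_p = p[np.arange(p.shape[0]), best_snp]
    return min_p, best_snp


# Toy matrix: 2 voxels (rows) by 3 SNPs (columns).
toy = np.array([[0.20, 0.01, 0.50],
                [0.03, 0.40, 0.90]])
min_p, best_snp = min_p_per_voxel(toy)
print(min_p, best_snp)
```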
Most associated SNPs
Maj, Het, Min = number of subjects in genotype groups
Chr | Base Pair | SNP | MAF | Maj | Het | Min | Volume (mm3) | Minimum P-value | Mean P-value | Gene or EST (±50 kb)
6q16.2 | 99778735 | rs2132683 | 0.3257 | 340 | 318 | 82 | 4224 | 2.56 x 10^-10 | 1.01 x 10^-6 |
6q15 | 91474473 | rs713155 | 0.3966 | 274 | 345 | 121 | 7296 | 3.11 x 10^-10 | 5.08 x 10^-7 |
1p35.1 | 34020651 | rs476463 | 0.1203 | 567 | 168 | 5 | 1472 | 3.18 x 10^-10 | 1.27 x 10^-6 | CSMD2
7q31.32 | 121989829 | rs2429582 | 0.3417 | 319 | 331 | 86 | 2496 | 4.23 x 10^-10 | 6.46 x 10^-7 | CADPS2
3p21.31 | 46314816 | rs9990343 | 0.4811 | 197 | 374 | 169 | 2048 | 5.34 x 10^-10 | 4.41 x 10^-7 |
11q23.3 | 115803577 | rs490592 | 0.2149 | 450 | 255 | 29 | 14528 | 1.39 x 10^-9 | 1.32 x 10^-6 |
20q13.12 | 43557937 | rs11696501 | 0.1935 | 480 | 232 | 27 | 768 | 1.41 x 10^-9 | 8.54 x 10^-7 | WFDC2, SPINT3
3p12.1 | 84563758 | rs10511089 | 0.1095 | 589 | 140 | 11 | 1664 | 1.79 x 10^-9 | 6.57 x 10^-7 |
8q23.1 | 108858992 | rs4534106 | 0.3007 | 367 | 301 | 72 | 1984 | 2.27 x 10^-9 | 1.00 x 10^-6 |
6q12 | 67705937 | rs11970254 | 0.3464 | 293 | 358 | 72 | 1024 | 2.29 x 10^-9 | 6.21 x 10^-7 | BG436399
9p13.1 | 38030095 | rs7025303 | 0.4061 | 263 | 347 | 124 | 768 | 2.30 x 10^-9 | 1.21 x 10^-6 | SHB
1p36.13 | 19441559 | rs710865 | 0.3824 | 277 | 354 | 103 | 256 | 2.65 x 10^-9 | 1.10 x 10^-7 | KIAA0090, MRT04, AKR7L
9p13.1 | 38031142 | rs7873102 | 0.3821 | 283 | 340 | 109 | 832 | 2.96 x 10^-9 | 6.42 x 10^-7 | SHB
20p12.1 | 12822585 | rs2073233 | 0.4291 | 234 | 369 | 131 | 2560 | 3.17 x 10^-9 | 1.42 x 10^-6 | BC036700
2q37.3 | 242151629 | rs12479254 | 0.4049 | 255 | 341 | 119 | 1408 | 3.88 x 10^-9 | 5.70 x 10^-7 | BOK, THAP4
16p12.1 | 24439219 | rs11643520 | 0.1160 | 574 | 146 | 11 | 1920 | 4.39 x 10^-9 | 6.06 x 10^-7 | RBBP6
5p12 | 44222425 | rs4296809 | 0.1448 | 539 | 177 | 17 | 12736 | 4.41 x 10^-9 | 8.75 x 10^-7 | BG334794
13q32.2 | 97764318 | rs688872 | 0.3804 | 283 | 345 | 106 | 3392 | 4.68 x 10^-9 | 1.06 x 10^-6 | FARP1
14q22.1 | 51080549 | rs7140150 | 0.4566 | 219 | 353 | 160 | 1856 | 5.78 x 10^-9 | 4.77 x 10^-7 | FRMD6
6p12.3 | 49596867 | rs9473582 | 0.3973 | 274 | 339 | 121 | 4416 | 5.98 x 10^-9 | 8.25 x 10^-7 | GLYATL3
CSMD2: Highest expression in the brain, oligodendroglioma suppressor. Associated with ADHD and addiction.
CADPS2: Regulates synaptic and large dense core vesicle priming in neurons, associations to autism.
(Stein et al., 2010)
Most associated voxels for most
associated SNPs
(Stein et al., 2010)
Meff Estimation
Meff << M
(Stein et al., 2010)
How well do results fit distributions?
Raw P-value distribution
Corrected P-value distribution
(Stein et al., 2010)
Significance Testing through FDR
and pFDR
q-value = 0.25 for SNP rs2132683
(Stein et al., 2010)
Advantages/Disadvantages
Advantages
• Able to jointly search the genome and imaging space to answer the question “where in the genome and where in the brain”
• An unbiased approach to discovery
• Has some data reduction
Disadvantages
• A strong association of one voxel to one SNP is hard to interpret; we’re more interested in how a SNP affects many parts of the brain
• Computationally intensive process (several days of processing) with a huge number of statistical tests
• Selecting only the minimum P-value means that we lose a lot of information
Imaging Genetics Menu
Imaging
Candidate ROI
Many ROI
Voxelwise
Genetics
CLU & DTI
Candidate SNP
ZNF804A &
functional
connectivity
Candidate Gene
SORL1 &
hippocampal
volume
Genome-wide SNP
vGWAS
Genome-wide Gene
vGeneWAS
(1) Use Tensor Based Morphometry based volume differences as phenotype at each voxel
(2) GeneWAS at each voxel, select minimum P-value
(3) Meff calculation through permutation and then estimating beta parameter of the Beta distribution
(4) CDF of Beta(1,Meff) transformation
(5) FDR
(Hibar et al., 2011)
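Steps (3) and (4) can be sketched as follows, assuming NumPy. The slide says only that the Beta parameter is estimated from permutations; the maximum-likelihood estimator used here, b = -n / sum(log(1 - x)) for Beta(1, b), is one reasonable choice and an assumption. The function names and the simulated "permutation" output are hypothetical.

```python
import numpy as np


def estimate_meff(null_min_p):
    """Maximum-likelihood b for Beta(1, b) fitted to null minimum P-values."""
    x = np.asarray(null_min_p, dtype=float)
    return float(-x.size / np.sum(np.log1p(-x)))


def correct_min_p(min_p, m_eff):
    """CDF of Beta(1, m_eff) at each minimum P-value: 1 - (1 - p)**m_eff."""
    p = np.asarray(min_p, dtype=float)
    return -np.expm1(m_eff * np.log1p(-p))


rng = np.random.default_rng(0)
n_perm, n_genes = 2000, 1000
# Stand-in for permutation output: each row holds one permuted run's
# P-values across genes. They are independent uniforms here, so the
# estimate should land near n_genes; correlated tests would give less.
null_min_p = rng.uniform(size=(n_perm, n_genes)).min(axis=1)

m_eff = estimate_meff(null_min_p)
corrected = correct_min_p(null_min_p, m_eff)
print(round(m_eff), float(corrected.mean()))
```

After the transformation the null minimum P-values are approximately uniform again, which is what the FDR step in (5) expects.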
Power of vGeneWAS
vGeneWAS is more powerful than vGWAS in certain
circumstances
(Hibar et al., 2011)
Advantages/Disadvantages
Advantages
• Able to jointly search the genome and imaging space to answer the question “where in the genome and where in the brain”
• An unbiased approach to discovery, grouping by functional unit
• Has a small amount of data reduction because we group by gene, and is more powerful than vGWAS, depending on the effect
Disadvantages
• A strong association of one voxel to one gene is hard to interpret; we’re more interested in how a gene affects many parts of the brain
• Computationally intensive process (several days of processing) with a huge number of statistical tests
• Selecting only the minimum P-value means that we lose a lot of information about other genes
Imaging Genetics Menu
Imaging
Candidate ROI
Many ROI
Voxelwise
Genetics
CLU & DTI
Candidate SNP
ZNF804A &
functional
connectivity
Candidate Gene
SORL1 &
hippocampal
volume
Genome-wide SNP
vGWAS
Genome-wide Gene
vGeneWAS
Replication through collaboration
http://enigma.loni.ucla.edu
Useful web resources
UCSC genome browser: http://genome.ucsc.edu/cgi-bin/hgGateway
Genome visualization magic.
Hapmap: http://hapmap.ncbi.nlm.nih.gov/
Allele frequencies in multiple populations.
Allen Brain Atlas: http://www.brain-map.org/
See where a gene is expressed.
Entrez Gene: http://www.ncbi.nlm.nih.gov/gene/
See the gene ontology (what it does).
dbSNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=snp
The database of every documented genetic variation.
Plink: http://pngu.mgh.harvard.edu/~purcell/plink/
Incredibly useful tool for genome-wide analysis, organization, etc. Excellent
documentation.
dbGaP: http://www.ncbi.nlm.nih.gov/gap/
Database of genotypes and phenotypes.
Acknowledgements
LONI (UCLA)
Paul Thompson (Advisor)
Derrek P. Hibar
Neda Jahanshad
Christina Boyle
Xue Hua
Meredith Braskie
ADNI Genetics Core
(Indiana University)
Andrew Saykin
Li Shen
Tatiana Foroud
Nathan Pankratz
NeuroImaging Training Program Training Grant
NIH/NIDA 1-T90-DA022768:02
ARCS Scholar
Eskin Lab (UCLA)
Jae Hoon Sul
Hyun Min Kang
Eleazar Eskin
QTwin (Australia)
Sarah Medland
Margie Wright
Katie McMahon
Nick Martin
Greig de Zubicaray
UCLA Affiliates Scholarship
Dr. Ursula Mandel Scholarship
Pre-doctoral NRSA 1F31MH087061