Statistical Issues in Genetic
Association Studies
Eleanor Feingold, Ph.D.
University of Pittsburgh
March, 2011
Underlying Principle of Genetic Mapping
People who have similar traits (phenotypes)
should have greater than expected sharing of
genetic material near the genes that influence
those traits.
Basic study designs for gene mapping
families: linkage analysis (or family-based association)
unrelated individuals: association analysis
semi-related individuals in an inbred population: ?
Association analysis (circa 2000)
1) Collect cases and controls.
2) Genotype everyone at a marker.
[Figure: cases and controls, each labeled with a genotype (AA, Aa, or aa)]
3) Test genotype/phenotype association.
            AA    Aa    aa
cases       65   133   202
controls    16    81   316
4) Call it a day and go out for a beer with your co-investigators.
GWAS Study circa 2010
1) Collect cases and controls.
2) Genotype everyone at a marker.
[Figure: cases and controls, each labeled with a genotype (AA, Aa, or aa)]
3) Test genotype/phenotype association.
            AA    Aa    aa
cases       65   133   202
controls    16    81   316
4) Call it a day and go out for a beer with your co-investigators.
Repeat 1,000,000 times!
So what’s the BIG DEAL?
Well, not much, until you get into
1) the complexities of array data, and
2) the real science of genetics.
One important genetic subtlety
Even in a GWAS study, we can’t test every variant in the
genome. So
1) at the design phase, we have to pick markers (SNPs) that
we hope will “cover” the genome as well as possible, and
2) at the testing phase, we do not expect that the marker we
are testing is actually the “causal variant” - we are
usually hoping (at best) that it is correlated with the true
causal genetic variable.
[Figure: two panels, each labeled “Gene in here somewhere”; the second panel shows the region after many generations.]
Within a population, genotypes at nearby SNPs are correlated due to
population history.
This correlation is called linkage disequilibrium.
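To make "correlated" concrete, here is a minimal sketch that computes the squared correlation (r²) between two SNPs from genotypes coded as 0/1/2 copies of one allele. The genotype vectors are invented for illustration, and using the genotype correlation as a stand-in for haplotype-based linkage disequilibrium is an assumption of this sketch.

```python
def genotype_r2(g1, g2):
    """Squared Pearson correlation between two SNPs coded as 0/1/2 allele counts."""
    m1 = sum(g1) / len(g1)
    m2 = sum(g2) / len(g2)
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    var1 = sum((a - m1) * (a - m1) for a in g1)
    var2 = sum((b - m2) * (b - m2) for b in g2)
    if var1 == 0 or var2 == 0:
        return 0.0  # a monomorphic SNP carries no information
    return cov * cov / (var1 * var2)

# Hypothetical genotypes for eight people at three SNPs.
snp_a = [0, 1, 2, 1, 0, 2, 1, 0]
snp_b = [0, 1, 2, 1, 0, 2, 1, 1]  # differs from snp_a in one person
snp_c = [2, 0, 1, 1, 0, 2, 0, 1]  # little relationship to snp_a

r2_ab = genotype_r2(snp_a, snp_b)
r2_ac = genotype_r2(snp_a, snp_c)
print(f"r2(a,b) = {r2_ab:.2f}, r2(a,c) = {r2_ac:.2f}")
```

A pair like snp_a/snp_b is the situation a tested marker relies on: if one of them is causal, testing the other still picks up most of the signal.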
“Tag” SNPs
Find a set of SNPs that captures most information at least cost.
How?
Find clusters of SNPs that are highly correlated and then choose one
representative from each cluster to genotype.
Easily available, relatively idiot-proof software exists (e.g. Tagger).
Caveat 1:
You need a database that knows lots of SNPs in your gene and has
genotyped them in a fair number of people in the population you are
studying (HapMap, SeattleSNPs).
Caveat 2:
Beware of overly-aggressive “tagging.”
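The cluster-and-pick-a-representative idea can be sketched as a greedy selection on pairwise r². This is only an illustration under stated assumptions: it is not Tagger's actual algorithm, the SNP names and genotypes are hypothetical, and the 0.8 threshold is just a commonly quoted default.

```python
def genotype_r2(g1, g2):
    """Squared correlation between two SNPs coded as 0/1/2 allele counts."""
    m1 = sum(g1) / len(g1)
    m2 = sum(g2) / len(g2)
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    var1 = sum((a - m1) * (a - m1) for a in g1)
    var2 = sum((b - m2) * (b - m2) for b in g2)
    return 0.0 if var1 == 0 or var2 == 0 else cov * cov / (var1 * var2)

def greedy_tag_snps(genotypes, threshold=0.8):
    """genotypes: dict of SNP name -> list of 0/1/2 codes. Returns chosen tag SNPs."""
    names = list(genotypes)
    # For each SNP, the set of SNPs (itself included) it captures at r2 >= threshold.
    captures = {
        a: {b for b in names
            if a == b or genotype_r2(genotypes[a], genotypes[b]) >= threshold}
        for a in names
    }
    untagged = set(names)
    tags = []
    while untagged:
        # Pick the SNP that captures the most still-untagged SNPs.
        best = max(names, key=lambda s: len(captures[s] & untagged))
        tags.append(best)
        untagged -= captures[best]
    return tags

# Hypothetical reference panel: rs_a/rs_b are highly correlated, rs_c/rs_d identical.
panel = {
    "rs_a": [0, 1, 2, 1, 0, 2, 1, 0],
    "rs_b": [0, 1, 2, 1, 0, 2, 1, 1],
    "rs_c": [2, 0, 1, 1, 0, 2, 0, 1],
    "rs_d": [2, 0, 1, 1, 0, 2, 0, 1],
}
tags = greedy_tag_snps(panel)
print(tags)
```

Lowering the threshold selects fewer tags at the price of weaker coverage, which is the overly aggressive tagging that Caveat 2 warns about.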
Conventional association vs. candidate gene sequencing
Candidate gene sequencing study
1) Expensive - fewer genes and fewer people, so lower power overall.
2) Find both common and rare variation.
3) Find functional variants.
GWAS (tag SNP) study
1) Cheaper - more genes and more people, so higher power.
2) Find only common variation.
3) Probably do not find functional variants.
GWAS Analysis
Genotype calling
Data cleaning
Single-SNP analysis
Other analyses
CNVs
Genotype “calling”
Generally done before you see the data.
But plenty of open questions about how to do it.
- best clustering methods?
- salvage data from messy clusters?
[Figure: intensity plot with three genotype clusters labeled AA, AB, and BB]
Data cleaning
Somewhat dependent on which chip you are using.
Throw out “bad” SNPs and “bad” samples.
(% of genotypes “called” for each person and each SNP)
Hardy-Weinberg testing
Relationship testing
Find major chromosomal anomalies
Look for population stratification
Look for signs of systematic problems (e.g. allele frequencies differ by sample
processing date).
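The first cleaning step (percent of genotypes called for each person and each SNP) can be sketched as follows. The sample names, calls, and the 0.5 cutoff are all hypothetical and chosen only to keep the toy example readable; real studies use much stricter thresholds that depend on the chip.

```python
def call_rates(genotypes):
    """genotypes: dict of sample -> list of calls, with None meaning 'no call'.
    Returns (per-sample call rates as a dict, per-SNP call rates as a list)."""
    samples = list(genotypes)
    n_snps = len(genotypes[samples[0]])
    per_sample = {
        s: sum(g is not None for g in genotypes[s]) / n_snps for s in samples
    }
    per_snp = [
        sum(genotypes[s][j] is not None for s in samples) / len(samples)
        for j in range(n_snps)
    ]
    return per_sample, per_snp

# Hypothetical data: four samples typed at four SNPs.
data = {
    "s1": ["AA", "Aa", "aa", "AA"],
    "s2": ["Aa", "Aa", None, "AA"],
    "s3": [None, None, None, "Aa"],
    "s4": ["aa", "AA", None, "aa"],
}
per_sample, per_snp = call_rates(data)

MIN_CALL_RATE = 0.5  # illustrative cutoff only
bad_samples = [s for s, rate in per_sample.items() if rate < MIN_CALL_RATE]
bad_snps = [j for j, rate in enumerate(per_snp) if rate < MIN_CALL_RATE]
print("bad samples:", bad_samples, "bad SNP indices:", bad_snps)
```

One design choice worth noting: the order of filtering matters, because dropping a bad sample changes every SNP's call rate and vice versa, so the two filters are often applied in alternation.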
Data cleaning examples
Plate effect on missing call rate per sample
ANOVA p-value = 6e-48
But no significant association between plate and case status (p=0.20)
Gender Check
chromosomal anomalies
Testing Hardy-Weinberg
Hardy-Weinberg Equilibrium (HWE) means that your three genotype
groups occur in the expected p², 2pq, q² proportions, where p and q = 1 - p
are the two allele frequencies.
Departure from HWE most often indicates genotyping problems.
But it can also indicate an actual genetic effect.
(Check for case-control differences).
Do your HWE tests by ethnicity, but don’t expect admixed groups
(Hispanics, African-Americans) to be in HWE.
[Figure panels: HWE 10⁻⁴ < p < 0.5; HWE p < 10⁻⁴]
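A minimal sketch of the HWE check for one SNP, using the 1-df chi-squared goodness-of-fit test. The genotype counts are invented, and an exact test is often preferred when counts are small; the chi-squared version is used here only because it is short.

```python
import math

def hwe_chi2(n_AA, n_Aa, n_aa):
    """Chi-squared goodness-of-fit test of Hardy-Weinberg proportions (1 df).
    Returns (chi-squared statistic, p-value)."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # estimated frequency of allele A
    q = 1 - p
    if p == 0 or q == 0:
        raise ValueError("SNP is monomorphic; HWE test is undefined")
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-squared with 1 df, written with the complementary error function.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: the first SNP is exactly in HWE, the second has too few heterozygotes.
in_hwe = hwe_chi2(25, 50, 25)
out_of_hwe = hwe_chi2(40, 20, 40)
print(in_hwe, out_of_hwe)
```

Running the same function separately on cases and controls is one way to do the case-control comparison suggested above.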
population stratification via principal components
Analysis
Simple association test at every SNP
Case-control association test by allele ...
2 x 2 table (Fisher’s exact test or chi-squared test)
            A     a
case
control
And by genotype ...
2 x 3 table (Fisher’s exact test or chi-squared test or Armitage trend test)
            AA    Aa    aa
case
control
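As a rough sketch, the three chi-squared-type tests named here can be run on the genotype counts from the circa-2000 example slide (cases 65/133/202, controls 16/81/316). Fisher's exact test is omitted to keep the code short. The trend test uses scores 2/1/0 (copies of allele A), which is an assumption of this sketch, and the p-value helper only covers the 1-df and 2-df cases needed here.

```python
import math

def pearson_chi2(table):
    """Pearson chi-squared statistic for an r x c table of counts."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n
            stat += (obs - exp) ** 2 / exp
    return stat

def p_chi2(x, df):
    """Upper-tail chi-squared p-value; closed forms for 1 and 2 df only."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    raise ValueError("only df = 1 or 2 handled in this sketch")

def trend_chi2(cases, controls, scores=(2, 1, 0)):
    """Cochran-Armitage trend statistic (1 df) for AA, Aa, aa counts."""
    n_col = [r + s for r, s in zip(cases, controls)]
    R, S = sum(cases), sum(controls)
    N = R + S
    swr = sum(w * r for w, r in zip(scores, cases))
    swn = sum(w * n for w, n in zip(scores, n_col))
    sw2n = sum(w * w * n for w, n in zip(scores, n_col))
    return N * (N * swr - R * swn) ** 2 / (R * S * (N * sw2n - swn ** 2))

cases = (65, 133, 202)    # AA, Aa, aa
controls = (16, 81, 316)

# Allele counts: each person contributes two alleles.
allele_table = [
    [2 * cases[0] + cases[1], cases[1] + 2 * cases[2]],
    [2 * controls[0] + controls[1], controls[1] + 2 * controls[2]],
]
allele = pearson_chi2(allele_table)         # 2 x 2 table, 1 df
genotype = pearson_chi2([cases, controls])  # 2 x 3 table, 2 df
trend = trend_chi2(cases, controls)         # 1 df

for name, stat, df in [("allele", allele, 1), ("genotype", genotype, 2), ("trend", trend, 1)]:
    print(f"{name}: chi2 = {stat:.2f}, p = {p_chi2(stat, df):.2e}")
```

Note that the allele test counts each person twice, so it leans on Hardy-Weinberg proportions in a way the trend test does not.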
Or use logistic regression
Lets you incorporate other predictors (age, sex, diet,
whatever).
G + E (genotype + environment model)
G + E + GxE (interaction model)
GWAS results
Manhattan plot
and
qq plot
What’s the best single-SNP
association test?
Not as “solved” a problem as you’d think.
If you knew the true model for the gene effect,
you’d just fit that model. But you don’t.
So which tests are robust over lots of models?
Chia-Ling Kuo’s work
[Diagram of the tests compared:
2x3 table: independence chi-squared (2DF TEST); trend tests with genotype scores 001 (REC), 012 (ADD), 011 (DOM)
2x2 table: independence chi-squared (ALLELE)
Combination statistics: MIN 2P, MIN 3P, MIN 4P]
Scan with Covariates
• Which logistic regression model is best for testing GENETIC EFFECT?
– G: LR(G, NULL) ~ χ²(1)
– G+E: LR(G+E, E) ~ χ²(1)
– G+E+GE: LR(G+E+GE, E) ~ χ²(2)
Results
1) Combination statistics (best of several statistics) are
most robust, even after correction for multiple
comparisons, but the linear trend test is also a good choice.
2) To test for genetic effect, the G + E model is almost never
advantageous. Just test G, or fit G + E + GxE if you’re
pretty sure there’s an interaction. BIG CAVEAT: This
assumes G and E are independent – if you are worried
about confounding, you DO need to control for E when
testing G.
More generally, should you use the same
statistics you used for a small-scale study?
Maybe not.
Problem
Need to worry about
the statistical properties
of the extreme values of
the test statistics.
What do I mean?
• Statisticians develop tests
that behave sensibly on average.
• But in genomic problems, we
do 10,000 or 500,000 of the
same test and then follow up the
top 100 results.
• So we need test statistics for which
the extreme values are well-behaved,
not so much the averages.
Example from expression arrays:
“10,000 t-tests” analysis
• Compute t-statistic for each gene.
• Rank by absolute value of t-statistic.
Problem
Ranked list is dominated
by small-variance genes.
With a small sample size,
the SE estimates are
very poor.
If you estimate an SE poorly
10,000 times, some of the
estimates will come out very
small.
$t = \dfrac{\bar{x} - \bar{y}}{\mathrm{SE}_{\text{difference}}}$
2 ways to get a
large t-statistic
1) large difference
between the means
2) small SE
Solution
Shrinkage estimator!
(Add a fudge factor to the denominator of the t-statistic.)
$t = \dfrac{\bar{x} - \bar{y}}{\mathrm{SE}_{\text{difference}} + a}$
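A small sketch of what the fudge factor does to a ranked list. The three genes and their expression values are invented, and choosing a as the median SE across genes is just one simple option assumed here, not a recommendation from the slides.

```python
import math
from statistics import mean, median, variance

def t_parts(x, y):
    """Return (difference in means, SE of the difference) for two groups."""
    diff = mean(x) - mean(y)
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return diff, se

def shrunken_t(x, y, a):
    """t-statistic with fudge factor a added to the denominator (a = 0 is the plain t)."""
    diff, se = t_parts(x, y)
    return diff / (se + a)

# Hypothetical expression values, three arrays per group.
genes = {
    "gene_tiny_se": ([5.10, 5.11, 5.12], [5.00, 5.01, 5.02]),  # tiny difference, tiny SE
    "gene_big_diff": ([8.0, 9.0, 10.0], [5.0, 6.0, 7.0]),      # large difference
    "gene_null": ([5.0, 6.0, 7.0], [5.5, 6.5, 6.9]),           # nothing going on
}

# Assumption: use the median SE across genes as the fudge factor.
a = median(t_parts(x, y)[1] for x, y in genes.values())

plain_rank = sorted(genes, key=lambda g: -abs(shrunken_t(*genes[g], a=0.0)))
shrunk_rank = sorted(genes, key=lambda g: -abs(shrunken_t(*genes[g], a=a)))
print("plain t ranking:   ", plain_rank)
print("shrunken t ranking:", shrunk_rank)
```

With the plain t-statistic the small-variance gene tops the list even though its difference in means is only 0.1; with the fudge factor the gene with the large difference moves to the top.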
Back to association studies ...
Whatever statistic you are using (1,000,000 times), you need
to know the statistical behavior of the 1st - 50th highest
order statistics, not the statistical behavior on average.
This issue has not really been dealt with in the association
study literature.
A few other open statistical issues
Multiple testing
The problem
If you do 1,000,000 tests, you will produce a lot of false positives.
The solution
There isn’t one!
• Be realistic about hypothesis generating
vs. hypothesis testing.
• False discovery rate - controls percent of
genes on list that are false.
• Permutation testing - controls for lots of
correlated tests.
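To put numbers on the problem, and to show one way the false discovery rate idea is usually implemented, here is a minimal sketch. The Benjamini-Hochberg procedure is swapped in as a standard FDR method (the slides do not name one), and the ten p-values are invented.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return indices of tests rejected at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * q, then reject the k smallest.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank
    return sorted(order[:k_max])

# With no true effects at all, testing 1,000,000 SNPs at alpha = 0.05
# is expected to give this many false positives:
n_tests = 1_000_000
alpha = 0.05
expected_false_positives = n_tests * alpha
bonferroni_threshold = alpha / n_tests

# Hypothetical p-values from ten tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
fdr_hits = benjamini_hochberg(p_values, q=0.05)
bonferroni_hits = [i for i, p in enumerate(p_values) if p <= 0.05 / len(p_values)]

print(expected_false_positives, bonferroni_threshold)
print("FDR hits:", fdr_hits, "Bonferroni hits:", bonferroni_hits)
```

Neither correction here accounts for correlation between tests, which is where permutation testing comes in.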
“Imputation” at untyped SNPs
The idea
Use the HapMap database to impute genotypes for your samples at all the SNPs
in between the ones you genotyped.
Do a test at each of those SNPs in addition to the typed ones.
Should increase overall study power even if multiple comparisons are
correctly controlled for.
[Figure: haplotypes with a typed SNP and an untyped SNP marked; “blue” at the typed SNP => “blue” at the untyped one as well]
“Imputation” at untyped SNPs
The best thing
Allows joint analyses of datasets that were genotyped with different chips!
Limitations
Only helpful if the correlation structure in HapMap is valid for your population.
Only helpful for SNPs in the database (contrast to haplotype analysis).
Open questions
• Best imputation methods in theory and practice?
• What populations should you base the imputation on?
• Imputed SNPs have different statistical properties (e.g. slightly higher
variance) – how do we account for that?
Meta-analysis
Typical GWAS papers now combine results from many
studies.
What are the best meta-analysis methods for doing this?
- What if same SNPs not typed in all studies?
- What if phenotype not measured the same way?
- What if some SNPs are imputed?
Software for genetic association
studies
PLINK is the primary tool. Bioinformatics is incorporated.
There are some useful R packages as well.
Need R for fancier analyses – typically integrate it with PLINK.
Lots of new stuff constantly under development for large-scale data
management and viewing – WGAViewer, LocusZoom
Lots of specialty packages for:
HWE
haplotype analysis
family association
other stuff