Population Structure, Association
Studies, and QTLs
Stat 115/215
Structure Algorithm
• One of the most widely used programs in
population genetics (original paper cited >9,000
times since 2000)
– Pritchard, Stephens and Donnelly (2000). Inference of
Population Structure Using Multilocus Genotype Data,
Genetics. 155:945-959.
• Very flexible model can determine:
– The most likely number of uniform groups
(populations, K)
– The genomic composition of each individual
(admixture coefficients)
– Possible population of origin
2
A simple model of population
structure
• Individuals in our sample represent a
mixture of K (unknown) ancestral
populations.
• Each population is characterized by
(unknown) allele frequencies at each locus.
• Within populations, markers are in Hardy-Weinberg and linkage equilibrium.
3
The model
• Let A1, A2, …, AK represent the (unknown) allele
frequencies in each subpopulation
• Let Z1, Z2, … , Zm represent the (unknown)
subpopulation of origin of the sampled individuals –
they are indicators
• Assuming HWE and LE within subpopulations, the
likelihood of an individual’s genotypes at various
loci in subpopulation k is given by the product of
the relevant allele frequencies:
$\Pr(G_i \mid Z_i = k, A_k) = \prod_{l} P_l$, where $P_l$ is the genotype probability at locus $l$
4
More details
• Probability of observing a genotype at locus
l by chance in a population is a function of
allele frequencies:
– $P_l = p_i^2$ for homozygous loci
– $P_l = 2 p_i p_j$ for heterozygous loci
• Assuming no linkage among the markers,
we have the product form as on the previous
slide.
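The genotype probabilities above can be sketched in a few lines of Python. This is only an illustration: the function names, the allele labels, and the example frequencies are assumptions made up for this sketch, not part of the Structure program.

```python
import math

def genotype_prob(genotype, freqs):
    """HWE probability of one diploid genotype at a single locus.

    genotype: pair of allele labels, e.g. ("A", "a")
    freqs: dict mapping allele label -> frequency in the population
    """
    i, j = genotype
    if i == j:
        return freqs[i] ** 2           # homozygous locus: p_i^2
    return 2.0 * freqs[i] * freqs[j]   # heterozygous locus: 2 p_i p_j

def log_likelihood(genotypes, freqs_by_locus):
    """Log-probability of a multilocus genotype, assuming unlinked loci,
    so the per-locus probabilities multiply (their logs add)."""
    return sum(math.log(genotype_prob(g, f))
               for g, f in zip(genotypes, freqs_by_locus))

# Hypothetical population with two loci.
freqs = [{"A": 0.7, "a": 0.3}, {"B": 0.5, "b": 0.5}]
individual = [("A", "a"), ("B", "B")]
loglik = log_likelihood(individual, freqs)
```

Working on the log scale avoids numerical underflow when many loci are multiplied together.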
5
Heuristics
• If we knew the population allele frequencies
in advance, then it would be easy to assign
individuals (using Bayes' rule).
$\Pr(Z_i = k \mid G_i, A_1, \dots, A_K) = \frac{P(G_i \mid Z_i = k, A)\, P(Z_i = k \mid A)}{\sum_{j=1}^{K} P(G_i \mid Z_i = j, A)\, P(Z_i = j \mid A)}$
• If we knew the individual assignments, it
would be easy to estimate frequencies.
• In practice, we don’t know either of these,
but we have the Gibbs sampler!
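The first heuristic (known allele frequencies, unknown assignment) can be sketched as follows. The sketch assumes biallelic markers coded as counts 0/1/2 of one allele, HWE and LE within populations, and a uniform prior over populations unless one is supplied; the function name and the example frequencies are hypothetical.

```python
import numpy as np

def assignment_posterior(genotype, allele_freqs, prior=None):
    """Posterior Pr(Z = k | G, A) for one individual.

    genotype: length-L array of allele counts (0, 1 or 2) per locus
    allele_freqs: K x L array, frequency of the counted allele in each
                  population (assumed strictly between 0 and 1)
    prior: length-K prior over populations (uniform if omitted)
    """
    g = np.asarray(genotype)
    p = np.asarray(allele_freqs, dtype=float)
    K = p.shape[0]
    prior = np.full(K, 1.0 / K) if prior is None else np.asarray(prior, dtype=float)
    # Per-locus HWE probabilities: p^2, 2p(1-p), (1-p)^2, on the log scale.
    coef = np.where(g == 1, 2.0, 1.0)
    log_lik = (np.log(coef) + g * np.log(p) + (2 - g) * np.log(1 - p)).sum(axis=1)
    log_post = np.log(prior) + log_lik
    log_post -= log_post.max()          # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum()            # denominator of Bayes' rule

# Two hypothetical populations, two loci, individual homozygous at both.
post = assignment_posterior([2, 2], [[0.9, 0.8], [0.1, 0.2]])
```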
6
MCMC algorithm (for fixed K)
• Start with random assignment of individuals
to populations
– Step 1: Gene frequencies in each population are
estimated based on the individuals that are
assigned to it.
– Step 2: Individuals are assigned to populations
based on gene frequencies in each population.
• And this is repeated...
• Estimation of K performed separately
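A minimal sketch of the two alternating steps for fixed K is given below. It is not the Structure implementation: it assumes the no-admixture model, biallelic markers coded 0/1/2, a Beta(1, 1) prior on each allele frequency, and a uniform prior on population membership. The function name and the simulated data are hypothetical.

```python
import numpy as np

def gibbs_structure(G, K, n_iter=200, seed=0):
    """Toy Gibbs sampler: G is an n x L matrix of allele counts (0/1/2)."""
    rng = np.random.default_rng(seed)
    G = np.asarray(G)
    n, L = G.shape
    z = rng.integers(K, size=n)                  # random initial assignment
    for _ in range(n_iter):
        # Step 1: sample allele frequencies given the current assignments.
        P = np.empty((K, L))
        for k in range(K):
            members = G[z == k]
            ref = members.sum(axis=0)            # copies of the counted allele
            alt = 2 * len(members) - ref         # copies of the other allele
            P[k] = rng.beta(1 + ref, 1 + alt)    # Beta posterior under Beta(1,1) prior
        P = np.clip(P, 1e-6, 1 - 1e-6)           # guard against log(0)
        # Step 2: sample assignments given frequencies (Bayes' rule).
        # The binomial coefficient is the same for every k, so it is omitted.
        log_lik = G @ np.log(P).T + (2 - G) @ np.log(1 - P).T   # n x K
        log_lik -= log_lik.max(axis=1, keepdims=True)
        post = np.exp(log_lik)
        post /= post.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(K, p=row) for row in post])
    return z, P

# Simulated example: two strongly differentiated populations, 100 loci.
rng = np.random.default_rng(1)
G = np.vstack([rng.binomial(2, 0.95, size=(20, 100)),
               rng.binomial(2, 0.05, size=(20, 100))])
z, P = gibbs_structure(G, K=2)
```

Cluster labels are arbitrary (label switching), so only the grouping of individuals is meaningful, not which group is called 0 or 1. A real analysis would also discard burn-in iterations and average over the retained samples rather than use the final draw.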
7
Admixed individuals are mosaics
of ancestral populations
8
Two basic models
9
Inferred from human populations
10
More details
11
12
Alternative approach
• Structure is very computationally intensive
• Often no clear best-supported K-value
• Alternative is to use traditional multivariate
statistics to find uniform groups
• Principal Components Analysis is the most
commonly used algorithm
• EIGENSOFT (PCA, Patterson et al., 2006;
PLoS Genetics 2:e190)
13
Principal Component Analysis
• Efficient way to summarize multivariate data like
genotypes
• Each axis passes through maximum variation in
data, explains a component of the variation
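As a rough illustration, PCA on a genotype matrix can be sketched with a singular value decomposition. The normalisation used here (centre each marker at twice its allele frequency and scale by the binomial standard deviation) is one common choice and is an assumption of this sketch, as are the function name and the simulated data; it is not a reimplementation of EIGENSOFT.

```python
import numpy as np

def genotype_pca(G, n_components=2):
    """PCA of an n x L matrix of allele counts (0/1/2).

    Returns the PC scores (n x n_components) and the fraction of
    variance explained by each returned component.
    """
    G = np.asarray(G, dtype=float)
    p = G.mean(axis=0) / 2.0                     # pooled allele frequencies
    keep = (p > 0) & (p < 1)                     # drop monomorphic markers
    X = (G[:, keep] - 2 * p[keep]) / np.sqrt(2 * p[keep] * (1 - p[keep]))
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = S ** 2 / np.sum(S ** 2)
    return scores, explained[:n_components]

# Simulated example: two populations with very different allele frequencies.
rng = np.random.default_rng(0)
G = np.vstack([rng.binomial(2, 0.9, size=(15, 100)),
               rng.binomial(2, 0.1, size=(15, 100))])
scores, explained = genotype_pca(G)
```

In this simulated case the first axis captures the between-population difference, so the two groups fall on opposite sides of zero on PC1.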
14
Human population assignment with SNPs
• Assayed 500,000 SNP genotypes for 3,192 Europeans
• Used Principal Components Analysis to ordinate samples in space
• High correspondence between sample ordination and geographic
origin of samples
Individuals assigned to
populations of origin with
high accuracy
15
Genetic Association Tests
• Review of typical approach: chi-square test
– 2x3 table (or 2x2 table)
            AA     Aa     aa     Total
Cases       n11    n12    n13    n1.
Controls    n21    n22    n23    n2.
Total       n.1    n.2    n.3    n..

            A      a      Total
Cases       n11    n12    n1.
Controls    n21    n22    n2.
Total       n.1    n.2    n..
– Alternatively, we can do a logistic regression:
$\log \frac{P(Y = 1)}{P(Y = 0)} = a + bX$
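The logistic regression can be fitted by Newton-Raphson (iteratively reweighted least squares). The sketch below assumes a single binary predictor X (for example, an indicator for carrying a given allele) and case status Y; the function name and the counts are made up for illustration. With a binary X, the fitted slope b equals the log odds ratio of the corresponding 2x2 table.

```python
import numpy as np

def fit_logistic(x, y, n_iter=25):
    """Fit log[P(Y=1)/P(Y=0)] = a + b*x by Newton-Raphson; returns (a, b)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(x)), x])    # intercept and predictor
    beta = np.zeros(2)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))     # fitted P(Y = 1)
        W = mu * (1.0 - mu)                      # variance weights
        grad = X.T @ (y - mu)                    # score vector
        hess = X.T @ (X * W[:, None])            # information matrix
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Hypothetical data: 50 cases (30 with X=1), 50 controls (10 with X=1).
x = [1] * 30 + [0] * 20 + [1] * 10 + [0] * 40
y = [1] * 50 + [0] * 50
a, b = fit_logistic(x, y)
```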
16
Genetic Models and
Underlying Hypotheses
• Genotypic Model
Genotype           AA     Aa     aa
Genotypic Value    μAA    μAa    μaa
Genotypic value is the expected phenotypic value of a particular genotype.
• Hypothesis: all 3 genotypes have
different effects
AA vs. Aa vs. aa
Genetic Models and
Underlying Hypotheses
• Dominant Model
Genotype           AA     Aa     aa
Genotypic Value    μA-    μA-    μaa
• Hypothesis: the genetic effects of AA and Aa
are the same (assuming A is the minor allele)
AA and Aa vs. aa
Genetic Models and
Underlying Hypotheses
• Recessive Model
Genotype           AA     Aa     aa
Genotypic Value    μAA    μa-    μa-
• Hypothesis: the genetic effects of Aa
and aa are the same (A is the minor
allele)
AA vs. Aa and aa
Genetic Models and
Underlying Hypotheses
• Allelic Model
Genotype           AA     Aa         aa
Genotypic Value    2μA    μA + μa    2μa
• Hypothesis: the genetic effects of allele A
and allele a are different
A vs. a
Pearson’s Chi-squared Test
• Genotypic Model:
– Null hypothesis: independence, $H_0: \pi_{ij} = \pi_{i\cdot}\,\pi_{\cdot j}$
            AA     Aa     aa
Cases       nAA    nAa    naa
Controls    mAA    mAa    maa
df = 2
Pearson’s Chi-squared Test
• Dominant Model:
– Null hypothesis: independence, $H_0: \pi_{ij} = \pi_{i\cdot}\,\pi_{\cdot j}$
            AA+Aa        aa
Cases       nAA + nAa    naa
Controls    mAA + mAa    maa
df = 1
Pearson’s Chi-squared Test
• Recessive Model:
– Null hypothesis: independence, $H_0: \pi_{ij} = \pi_{i\cdot}\,\pi_{\cdot j}$
            AA     Aa+aa
Cases       nAA    nAa + naa
Controls    mAA    mAa + maa
df = 1
Pearson’s Chi-squared Test
• Allelic Model:
– Null hypothesis: independence, $H_0: \pi_{ij} = \pi_{i\cdot}\,\pi_{\cdot j}$
            A             a
Cases       2nAA + nAa    nAa + 2naa
Controls    2mAA + mAa    mAa + 2maa
df = 1
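The four contingency tables can all be derived from the same six genotype counts. The helper below is a small sketch of that bookkeeping; the function names and the example counts are hypothetical.

```python
def model_tables(cases, controls):
    """Build the contingency table for each genetic model.

    cases, controls: (AA, Aa, aa) genotype counts.
    Each table has cases in the first row and controls in the second.
    """
    nAA, nAa, naa = cases
    mAA, mAa, maa = controls
    return {
        "genotypic": [[nAA, nAa, naa], [mAA, mAa, maa]],
        "dominant":  [[nAA + nAa, naa], [mAA + mAa, maa]],
        "recessive": [[nAA, nAa + naa], [mAA, mAa + maa]],
        # Allelic model counts alleles, so each person contributes two.
        "allelic":   [[2 * nAA + nAa, nAa + 2 * naa],
                      [2 * mAA + mAa, mAa + 2 * maa]],
    }

def degrees_of_freedom(table):
    """(rows - 1) * (columns - 1) for a test of independence."""
    return (len(table) - 1) * (len(table[0]) - 1)

# Hypothetical counts: 100 cases and 100 controls.
tables = model_tables(cases=(30, 50, 20), controls=(20, 50, 30))
```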
Test Statistic
• Chi-squared test statistic:
$\chi^2 = \sum_{\text{all cells}} \frac{(O - E)^2}{E}$
– O is the observed cell count
– E is the expected cell count under the null
hypothesis of independence:
$E = \frac{\text{row total} \times \text{column total}}{N}$
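A direct transcription of the statistic into Python might look like the sketch below. The function names and the example table are assumptions for illustration. The p-value helper only covers df = 1 and df = 2, where the chi-squared tail probability has a simple closed form; other degrees of freedom would need a statistics library.

```python
import math

def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for a table
    given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n   # E under independence
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

def chi_squared_pvalue(stat, df):
    """Upper-tail probability, closed form for df = 1 or 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        return math.exp(-stat / 2.0)
    raise ValueError("closed form implemented only for df = 1 or 2")

# Hypothetical allelic table: rows are cases/controls, columns are A/a.
stat, df = chi_squared([[110, 90], [90, 110]])
p_value = chi_squared_pvalue(stat, df)
```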
Other Options
• Fisher’s Exact Test:
When the sample size is small, the asymptotic approximation to the
null distribution is no longer reliable. Fisher’s exact test
calculates the exact significance of the deviation from the null
hypothesis.
For a 2 by 2 table, the exact p-value can be calculated from the cell counts:
a    b
c    d
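Under the null hypothesis with fixed margins, the probability of a 2 by 2 table with cells a, b, c, d follows the hypergeometric distribution, $\binom{a+b}{a}\binom{c+d}{c} / \binom{n}{a+c}$ with $n = a+b+c+d$. The sketch below sums this probability over all tables that are no more likely than the observed one, which is one common two-sided convention (an assumption of this sketch; other conventions exist). The function names and the example counts are hypothetical.

```python
from math import comb

def table_prob(a, b, c, d):
    """Hypergeometric probability of one 2x2 table with fixed margins."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided p-value: total probability of all tables with the same
    margins that are at most as probable as the observed table."""
    r1, r2, c1 = a + b, c + d, a + c             # fixed row/column totals
    p_obs = table_prob(a, b, c, d)
    p_value = 0.0
    for x in range(max(0, c1 - r2), min(r1, c1) + 1):
        p = table_prob(x, r1 - x, c1 - x, r2 - (c1 - x))
        if p <= p_obs * (1 + 1e-9):              # tolerance for float ties
            p_value += p
    return p_value

# Small hypothetical table.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```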
Association Tool
• PLINK:
http://pngu.mgh.harvard.edu/~purcell/plink/
– Case-control, TDT, quantitative traits.
27
Mapping Quantitative Traits
• Examples: weight, height, blood pressure,
BMI, mRNA expression of a gene, etc.
• Example: F2 intercross mice
28
Quantitative traits (phenotypes)
133 females from our earlier (NOD × B6) × (NOD × B6) cross
Trait 4 is the log count of a particular white blood cell type.
29
Another representation of a trait distribution
30
Note the equivalent of dominance in our trait distributions.
A second example
31
Note the approximate additivity in our trait distributions here.
Trait distributions:
a classical view
In general, we look for a difference
in the phenotype distributions
of the parental strains before
deciding that a search for genes
associated with a trait is worthwhile.
But even if there is little
difference, there may be many
such genes. Our trait 4 is a case
like this.
32
Data and goals
Data
Phenotypes: yi = trait value for mouse i
Genotypes: xij = 1/0 if mouse i is A/H at marker j (backcross);
need two dummy variables for intercross
Genetic map: Locations of markers
Goals
• Identify the genomic region (or at least one), called a quantitative
trait locus (QTL), that contributes to variation in the trait
• Form confidence intervals for the QTL location
• Estimate QTL effects
Models: Genotype → Phenotype
• Let y = phenotype,
g = whole genome genotype
• Imagine a small number of QTLs with genotypes
g1,…, gp ($2^p$ or $3^p$ distinct genotypes for BC, IC resp.).
• We assume
$E(y \mid g) = \mu(g_1, \dots, g_p)$, $\mathrm{var}(y \mid g) = \sigma^2(g_1, \dots, g_p)$
34
Models: Genotype → Phenotype, ctd
• Homoscedasticity (constant variance):
$\sigma^2(g_1, \dots, g_p) = \sigma^2$ (constant)
• Normality of residual variation:
$y \mid g \sim N(\mu_g, \sigma^2)$
• Additivity:
$\mu(g_1, \dots, g_p) = \mu + \sum_j \Delta_j g_j$ ($g_j = 0/1$ for BC)
• Epistasis: any deviation from additivity.
35
Additivity, or non-additivity (BC)
36
Additivity or non-additivity: F2
37
The simplest method: ANOVA
• Split mice into groups
according to genotype
at a marker
• Do a t-test/ANOVA
• Repeat for each marker
• Adjust for multiplicity
LOD score = log10 likelihood ratio, comparing single-QTL
model to the “no QTL anywhere” model.
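For a single marker with fully observed genotypes and normal residuals with a common variance, the likelihood ratio reduces to a ratio of residual sums of squares, so the LOD score is $(n/2)\log_{10}(\mathrm{RSS}_0/\mathrm{RSS}_1)$. The sketch below computes it for one marker; the function name and the toy data are hypothetical, and missing genotypes are not handled.

```python
import numpy as np

def marker_lod(y, genotype):
    """LOD score at one marker: genotype-group means vs. a single mean."""
    y = np.asarray(y, dtype=float)
    genotype = np.asarray(genotype)
    n = len(y)
    rss0 = np.sum((y - y.mean()) ** 2)           # "no QTL" model
    rss1 = 0.0                                   # one mean per genotype group
    for g in np.unique(genotype):
        group = y[genotype == g]
        rss1 += np.sum((group - group.mean()) ** 2)
    return (n / 2.0) * np.log10(rss0 / rss1)

# Toy backcross data: four mice, one marker coded 0/1.
lod = marker_lod([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1])
```

To scan the genome, the same function would be applied to each marker column in turn, followed by an adjustment for multiple testing.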
38
Interval mapping (IM)
• Lander & Botstein (1989)
• Take account of missing genotype data (uses the HMM)
• Interpolates between markers
• Maximum likelihood under a mixture model
39
Interval mapping, cont
• Imagine that there is a single QTL, at position z between two
(flanking) markers
• Let qi = genotype of mouse i at the QTL, and assume
$y_i \mid q_i \sim \mathrm{Normal}(\mu_{q_i}, \sigma^2)$
• We won’t know qi, but we can calculate
$p_{ig} = \Pr(q_i = g \mid \text{marker data})$
• Then, yi, given the marker data, follows a mixture of normal
distributions, with known mixing proportions (the $p_{ig}$).
• Use an EM algorithm to get MLEs of $\theta = (\mu_A, \mu_H, \mu_B, \sigma)$.
• Measure the evidence for a QTL via the LOD score, which is the log10
likelihood ratio comparing the hypothesis of a single QTL at position z
to the hypothesis of no QTL anywhere.
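The EM step at one putative QTL position can be sketched as follows, taking the genotype probabilities $p_{ig}$ as given (in practice they come from the flanking markers). This is a minimal illustration under the stated normal mixture model, not the Lander and Botstein implementation; the function names and the toy data are hypothetical.

```python
import numpy as np

def normal_pdf(y, mu, sigma):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_interval_mapping(y, probs, n_iter=100):
    """EM for a normal mixture with known mixing proportions.

    y: length-n phenotypes; probs: n x G matrix of p_ig.
    Returns (mu, sigma, lod) at this position.
    """
    y = np.asarray(y, dtype=float)
    probs = np.asarray(probs, dtype=float)
    n, G = probs.shape
    mu = np.full(G, y.mean())
    sigma = y.std()
    for _ in range(n_iter):
        # E-step: posterior genotype weights given phenotype and markers.
        w = probs * normal_pdf(y[:, None], mu, sigma)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: weighted means and a pooled standard deviation.
        mu = (w * y[:, None]).sum(axis=0) / w.sum(axis=0)
        sigma = np.sqrt((w * (y[:, None] - mu) ** 2).sum() / n)
    # LOD: log10 likelihood of the mixture vs. a single normal (no QTL).
    ll_qtl = np.log10((probs * normal_pdf(y[:, None], mu, sigma)).sum(axis=1)).sum()
    ll_null = np.log10(normal_pdf(y, y.mean(), y.std())).sum()
    return mu, sigma, ll_qtl - ll_null

# Toy backcross example with partially informative genotype probabilities.
y = [1.0, 2.0, 3.0, 4.0, 2.5]
probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.5, 0.5]]
mu, sigma, lod = em_interval_mapping(y, probs)
```

When the genotype probabilities are all 0 or 1 (the position coincides with a fully typed marker), the estimates reduce to the genotype-group means and the LOD score matches the single-marker ANOVA value.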
40
Epistasis, interactions, etc
• How to find interactions?
– Stepwise regression
– BEAM (Zhang and Liu 2007)
41
Naïve Bayes model
[Figure: naïve Bayes graph with response node Y and predictor nodes X1, X2, X3, …, Xm]
42
Augmented Naïve Bayes
[Figure: augmented naïve Bayes graph with response node Y and predictor groups: Group 0 (X01, X02), Group 1 (X11, X12, X13), Group 21 (X2.11, X2.12, X2.13), Group 22 (X2.21, X2.22)]
43
Variable Selection with Interaction
Let $Y \in \mathbb{R}$ be a univariate response variable and $X \in \mathbb{R}^p$
be a vector of p continuous predictor variables
$Y = X_1 \times X_2 + \epsilon$, $\epsilon \sim N(0, \sigma^2)$, $X \sim \mathrm{MVN}(0, I_p)$
Suppose p = 1000. How to find $X_1$ and $X_2$?
One-step forward selection: ~500,000 interaction terms
Is there any marginal relationship between Y and $X_1$?
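A quick simulation illustrates both points: the number of pairwise interaction terms, and the absence of a marginal (linear) relationship between Y and X1 under this model. The sample size, the choice σ = 1, and the random seed are assumptions of this sketch.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 1000                       # n and sigma = 1 are assumed values
X = rng.standard_normal((n, p))         # X ~ MVN(0, I_p)
y = X[:, 0] * X[:, 1] + rng.standard_normal(n)

# Number of pairwise interaction terms a one-step search must consider.
n_pairs = math.comb(p, 2)

# Marginal correlation of Y with X1 vs. correlation with the true interaction.
marginal_corr = np.corrcoef(y, X[:, 0])[0, 1]
interaction_corr = np.corrcoef(y, X[:, 0] * X[:, 1])[0, 1]
```

Because E[Y | X1] = 0 under this model, the marginal correlation is near zero up to sampling noise, so screening on marginal effects alone would miss both X1 and X2.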
44
[Figure: plots of y against x1 and x2, annotated with $\hat\sigma^2_{(1)} = 2.24$, $\hat\sigma^2_{(2)} = 0.97$, $\hat\sigma^2_{(3)} = 0.42$]
Acknowledgment
• Terry Speed (some of the slides)
• Karl Broman (U of Wisconsin)
• Steven P. DiFazio (West Virginia U)
47