Population Stratification - Division of Statistical Genomics

Population Stratification
Qunyuan Zhang
Division of Statistical Genomics
GEMS Course M21-621
Computational Statistical Genetics
Mar. 24, 2011
https://dsgweb.wustl.edu/qunyuan/presentations/PopStrat2011.pptx
What is Population Stratification (PS)?
In the narrow sense
PS is the presence of a systematic difference in allele frequencies between subpopulations in a population, possibly due to different ancestry or origins, especially in the context of genetic association studies. Population stratification is also referred to as population structure.
In the broad sense
PS can be regarded as the presence of differences in relatedness between individuals in a population, due to different subpopulations, family/pedigree structure and/or cryptic relatedness.
PS & False Positives
False Positives (inflation)
An association can arise from the underlying structure of the population, even when there is no disease-locus association.
An Example of PS-caused False Positive
Sub-population 1
         case  control  total  risk (case/control)
A          72        8     80  9/1
a          18        2     20  9/1
total      90       10    100  9/1

Sub-population 2
         case  control  total  risk (case/control)
A           3       27     30  1/9
a           7       63     70  1/9
total      10       90    100  1/9

Mixed population
         case  control  total  risk (case/control)
A          75       35    110  2.14
a          25       65     90  0.38
total     100      100    200  1.00
• No disease-locus association within either sub-population.
• Risk differs between sub-populations.
• Allele frequency differs between sub-populations.
• False disease-locus association in the mixed population (any allele with higher frequency in the higher-risk sub-population appears to be a risk allele).
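The arithmetic behind this example can be checked with a short Python sketch. The counts are the ones in the tables above; the function names are illustrative only, and the allelic odds ratio is used here as one convenient association measure (the slide itself reports case/control ratios).

```python
# 2x2 tables from the slide: rows are alleles (A, a), columns are (case, control).
SUB1 = ((72, 8), (18, 2))
SUB2 = ((3, 27), (7, 63))

def add_tables(t1, t2):
    """Cell-wise sum of two 2x2 tables, i.e. pooling the two sub-populations."""
    return tuple(tuple(x + y for x, y in zip(r1, r2)) for r1, r2 in zip(t1, t2))

def odds_ratio(table):
    """Odds ratio of being a case for allele A versus allele a."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

def freq_A(table):
    """Frequency of allele A in the whole table."""
    (a, b), (c, d) = table
    return (a + b) / (a + b + c + d)

def case_fraction(table):
    """Fraction of cases in the whole table (the sub-population's risk level)."""
    (a, b), (c, d) = table
    return (a + c) / (a + b + c + d)

MIXED = add_tables(SUB1, SUB2)

for name, table in (("sub-population 1", SUB1),
                    ("sub-population 2", SUB2),
                    ("mixed population", MIXED)):
    print(f"{name}: OR={odds_ratio(table):.2f}, "
          f"freq(A)={freq_A(table):.2f}, cases={case_fraction(table):.2f}")
```

Within each sub-population the odds ratio is exactly 1, while pooling gives an odds ratio of about 5.57, purely because allele A is more common in the sub-population with more cases.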
Mantel-Haenszel Test for Stratification
Adjusted RR (1)
Standard error (2)
Chi-square test (3)
An Example
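The formulas on this slide did not survive extraction, so the following Python sketch uses the standard textbook Mantel-Haenszel risk ratio and the Cochran-Mantel-Haenszel chi-square (1 df, no continuity correction). These may differ in detail from the slide's equations (1)-(3), and the standard error is not reproduced. The data are the two sub-population tables from the earlier example; the function names are illustrative.

```python
# Strata from the earlier example: rows are alleles (A, a), columns are (case, control).
STRATA = [((72, 8), (18, 2)), ((3, 27), (7, 63))]

def crude_rr(strata):
    """Risk ratio after pooling all strata into one table (ignores stratification)."""
    a = sum(t[0][0] for t in strata)
    b = sum(t[0][1] for t in strata)
    c = sum(t[1][0] for t in strata)
    d = sum(t[1][1] for t in strata)
    return (a / (a + b)) / (c / (c + d))

def mantel_haenszel_rr(strata):
    """Textbook Mantel-Haenszel risk ratio, adjusted for the stratum variable."""
    num = den = 0.0
    for (a, b), (c, d) in strata:
        n = a + b + c + d
        num += a * (c + d) / n
        den += c * (a + b) / n
    return num / den

def cmh_chi_square(strata):
    """Cochran-Mantel-Haenszel chi-square (1 df), without continuity correction."""
    diff = var = 0.0
    for (a, b), (c, d) in strata:
        n = a + b + c + d
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        diff += a - row1 * col1 / n                      # observed minus expected
        var += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    return diff * diff / var

print("crude RR:", round(crude_rr(STRATA), 2))
print("adjusted RR:", mantel_haenszel_rr(STRATA))
print("CMH chi-square:", cmh_chi_square(STRATA))
```

For these tables the crude risk ratio is about 2.45, while the adjusted risk ratio is 1 and the stratified chi-square is 0: once sub-population is accounted for, no association remains.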
Linear Model
x: marker data
Q: population structure variable, also called
• Genetic background variable
• Membership variable
• Subgroup/sub-population variable
• Ancestry/admixture proportion variable
Usually Q is unknown and needs to be estimated.
Estimating Q by Eigen-analysis
X = U S V^T (S holds the singular values)

X (SNPs by individuals)
        idv1  idv2  idv3
snp1       0     2     1
snp2       1     2     2
snp3       0     0     1
snp4       0     1     0
snp5       2     0     0

U
       -0.55   0.33   0.34
       -0.78  -0.10  -0.27
       -0.16   0.04  -0.71
       -0.20   0.14   0.52
       -0.15  -0.93   0.20

S (singular values)
        3.81   0.00   0.00
        0.00   2.05   0.00
        0.00   0.00   1.13

S^2 (eigenvalues)
       14.51   0.00   0.00
        0.00   4.21   0.00
        0.00   0.00   1.28

V^T (rows Q1, Q2, Q3: eigenvectors of COV(X))
        idv1   idv2   idv3
Q1     -0.28  -0.75  -0.60
Q2     -0.95   0.29   0.08
Q3      0.11   0.59  -0.80
References: Patterson et al. 2006, Price et al. 2006 (software EIGENSTRAT)
Or SAS Proc PRINCOMP; R svd() and eigen()
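The decomposition on this slide can be reproduced without any library by running power iteration with deflation on X'X, as in the Python sketch below. This is only an illustration: the function names are made up, X is read from the slide with SNPs as rows and individuals as columns, and in practice one would call a library routine such as the R svd() or eigen() functions named above. The sign of each eigenvector is arbitrary, so signs may be flipped relative to the slide.

```python
import math

# Genotype matrix from the slide: rows are snp1..snp5, columns are idv1..idv3.
X = [
    [0, 2, 1],
    [1, 2, 2],
    [0, 0, 1],
    [0, 1, 0],
    [2, 0, 0],
]

def gram(matrix):
    """Individuals-by-individuals matrix X'X (no centering, as in the slide's example)."""
    n = len(matrix[0])
    return [[float(sum(row[i] * row[j] for row in matrix)) for j in range(n)]
            for i in range(n)]

def matvec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def top_eigenpair(a, iterations=2000):
    """Largest eigenvalue and its eigenvector of a symmetric matrix, by power iteration."""
    v = [1.0, 0.5, 0.25][:len(a)]  # arbitrary starting vector (assumes at most 3 columns)
    for _ in range(iterations):
        w = matvec(a, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(x * y for x, y in zip(v, matvec(a, v)))
    return eigenvalue, v

def eigen_decomposition(a):
    """All eigenpairs, largest first, removing each found component (deflation)."""
    a = [row[:] for row in a]
    pairs = []
    for _ in range(len(a)):
        lam, v = top_eigenpair(a)
        pairs.append((lam, v))
        for i in range(len(a)):
            for j in range(len(a)):
                a[i][j] -= lam * v[i] * v[j]
    return pairs

eigenpairs = eigen_decomposition(gram(X))
singular_values = [math.sqrt(lam) for lam, _ in eigenpairs]
for k, (lam, vec) in enumerate(eigenpairs, start=1):
    print(f"Q{k}: eigenvalue={lam:.2f}, vector={[round(x, 2) for x in vec]}")
```

Each eigenvector has one entry per individual; these entries are the Q values that later enter the linear model as covariates.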
Eigen-analysis of HapMap Populations
[Figure: HapMap populations plotted on eigenvectors Q1 and Q2]
Estimating Q by MLE
(for admixed population)
G: Observed genotypes of admixed [and parental populations]
Q: Allelic frequencies in parental populations
P : Individual membership to be estimated
Goal: obtain P that maximizes Pr(G|P,Q)
1. Assign initial values for Q (randomly, or estimated from parental population genotype data) and P (randomly).
2. Compute P(i) by solving $\partial \Pr(G \mid Q, P) / \partial P = 0$.
3. Compute Q(i) by solving $\partial \Pr(G \mid Q, P) / \partial Q = 0$.
4. Iterate Steps 2 and 3 until convergence.
Tang et al. Genetic Epidemiology, 2005(28): 289–301
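As a much reduced illustration of the membership step only, the Python sketch below estimates one individual's membership proportion in the first of two parental populations, holding the parental allele frequencies fixed. It is not the algorithm of Tang et al.: the two-population setup, the binomial genotype model, the grid search (used instead of solving the derivative equation) and all names are assumptions made for the example.

```python
import math

def log_likelihood(p, genotypes, freq1, freq2):
    """log Pr(G | p): each genotype (count of allele A, 0..2) is Binomial(2, a_j),
    where a_j = p*freq1[j] + (1-p)*freq2[j] mixes the parental allele frequencies.
    Assumes all parental frequencies lie strictly between 0 and 1."""
    total = 0.0
    for g, f1, f2 in zip(genotypes, freq1, freq2):
        a = p * f1 + (1.0 - p) * f2
        total += math.log(math.comb(2, g)) + g * math.log(a) + (2 - g) * math.log(1.0 - a)
    return total

def mle_membership(genotypes, freq1, freq2, steps=100):
    """Grid-search maximum likelihood estimate of the membership proportion p."""
    grid = [k / steps for k in range(steps + 1)]
    return max(grid, key=lambda p: log_likelihood(p, genotypes, freq1, freq2))

# Hypothetical parental allele frequencies at 10 SNPs.
FREQ1 = [0.9] * 10
FREQ2 = [0.1] * 10

print(mle_membership([2] * 10, FREQ1, FREQ2))  # looks like parental population 1
print(mle_membership([0] * 10, FREQ1, FREQ2))  # looks like parental population 2
print(mle_membership([1] * 10, FREQ1, FREQ2))  # heterozygous everywhere: intermediate
```

In the full procedure this step alternates with re-estimating the allele frequencies until neither changes.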
Estimating Q by MCMC
(for admixed population)
Observed G : genotypes of admixed [and parental populations]
Unknown Z : admixed individuals’ membership from ancestral populations
Problem: How to estimate Z?
Bayesian and Markov Chain Monte Carlo (MCMC) methods
1. Assume ancestral population number K (see next slide)
2. Define prior distribution Pr(Z) under K
3. Use MCMC to sample from the posterior distribution Pr(Z|G) ∝ Pr(Z)∙Pr(G|Z)
4. Average over large number of MCMC samples to obtain estimate of Z
Falush et al. Genetics, 2003(164):1567–1587
Software : STRUCTURE
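The sampling idea in steps 2 to 4 can be illustrated with a toy random-walk Metropolis sampler in Python. This is not the STRUCTURE algorithm: here the unknown is reduced to a single membership proportion z for one individual, K is fixed at 2, the prior is uniform, the parental allele frequencies are treated as known, and all names and parameter values are assumptions for the example.

```python
import math
import random

def log_posterior(z, genotypes, freq1, freq2):
    """log Pr(z) + log Pr(G | z) up to a constant, with a uniform prior on [0, 1]."""
    if not 0.0 <= z <= 1.0:
        return -math.inf
    total = 0.0
    for g, f1, f2 in zip(genotypes, freq1, freq2):
        a = z * f1 + (1.0 - z) * f2          # allele frequency for this individual
        total += g * math.log(a) + (2 - g) * math.log(1.0 - a)
    return total

def metropolis(genotypes, freq1, freq2, n_samples=6000, burn_in=1000, step=0.1, seed=1):
    """Random-walk Metropolis samples of z, with the burn-in samples discarded."""
    rng = random.Random(seed)
    z = rng.random()
    current = log_posterior(z, genotypes, freq1, freq2)
    samples = []
    for iteration in range(n_samples):
        proposal = z + rng.uniform(-step, step)
        candidate = log_posterior(proposal, genotypes, freq1, freq2)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            z, current = proposal, candidate
        if iteration >= burn_in:
            samples.append(z)
    return samples

# Hypothetical data: 50 SNPs, individual heterozygous at every one.
FREQ1 = [0.9] * 50
FREQ2 = [0.1] * 50
samples = metropolis([1] * 50, FREQ1, FREQ2)
z_estimate = sum(samples) / len(samples)     # step 4: average over the MCMC samples
print(round(z_estimate, 2))
```

Averaging the retained samples gives the posterior mean, which for this symmetric example sits near 0.5.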
Infer Population Number (K)
Linear Model
(an example including m Q-variables)
$y = a + bx + b_1 Q_1 + b_2 Q_2 + \dots + b_m Q_m + e$
$y = a + bx + \sum_{i=1}^{m} b_i Q_i + e$
SAS Proc REG, Proc GENMOD; R lm(), glm()
The generalized versions (Proc GENMOD, glm()) can also fit binary/categorical y
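The effect of adding Q to the model can be seen on a small made-up data set. In the Python sketch below the trait depends only on sub-population membership Q, and the marker x is merely more common in the second sub-population. The data, the single Q variable and the hand-written least squares solver are all assumptions for illustration; in practice one would use lm() or Proc REG as listed above.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ols(predictors, y):
    """Least squares coefficients [intercept, one slope per predictor column]."""
    design = [[1.0] + [col[i] for col in predictors] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical data: y = 1 + 2*Q exactly, with no effect of the marker x.
x = [0, 0, 1, 2, 1, 2, 2, 2]          # genotype (count of allele A)
Q = [0, 0, 0, 0, 1, 1, 1, 1]          # sub-population membership
y = [1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 3.0]

naive = ols([x], y)                   # y = a + b*x + e
adjusted = ols([x, Q], y)             # y = a + b*x + b1*Q + e
print("b without Q:", round(naive[1], 3))
print("b with Q:   ", round(adjusted[1], 3))
```

Without Q the marker picks up a spurious slope of about 0.73; with Q in the model the slope for x drops to zero.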
Unified Mixed Model
(more general)
Inferred population membership
SNP(s)
Covariate(s)
ID matrix
Modeling the resemblance among individuals
V=ZGZ'+R
Multi-Variate Normal Distribution (MVN)
& Likelihood of Mixed Model
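Based on MVN, the likelihood of trait (y) in matrix form is:
$L = (2\pi)^{-n/2} \, |V|^{-1/2} \exp\left[-\tfrac{1}{2}(y-\mu)' V^{-1} (y-\mu)\right]$
n: no. of individuals (in a pedigree)
V: n×n variance-covariance matrix
y: phenotype vector
$\mu$: mean phenotype vector
$\Phi$: kinship (IBD) matrix (n×n)
V=ZGZ'+R
$V = 2\Phi\sigma_a^2 + I\sigma_e^2$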
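To make the covariance structure concrete, the Python sketch below builds V from a kinship matrix and two variance components and evaluates the multivariate normal log-likelihood through a Cholesky factorization. The trio pedigree, the variance values and the function names are assumptions for illustration; fitting the model (maximizing over the variance components) is left to the mixed-model software listed later.

```python
import math

def covariance(kinship, var_a, var_e):
    """V = 2 * Phi * var_a + I * var_e."""
    n = len(kinship)
    return [[2.0 * kinship[i][j] * var_a + (var_e if i == j else 0.0)
             for j in range(n)] for i in range(n)]

def cholesky(v):
    """Lower-triangular L with L L' = V (V must be symmetric positive definite)."""
    n = len(v)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = v[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def mvn_log_likelihood(y, mu, v):
    """log of the MVN density of y with mean vector mu and covariance V."""
    n = len(y)
    low = cholesky(v)
    z = []                                   # forward substitution: L z = y - mu
    for i in range(n):
        z.append((y[i] - mu[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    quad = sum(val * val for val in z)       # (y - mu)' V^{-1} (y - mu)
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)

# Hypothetical trio: father, mother, child (kinship 0.5 with self, 0.25 parent-child).
PHI_TRIO = [[0.5, 0.0, 0.25],
            [0.0, 0.5, 0.25],
            [0.25, 0.25, 0.5]]
V = covariance(PHI_TRIO, var_a=0.6, var_e=0.4)
print(mvn_log_likelihood([1.0, 2.0, 0.5], [1.0, 1.0, 1.0], V))
```

When all off-diagonal kinships are zero, V is diagonal and the expression reduces to a sum of independent normal log-densities, which is a convenient sanity check.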
Kinship
Inbreeding Coefficient
The inbreeding coefficient of an individual is the probability that the pair of
alleles carried by the gametes that produced it are Identical By Descent (IBD).
Identical By Descent (IBD)
Two alleles are IBD if they are copies of the same ancestral allele.
Kinship/Coancestry
The inbreeding coefficient of an individual is equal to the coancestry between
its parents. For example if parents X and Y have a child Z, then
inbreeding coefficient of Z = coancestry between X and Y
Software: SAS (PROC INBREED), MERLIN, SPAGeDi, R (kinship, emma), etc.
(need pedigree and/or marker data)
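When a pedigree is available, kinship can be computed with the usual recursive rule: an individual's kinship with itself is (1 + kinship of its parents)/2, and its kinship with anyone born earlier is the average of its parents' kinships with that person. The Python sketch below applies this to a small hypothetical pedigree in which Z is the child of full sibs X and Y; the pedigree, the names G1 and G2, and the function names are assumptions for illustration, not the implementation used by any of the packages above.

```python
from functools import lru_cache

# Hypothetical pedigree: id -> (father, mother); None marks a founder's unknown parent.
# Entries must be listed so that parents appear before their children.
PEDIGREE = {
    "G1": (None, None),
    "G2": (None, None),
    "X": ("G1", "G2"),
    "Y": ("G1", "G2"),
    "Z": ("X", "Y"),
}
ORDER = {name: index for index, name in enumerate(PEDIGREE)}

@lru_cache(maxsize=None)
def kinship(i, j):
    """Kinship (coancestry) coefficient between individuals i and j."""
    if i is None or j is None:
        return 0.0                      # unknown parents are treated as unrelated
    if ORDER[i] < ORDER[j]:
        i, j = j, i                     # recurse through the later-listed individual
    father, mother = PEDIGREE[i]
    if i == j:
        return 0.5 * (1.0 + kinship(father, mother))
    return 0.5 * (kinship(father, j) + kinship(mother, j))

def inbreeding(i):
    """Inbreeding coefficient of i = coancestry between its parents."""
    father, mother = PEDIGREE[i]
    return kinship(father, mother)

print("kinship(X, Y) =", kinship("X", "Y"))      # full sibs
print("inbreeding(Z) =", inbreeding("Z"))        # child of the full sibs
print("kinship(Z, Z) =", kinship("Z", "Z"))
```

Evaluating kinship(i, j) for every pair of individuals fills the kinship matrix used in the mixed model.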
Kinship Matrix
(expected probability of allele sharing among relatives)
Resources for Mixed Model with Kinship Matrix
Software     Kinship          Mixed Model      Data
SAS          Proc INBREED     Proc MIXED       Quantitative trait; pedigree data
SAS          Proc INBREED     Proc GLIMMIX     Quantitative/qualitative trait; pedigree data
R: kinship   makekinship()    lmekin()         Quantitative trait; pedigree data
R: emma      emma.kinship()   emma.REML.t()    Quantitative trait; marker data used to calculate kinship
EMMAX        emmax-kin        emmax            Quantitative trait; marker data used to calculate kinship
Diagnosis of Inflation of False Positives
• Inflation: more false positives than expected under the null
• In GWAS, usually due to PS
• Can be caused by inappropriate statistical methods even with no PS
• May (but does not necessarily) indicate PS
Theoretical Basis of Diagnosis
Under the null, p-values follow a uniform distribution on [0,1].
[Figure: histogram and Q-Q plot panels (axis: -log10(p)), comparing no inflation with inflation]
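The points of such a Q-Q plot are easy to compute by hand. The Python sketch below pairs the sorted observed -log10(p) values with the values expected for uniform p-values; the (rank - 0.5)/n plotting positions and the function name are choices made for this example, and the plotting itself is left out.

```python
import math

def qq_points(p_values):
    """(expected, observed) pairs of -log10(p), largest first, for a Q-Q plot.

    Expected values use the uniform quantiles (rank - 0.5) / n.
    All p-values must be greater than 0.
    """
    n = len(p_values)
    observed = sorted((-math.log10(p) for p in p_values), reverse=True)
    expected = [-math.log10((rank - 0.5) / n) for rank in range(1, n + 1)]
    return list(zip(expected, observed))

# Hypothetical p-values: a perfectly uniform set, and an inflated set (p squared).
uniform_p = [(rank - 0.5) / 200 for rank in range(1, 201)]
inflated_p = [p * p for p in uniform_p]

null_points = qq_points(uniform_p)
inflated_points = qq_points(inflated_p)
print(null_points[0], inflated_points[0])
```

With no inflation the points fall on the diagonal; with inflation the observed values rise above it.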
Inflation Rate (IR)
Devlin et al. 2004
For Binary Trait
For Continuous Trait
Amin, Duijn, Aulchenko, 2007
Genomic Control (by IR)
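For Binary Trait
$Y_i^2 = \chi_i^2$
For Continuous Trait
$Y_i^2 = (t_i)^2$
Or based on p-value
$Y_i^2 = \chi^2_{(1-p_i,\ df=1)}$
Adjusted statistic, where $\hat{\lambda}$ is the estimated inflation rate (IR)
$\tilde{Y}_i^2 = Y_i^2 / \hat{\lambda} \sim \chi^2_{df=1}$
Adjusted p-value
$\tilde{p}_i = \mathrm{Prob}(\chi^2_{df=1} \ge \tilde{Y}_i^2)$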
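The p-value based version of this adjustment can be sketched in Python with the standard library alone. The inflation rate here is estimated as the median statistic divided by the median of a 1-df chi-square (about 0.455), which is a commonly used median estimator and an assumption on my part, since the slide's own IR formulas were not recoverable. Function names are illustrative.

```python
import math
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
MEDIAN_CHI2_DF1 = _STD_NORMAL.inv_cdf(0.75) ** 2     # about 0.455

def chi2_from_p(p):
    """1-df chi-square statistic whose upper-tail probability is p (0 < p <= 1)."""
    return _STD_NORMAL.inv_cdf(p / 2.0) ** 2

def p_from_chi2(y):
    """Upper-tail probability of a 1-df chi-square statistic y."""
    return math.erfc(math.sqrt(y / 2.0))

def inflation_rate(statistics):
    """Median-based estimate of the inflation rate (assumed estimator)."""
    return median(statistics) / MEDIAN_CHI2_DF1

def genomic_control(p_values):
    """Return (estimated inflation rate, adjusted p-values)."""
    statistics = [chi2_from_p(p) for p in p_values]
    lam = inflation_rate(statistics)
    adjusted = [p_from_chi2(y / lam) for y in statistics]
    return lam, adjusted

# Hypothetical example: null p-values whose chi-square statistics were inflated by 1.5.
null_p = [(rank - 0.5) / 1001 for rank in range(1, 1002)]
inflated_p = [p_from_chi2(1.5 * chi2_from_p(p)) for p in null_p]

lam, adjusted_p = genomic_control(inflated_p)
print(round(lam, 3), round(median(adjusted_p), 3))
```

Dividing every statistic by the estimated inflation rate brings the median p-value back to 0.5 in this constructed example; on real data the correction is only as good as the assumption that inflation is uniform across markers.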
Practice
• Download and unzip the data from dsgweb.wustl.edu/qunyuan/data/popstra2011hw.zip
• Ignore pedigree.csv; test each SNP in snp.csv for association with the trait in trait.csv.
• Investigate the p-values to see if there is any inflation.
• Try to explain why.
• List some possible methods to reduce or control the inflation.
• Choose one method and apply it to the data.
• Does it work?
• Try to explain why.
• Clearly document each step of your analysis.
There is no standard answer; feel free to try anything you like!
Report back to linusan@wustl.edu and qunyuan@wustl.edu in one week.
Thanks!