LD Mapping in Outbred Populations
Day 3
Objective
Describe the principles of LD mapping, or association analysis, in outbred populations.
Concepts covered are relevant to issues in ‘genomic selection’.
1. Introduction – LD- versus LE-markers
2. Candidate gene versus high-density markers
3. General design and analysis of LD mapping or association studies – single marker regression
4. Issues with single marker regression
   a. Accounting for genetic relationships – fit polygenic effect
   b. Overestimation of significant SNPs – fit random SNP effect
5. Some other methods for LD mapping
   a. Other ‘simple’ models – multi-SNP and haplotype models
   b. More complex models
Overview of Strategies for QTL mapping
[Figure: decay of LD (r2) over generations 0 to 25 for recombination rates c = .001, .01, .05, .1, .2, .5]

             Line/Breed cross                    Outbred population
             Linkage analysis                    Linkage analysis                    LD mapping
             (LD markers)                        (LE markers)                        (LD markers)
Design       F2 / BC           AIL               HS/FS families    Ext. pedigree     Cand. genes       High density
LD used      Population wide                     Within family
Recomb.      1 rnd             >1 rnd            1 rnd             >1 rnd
LD extent    Long              Smaller           Long              Smaller
Marker map   Sparse            Denser            Sparse            Denser
Coverage     Genome wide                         Genome wide
Map resol.   Poor              Better            Poor              Better
3 types of marker loci (Dekkers 2004, J. Anim. Sci.)
Direct markers: functional mutations in known genes (alleles Q, q)
LD-markers: in population-wide linkage disequilibrium with the mutation; linkage phase ~consistent across the population (haplotypes MQ and mq)
LE-markers: in population-wide linkage equilibrium with the mutation; linkage phase NOT consistent across families
[Figure: marker-QTL haplotypes of Sires 1 to 4, with M and m paired with Q or q in different phases in different sires]
1. Benefits of LD- over LE-Markers
Linkage phase tends to be consistent across families and generations (haplotypes MQ and mq)
• “Easier” to implement in genetic evaluation:
  y = marker genotype (or haplotype) + u + e
• Estimation of effects:
  • does not require pedigree-based phenotypic data
  • ideally, animals are unrelated
  • can be done in population of application vs. experimental cross
Examples of gene tests in commercial breeding
Trait
Direct marker
Congenital
defects
Appearance
Milk quality
Species codes: D = dairy cattle, B = beef cattle, C = poultry, P = pigs, S = sheep
Dekkers, 2004, JAS
Most tests used commercially are direct or LD markers
LD marker
BLAD (D)
Citrulinaemia (D,B)
DUMPS (D)
CVM (D)
Maple syrup urine (D,B)
Mannosidosis (D,B)
RYR (P)
CKIT (P)
MC1R/MSHR (P,B,D)
MGF (B)
κ-Casein (D)
β-lactoglobulin (D)
FMO3 (D)
RYR (P)
RN/PRKAG3 (P)
Meat quality
LE marker
RYR (P)
Polled (B)
RYR (P)
RN/PRKAG3 (P)
A-FABP/FABP4 (P)
H-FABP/FABP3 (P)
CAST (P, B)
>15 PICmarqTM (P)
THYR (B)
Leptin (B)
Feed intake
Disease
Reproduction
MC4R (P)
Prp (S)
F18 (P)
Booroola (S)
Inverdale(S)
Hanna (S)
Growth &
composition
Milk yield &
composition

B blood group (C)
K88 (P)
Booroola (S)
ESR (P)
PRLR (P)
RBP4 (P)
CAST (P)
IGF-2 (P)
MC4R (P)
IGF-2 (P)
Myostatin (B)
Callipyge (S)
DGAT (D)
GRH (D)
κ-Casein (D)
QTL (P)
QTL (B)
Carwell (S)
PRL (D)
QTL (D)
In outbred populations only some closely linked markers will be in sufficient LD with QTL (see Day 1)
[Figure: decay of LD (r2) over generations 0 to 25 for recombination rates c = .001, .01, .05, .1, .2, .5]
Extent of LD is driven by Ne
\[ E(r^2) = \frac{1}{1 + 4 N_e c} \]
(Sved, 1971), where c = distance between the loci (Morgans)
[Figure: expected LD (r2) against distance (0 to 0.16 Morgans) for Ne = 10, 25, 50, 100, 250, 500]
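The curves in this figure follow directly from Sved's formula. A minimal sketch, assuming NumPy is available; the function name `expected_r2` and the chosen distances are illustrative and not from the slides:

```python
import numpy as np

def expected_r2(ne, c):
    """Sved (1971) approximation: E(r^2) = 1 / (1 + 4 * Ne * c)."""
    return 1.0 / (1.0 + 4.0 * ne * np.asarray(c, dtype=float))

# Distances in Morgans (illustrative values within the plotted range)
distances = np.array([0.0, 0.01, 0.05, 0.10])

for ne in (10, 25, 50, 100, 250, 500):   # Ne values shown in the figure
    r2 = expected_r2(ne, distances)
    print(f"Ne={ne:>3}: " + "  ".join(f"{v:.3f}" for v in r2))
```

For a given distance, a larger Ne gives lower expected LD, so more markers are needed for each QTL to have a marker in sufficient LD.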
2. Candidate genes vs high-density markers
How to find markers close enough to QTL for population-wide LD?
Candidate gene analysis
Find markers in genes that may contain QTL based on
• their biological role
• location in a QTL region
• comparative data
• gene expression data
High density genotyping
Genotype enough markers such that each QTL will have several markers close enough such that at least one marker will be in sufficient LD with the QTL to show an association with phenotype.
A Revolution in Molecular Genetic Technology
Single Nucleotide Polymorphisms (SNPs), e.g.
  AAGCCTTGATAATT
  AAGCCTTGCTAATT
2.8 million SNPs (Nature 2004)
International Swine Genome Sequencing Consortium
High-throughput SNP genotyping: Illumina Bovine 50k Beadchip
Overview of Strategies for QTL mapping
[Figure: decay of LD (r2) over generations 0 to 25 for recombination rates c = .001, .01, .05, .1, .2, .5]

             Breed/Line cross                    Outbred population
             Linkage analysis                    Linkage analysis                    LD mapping
             (LD markers)                        (LE markers)                        (LD markers)
Design       F2 / BC           AIL/RIL           HS/FS families    Extended pedigree Candidate genes   High density
LD used      Population wide                     Within family                       Population wide
Recomb.      1 rnd             >1 rnd            1 rnd             >1 rnd            >>> 1 round
LD extent    Long              Smaller           Long              Smaller           Small
Marker map   Sparse            Denser            Sparse            Denser            Few loci          Dense
Coverage     Genome wide                         Genome wide                         Local             Genome
Map resol.   Poor              Better            Poor              Better            High

LD-LA analysis – see later
Candidate Gene Examples
Estrogen Receptor Gene (Rothschild et al. 1991, Short et al. 1997)
Effect on Number Born Alive

ESR genotype   First parity (n=4,262)   Later parities (n=4,753)
AA             9.4                      10.0
AB             9.9                      10.5
BB             10.2                     10.7
MC4R mutation and Test (Kim et al., Mam. Gen. 2000)
[Figure: MC4R protein with transmembrane domains I to VII (NH2 to COOH); amino acid sequences around residues 293 to 300 for the Allele 1 and Allele 2 homozygotes; test result for genotypes 1/1, 1/2 and 2/2 (labels 542 and 466)]
11 vs 22 genotype in 2 commercial types:

Trait                       11 vs 22   Significance
Backfat (mm)                -1.3       P<.05
Loin depth (mm)             +1.4       P<.10
Daily gain (g/d)            -26.0      P<.10
Daily feed intake (kg/d)    -0.15      P<.05

Slide courtesy Max Rothschild
3. General Design and Analysis of LD Mapping or Association Studies – single marker regression
On a ‘random’ sample of (unrelated) individuals obtain:
I. Phenotype for quantitative trait
II. Genotypes for one or many markers
Example (TRAINING DATA): genotype many cows with phenotype (or progeny-tested bulls) for 50,000 SNPs
Conduct statistical analysis for association between genotype at a marker and phenotype (repeat for each marker)
Y = μ + marker genotype + e
Test for significance
Principle of LD marker effect estimation
AAGCCTTGATAATT
AAGCCTTGCTAATT
Progeny tested bulls grouped by their genotype for a particular SNP

SNP genotype   Average PTA protein
AA             +20
AC             +15
CC             +10

→ SNP effect estimate = +5 for A
Repeat for all markers
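The +5 estimate is the slope of average PTA on the number of A alleles. A minimal sketch, assuming NumPy; it regresses the three genotype-group means directly and, as a simplification not stated on the slide, ignores how many bulls are in each group:

```python
import numpy as np

# Number of A alleles per genotype and the group averages quoted on the slide
a_count = np.array([2.0, 1.0, 0.0])        # AA, AC, CC
mean_pta = np.array([20.0, 15.0, 10.0])    # average PTA protein

# Slope of a straight line through the group means = effect of one A allele
slope, intercept = np.polyfit(a_count, mean_pta, deg=1)
print(f"effect of one A allele: {slope:+.1f}")
```

With real data the regression would be run on the individual bulls, so that genotype classes with more animals carry more weight.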
LD mapping / association analysis by single marker regression analysis
y = 1n μ + Xg + e
y = vector of phenotypes
1n = vector of 1s allocating the mean to each phenotype
X = design matrix allocating records to the marker effect (allele count coded 0/1/2 or -1/0/1)
g = effect of the marker (= allele substitution effect)
e = vector of random deviates ~ N(0, σe²)
• Underlying assumption is that the marker will only affect the trait if it is in LD with a QTL.
Hayes ‘07
Single marker regression
y = 1n μ + Xg + e
• The design vector 1n allocates phenotypes to the mean
• The design vector X allocates phenotypes to genotypes

Animal   Phenotype (y)   SNP allele 1   SNP allele 2   1n   X (number of “2” alleles)
1        2.030502        1              1              1    0
2        3.542274        1              2              1    1
3        3.834241        1              2              1    1
4        4.871137        2              2              1    2
5        3.407128        1              2              1    1
6        2.335734        1              1              1    0
7        2.646192        1              1              1    0
8        3.762855        1              2              1    1
9        3.689349        1              2              1    1
10       3.685757        1              2              1    1

Hayes ‘07
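Single marker regression
Estimate marker effect and mean as:
\[ \begin{bmatrix} \hat{\mu} \\ \hat{g} \end{bmatrix} = \begin{bmatrix} 1_n'1_n & 1_n'X \\ X'1_n & X'X \end{bmatrix}^{-1} \begin{bmatrix} 1_n'y \\ X'y \end{bmatrix} \]
For the ten animals in the example:
\[ 1_n'1_n = [1\;1\;1\;1\;1\;1\;1\;1\;1\;1]\,[1\;1\;1\;1\;1\;1\;1\;1\;1\;1]' = 10 \]
\[ 1_n'X = [1\;1\;1\;1\;1\;1\;1\;1\;1\;1]\,[0\;1\;1\;2\;1\;0\;0\;1\;1\;1]' = 8 \]
so that
\[ \begin{bmatrix} \hat{\mu} \\ \hat{g} \end{bmatrix} = \begin{bmatrix} 10 & 8 \\ 8 & 10 \end{bmatrix}^{-1} \begin{bmatrix} 33.8 \\ 31.7 \end{bmatrix} = \begin{bmatrix} 0.28 & -0.22 \\ -0.22 & 0.28 \end{bmatrix} \begin{bmatrix} 33.8 \\ 31.7 \end{bmatrix} = \begin{bmatrix} 2.35 \\ 1.28 \end{bmatrix} \]
(final values computed from the unrounded inverse and right-hand side)
Conduct F-test for significance
Hayes ‘07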
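The normal equations for this example can be checked numerically. A minimal sketch, assuming NumPy; the data are the ten records from the table above, and the variable names are illustrative:

```python
import numpy as np

# Phenotypes and number of "2" alleles for the ten animals in the example
y = np.array([2.030502, 3.542274, 3.834241, 4.871137, 3.407128,
              2.335734, 2.646192, 3.762855, 3.689349, 3.685757])
x = np.array([0, 1, 1, 2, 1, 0, 0, 1, 1, 1], dtype=float)

n = y.size
W = np.column_stack([np.ones(n), x])   # [1n  X]
lhs = W.T @ W                          # [[1n'1n, 1n'X], [X'1n, X'X]]
rhs = W.T @ y                          # [1n'y, X'y]
mu_hat, g_hat = np.linalg.solve(lhs, rhs)

# F statistic for the marker effect (1 and n - 2 degrees of freedom)
resid = y - W @ np.array([mu_hat, g_hat])
sse = resid @ resid
sst = np.sum((y - y.mean()) ** 2)
f_stat = (sst - sse) / (sse / (n - 2))

print(lhs)
print(rhs.round(1))
print(round(mu_hat, 2), round(g_hat, 2), round(f_stat, 1))
```

Solving at full precision gives a mean of about 2.35 and a marker effect of about 1.28. Using `np.linalg.solve` avoids forming the inverse explicitly, which is a design choice of this sketch rather than something the slides prescribe.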
Example results from single SNP analyses
Estimates of Marker Effects for Milk yield, US Holsteins (Paul VanRaden, 2008)
National Swine Improvement Federation Symposium, Dec. 2008
[Figure: ‘Manhattan plot’ of estimated marker effects]
Sample results – 1 line, 1 chr, 1 trait
[Figure: three panels along one chromosome: 1-SNP P-value (-log10(p-value)), 1-SNP Estimates (Estimate/sd) and Allele frequency (Freq.), with favorable and unfavorable alleles marked]
4. Issues with LD mapping using single marker regression
• Significance testing – e.g. F-test
  • Many tests – need to control for false positives
  • Could use permutation test (see before) – difficult if individuals are related
• False positives because of population structure (see problem set 1)
  • Simple model assumes all animals are equally (un)related = unlikely
  • Presence of breeds, strains, or families all create population structure
  • i.e. presence of extensive genetic relationships
  • To try to account for this – fit breed composition (if available)
                                 – fit polygenic effect with relationships
• Overestimation of significant SNPs – fit SNP effect as random
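The permutation test mentioned above can be sketched as follows. This is a minimal illustration on simulated data, assuming NumPy; the sample sizes, allele frequency, number of permutations and the helper `max_abs_t` are all assumptions, and shuffling phenotypes is only valid when individuals are exchangeable (unrelated), which is exactly the difficulty the slide flags:

```python
import numpy as np

rng = np.random.default_rng(1)

def max_abs_t(y, X):
    """Largest |t| over single-marker regressions of y on each column of X."""
    yc = y - y.mean()
    Xc = X - X.mean(axis=0)
    r = (Xc.T @ yc) / np.sqrt((Xc ** 2).sum(axis=0) * (yc @ yc))
    t = r * np.sqrt((y.size - 2) / (1.0 - r ** 2))
    return np.abs(t).max()

# Simulated data (assumed sizes): 200 animals, 50 SNPs coded 0/1/2
n, m = 200, 50
X = rng.binomial(2, 0.3, size=(n, m)).astype(float)
y = rng.normal(size=n)            # null phenotype: no SNP has an effect

# Null distribution of the maximum statistic across all markers
n_perm = 200
null_max = np.array([max_abs_t(rng.permutation(y), X) for _ in range(n_perm)])
threshold = np.quantile(null_max, 0.95)   # experiment-wise 5% threshold

print(f"threshold |t| = {threshold:.2f}, observed max |t| = {max_abs_t(y, X):.2f}")
```

Taking the maximum over markers in each permutation controls the experiment-wise error rate while keeping the LD between markers intact.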
a. Impact of Genetic Relationships
• Results in underestimation of standard errors – e.g. Hassen et al. JAS ’09
[Figure: distribution of p-values (frequency histograms) without and with a polygenic effect; the model without a polygenic effect shows an excess of low p-values]
• Could also give biased estimates – simple example (Hayes ’07)
– A sire with high EBV has many progeny in the population.
– A rare allele at some SNP is homozygous in the sire (aa).
– Then the sub-population of his progeny has a higher frequency of a than the overall population.
– As the sire’s EBV is high, his progeny will also have higher EBV.
– If we don’t account for this, the a allele will appear to have a positive effect.
Extension of 1-SNP model by fitting a polygenic effect
y = 1n μ + Xg + Zu + e
u = vector of polygenic effects with covariance structure u ~ N(0, Aσa²)
A = average relationship matrix built from the pedigree; σa² = genetic variance
Z = design matrix allocating animals to records
λ = σe²/σa²
Henderson’s Mixed Model Equations:
\[ \begin{bmatrix} \hat{\mu} \\ \hat{g} \\ \hat{u} \end{bmatrix} = \begin{bmatrix} 1_n'1_n & 1_n'X & 1_n'Z \\ X'1_n & X'X & X'Z \\ Z'1_n & Z'X & Z'Z + A^{-1}\lambda \end{bmatrix}^{-1} \begin{bmatrix} 1_n'y \\ X'y \\ Z'y \end{bmatrix} \]
Hayes ‘07
Example (Hayes ‘07)

Animal   Sire   Dam   Phenotype   SNP allele (Pat)   SNP allele (Mat)
1        0      0     10.1        0                  1
2        0      0     2.2         1                  1
3        0      0     2.31        1                  1
4        1      2     6.57        0                  1
5        1      2     6.06        0                  1
6        1      3     6.21        0                  1

Simple regression model
y = 1n μ + Xg + e, with X' = [1 2 2 1 1 1] (number of “1” alleles per animal)
\[ \begin{bmatrix} \hat{\mu} \\ \hat{g} \end{bmatrix} = \begin{bmatrix} 1_n'1_n & 1_n'X \\ X'1_n & X'X \end{bmatrix}^{-1} \begin{bmatrix} 1_n'y \\ X'y \end{bmatrix} \]
The A matrix (Hayes ‘07)
Elements = additive genetic relationship = the proportion of genes shared. See more later (IBD).
Half genes from mum, half from dad.

Pedigree
Animal   Sire   Dam
1        0      0
2        0      0
3        0      0
4        1      2
5        1      2
6        1      3

A matrix (upper triangle shown; the matrix is symmetric)
           Animal 1   Animal 2   Animal 3   Animal 4   Animal 5   Animal 6
Animal 1   1          0          0          0.5        0.5        0.5
Animal 2              1          0          0.5        0.5        0
Animal 3                         1          0          0          0.5
Animal 4                                    1          0.5        0.25
Animal 5                                               1          0.25
Animal 6                                                          1

Animal 6 is a half sib of 4 and 5 (relationship 0.25)
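The A matrix for this pedigree can be built with the tabular method. A minimal sketch, assuming NumPy and a pedigree in which parents are numbered before their progeny and 0 denotes an unknown parent; the function name `relationship_matrix` is illustrative:

```python
import numpy as np

def relationship_matrix(sire, dam):
    """Tabular method: animals coded 1..n, 0 = unknown parent,
    parents must appear before their progeny."""
    n = len(sire)
    A = np.zeros((n, n))
    for i in range(n):
        s, d = sire[i] - 1, dam[i] - 1          # -1 means unknown parent
        for j in range(i):
            # Relationship with an older animal = average of that animal's
            # relationships with the two parents of i
            a_js = A[j, s] if s >= 0 else 0.0
            a_jd = A[j, d] if d >= 0 else 0.0
            A[i, j] = A[j, i] = 0.5 * (a_js + a_jd)
        # Diagonal = 1 + inbreeding coefficient (half the parents' relationship)
        inbreeding = 0.5 * A[s, d] if (s >= 0 and d >= 0) else 0.0
        A[i, i] = 1.0 + inbreeding
    return A

sire = [0, 0, 0, 1, 1, 1]
dam = [0, 0, 0, 2, 2, 3]
A = relationship_matrix(sire, dam)
print(A)
```

For the six-animal pedigree this reproduces the matrix in the table, including 0.5 between full sibs 4 and 5 and 0.25 between half sibs 4 and 6.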
Example (Hayes ‘07)
The same six animals (pedigree, phenotypes and SNP alleles as in the example table above) analysed with the polygenic model:
y = 1n μ + Xg + Zu + e
\[ \begin{bmatrix} \hat{\mu} \\ \hat{g} \\ \hat{u} \end{bmatrix} = \begin{bmatrix} 1_n'1_n & 1_n'X & 1_n'Z \\ X'1_n & X'X & X'Z \\ Z'1_n & Z'X & Z'Z + A^{-1}\lambda \end{bmatrix}^{-1} \begin{bmatrix} 1_n'y \\ X'y \\ Z'y \end{bmatrix} \]
λ = σe²/σa² = (1 − h²)/h² = (1 − 0.75)/0.75 = 0.33
Hayes ‘07
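The mixed model equations for this example can be set up and solved in a few lines. A minimal sketch, assuming NumPy, one record per animal (so Z is an identity matrix) and h² = 0.75 as on the slide; the A matrix is typed in from the six-animal pedigree:

```python
import numpy as np

y = np.array([10.1, 2.2, 2.31, 6.57, 6.06, 6.21])
x = np.array([1, 2, 2, 1, 1, 1], dtype=float)   # "1" alleles (Pat + Mat)

# Additive relationship matrix for the six-animal pedigree
A = np.array([[1.0, 0.0, 0.0, 0.5, 0.5, 0.5],
              [0.0, 1.0, 0.0, 0.5, 0.5, 0.0],
              [0.0, 0.0, 1.0, 0.0, 0.0, 0.5],
              [0.5, 0.5, 0.0, 1.0, 0.5, 0.25],
              [0.5, 0.5, 0.0, 0.5, 1.0, 0.25],
              [0.5, 0.0, 0.5, 0.25, 0.25, 1.0]])

h2 = 0.75
lam = (1.0 - h2) / h2                 # lambda = sigma_e^2 / sigma_a^2

n = y.size
Z = np.eye(n)                         # one record per animal (assumption)
W = np.hstack([np.ones((n, 1)), x.reshape(-1, 1), Z])   # [1n  X  Z]

lhs = W.T @ W
lhs[2:, 2:] += np.linalg.inv(A) * lam   # add A^-1 * lambda to the Z'Z block
rhs = W.T @ y

sol = np.linalg.solve(lhs, rhs)
mu_hat, g_hat, u_hat = sol[0], sol[1], sol[2:]
print("mu:", mu_hat, "g:", g_hat)
print("u :", u_hat)
```

Comparing `g_hat` with the simple regression estimate on the same data shows how much of the apparent SNP effect is absorbed by the polygenic term once relationships are accounted for.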
b. Overestimation of significant SNPs
• Least squares (fixed effect) estimates of SNP effects are equal to the true value + estimation error: ĝ = g + e_ĝ
• Thus, SNPs that are significant tend to have larger estimation errors – e.g. SNPs with small minor allele freq.
• This can be addressed by fitting SNP effects as random, e.g. assuming g ~ N(0, σg²) for some choice of σg²
Fitting g as random regresses or shrinks estimates back to 0 to account for the lack of information
If the choice of σg² is correct (?) then the resulting estimates are BLUP, which have the property:
g = ĝ + pe_ĝ, where pe_ĝ is the prediction error (uncorrelated with ĝ)
Note the similarity to BLUP estimation of breeding values
Differences between random / fixed are small if the amount of data is large (→ small errors) or if λg = σe²/σg² is small
Fitting SNP Effects Random vs. fixed
y = 1n μ + Xg + Zu + e
Add λg = σe²/σg² to the diagonal of the X’X matrix:
\[ \begin{bmatrix} \hat{\mu} \\ \hat{g} \\ \hat{u} \end{bmatrix} = \begin{bmatrix} 1_n'1_n & 1_n'X & 1_n'Z \\ X'1_n & X'X + I\lambda_g & X'Z \\ Z'1_n & Z'X & Z'Z + A^{-1}\lambda \end{bmatrix}^{-1} \begin{bmatrix} 1_n'y \\ X'y \\ Z'y \end{bmatrix} \]
σg² could be set such that X_i g explains variance equal to some value σM²
→ Var(X_i g) = σM², which must be solved for σg²
Using the conditional expectation theorem, with X_i = -1, 0, or 1 and p = freq. of allele 0:
\[ \begin{aligned} \mathrm{Var}(X_i g) &= E\{\mathrm{Var}(X_i g \mid X_i = k)\} + \mathrm{Var}\{E(X_i g \mid X_i = k)\} \\ &= \sum_{k} \Pr(X_i = k)\, k^2 \sigma_g^2 + 0 \\ &= \{p^2(-1)^2 + 2p(1-p)(0)^2 + (1-p)^2(1)^2\}\, \sigma_g^2 \\ &= \{p^2 + (1-p)^2\}\, \sigma_g^2 \end{aligned} \]
→ σg² = σM² / {p² + (1−p)²}
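The effect of treating g as random can be illustrated on the ten-animal single-SNP example, leaving out the polygenic term for brevity. A minimal sketch, assuming NumPy; the values of σe² and σM² are arbitrary assumptions chosen only to show the shrinkage, and the helper names are illustrative:

```python
import numpy as np

def sigma_g2_for_marker_variance(sigma_m2, p):
    """sigma_g^2 such that Var(X_i g) = sigma_M^2 with X_i coded -1/0/1."""
    return sigma_m2 / (p ** 2 + (1.0 - p) ** 2)

y = np.array([2.030502, 3.542274, 3.834241, 4.871137, 3.407128,
              2.335734, 2.646192, 3.762855, 3.689349, 3.685757])
count = np.array([0, 1, 1, 2, 1, 0, 0, 1, 1, 1], dtype=float)
x = count - 1.0                        # recode 0/1/2 as -1/0/1

p = count.sum() / (2 * count.size)     # sample allele frequency (0.4)
sigma_e2 = 0.5                         # assumed residual variance
sigma_m2 = 0.1                         # assumed variance explained by the marker
sigma_g2 = sigma_g2_for_marker_variance(sigma_m2, p)
lam_g = sigma_e2 / sigma_g2

def solve_single_snp(lam):
    """Solve for (mu, g); lam = 0 gives the fixed-effect (least squares) fit."""
    W = np.column_stack([np.ones(y.size), x])
    lhs = W.T @ W
    lhs[1, 1] += lam                   # add lambda_g to the X'X element
    return np.linalg.solve(lhs, W.T @ y)

mu_fixed, g_fixed = solve_single_snp(0.0)
mu_random, g_random = solve_single_snp(lam_g)
print(f"g fixed = {g_fixed:.3f}, g random = {g_random:.3f}, lambda_g = {lam_g:.2f}")
```

The random-effect estimate is pulled toward 0, and the pull weakens as λg gets smaller or as more data are added, in line with the last point on the previous slide.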
5. Some other models for LD mapping
a. Some other ‘simple’ models
• SNP genotype models
  • Single SNP regression
    yi = μ + Xij gj + ui + ei, with u ~ N(0, Aσa²)
    Xij = number of “1” alleles (0/1/2) – estimates allele substitution effect
    Or fit as class variable → dominance
  • Multi-SNP regression
    yi = μ + Xij gj + Xi,j+1 gj+1 + ui + ei
• Haplotype methods
  • Fixed/random haplotype effects
    y = Xg + u + e, with g’ = [μ00 , μ01 , μ10 , μ11]
    separate mean for each haplotype
[Figure: example SNP haplotypes shown as strings of 0/1 alleles]
Slide notes (jdekkers, 8/7/2006):
j13: Composite likelihood
j14: Long and Langley 1999; Fan and Xiong 2002
j15: Or using any combination of markers, as implemented by Bonnen et al. (Nat. Genet 38 2006)? Found not to be better by Hong-hua - threshold more stringent because of larger # tests.
j16: Mixture distribution for presumed biallelic QTL
b. More complex models
• IBD Mixed Models (Meuwissen & Goddard 2000) – see LATER
  y = ZgQ + u + e, with gQ ~ N(0, GvσQ²)
  Gv = IBD matrix; cov. from Prob(IBD at Q | markers)
• Combined Linkage Disequilibrium and Linkage Analysis Models (LD-LA) – see LATER
• Whole genome analysis methods
  Fit all SNPs simultaneously using ‘genomic selection’ type models (Xu 2003, Meuwissen et al. 2001)
  yi = μ + Σj βj gij + (ui) + ei, with βj random (Bayesian)
  See Module B - Genomic selection
[Figure: -logP (0 to 4.5) plotted along a chromosome (0 to 130 cM)]
Summary and Conclusions
• Several population designs and statistical methods are available to map the QTL landscape
• Most studies to date have used
  • line crosses
  • within family linkage
  → Broad QTL peaks
• Candidate gene analyses → Single sharp peak
• New technology enables genome-wide LD mapping → Many sharp peaks
  some will stand the test of time but many will crumble . . . .
But do we really care where the peaks are – all we need is a good predictor of breeding value / phenotype . . . . → Genomic Selection