ANALYSIS OF GENETIC DIFFERENTIATION AND STRUCTURE

BO3010 H05
POPULASJONSGENETIKK
(J. Mork, TBS)
Evolution can be defined as any change in population gene frequency. Given sufficient time and degree of
isolation, the evolutionary forces (mutation, genetic drift, gene flow and selection) will eventually result in
different gene frequencies in different populations («time» may mean anything from a few, to hundreds of
thousands of generations, depending on population size).
There are many models for describing genetic differentiation. One of the best known and most frequently used is
Sewall Wright’s «Mainland-Island model». It is based on a situation where a start population («Mainland») is
split into many isolated subpopulations («Islands»), and describes the genetic differentiation between these over
generations using formulae that include, e.g., population size, migration rates, and the number of generations
since population splitting (and thereby reproductive isolation). Wright utilized a specific statistic - the Fst - as a
measure of the degree of differentiation. The Fst value tells what proportion of the total genetic variability in the
material is caused by genetic differences between populations (the rest is of course due to differences between
individuals, i.e. within populations). For example, an Fst value of 0.10 would mean that 10% of the total
variation can be attributed to differences between samples. It is worthwhile to mention that because Fst is a
relative measure (between/within), its value is not expected to be affected by the type of genetic marker used (i.e.,
markers with different evolutionary rates, like e.g. isozymes and microsatellites, would be expected to yield
similar Fst estimates when applied on the same material). Measures of absolute genetic differences, on the other
hand (like Nei’s genetic distance D), are expected to give different results depending on the evolutionary rate
(mutation rate) of the marker in question. For example, mini- and microsatellites are expected to give larger D values than isozymes, and this has also been shown to be the case in practice.
WRIGHT’S FST (A RELATIVE MEASURE OF DIFFERENTIATION)
To understand the nature of Fst it is useful to have some knowledge of the Hardy-Weinberg theorem and the
so-called Wahlund effect. The latter states that in a physical mixture (i.e., not an interbred group) of individuals
from two or more populations with different gene frequencies, the mixture will show a deficit of heterozygotes
compared to the expected (Hardy-Weinberg) proportion calculated from the joint gene frequency in the mixture.
This effect is easy to understand by looking at the extreme situation where two populations with gene frequencies
of 1.0 and 0.0, respectively, provide one half each of a mixed sample. The joint gene frequency in the mixed
group will necessarily be 0.5, and from this we would expect a proportion of heterozygotes of (0.5*0.5*2=) 0.5
from the Hardy-Weinberg theorem, while the mixed group actually has no heterozygotes! Smaller differences
in allele frequencies, or skewed proportions between the groups involved, will of course create correspondingly
smaller deficits of heterozygotes, but whenever observed, a significant deficit of heterozygotes is an indication
that our sample consists of a mixture of two or more populations with different gene frequencies.
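The extreme case above can be checked numerically. The following is a minimal sketch, assuming a two-allele locus and Hardy-Weinberg proportions within each contributing population; the function names are illustrative and not taken from any of the course software.

```python
def hw_heterozygosity(p):
    """Expected heterozygote proportion 2pq at a two-allele locus."""
    return 2.0 * p * (1.0 - p)


def wahlund_deficit(freqs, weights):
    """Heterozygote deficit in a physical mixture of populations.

    freqs   -- allele frequency in each contributing population
    weights -- relative contribution of each population to the mixture
    Returns (expected, observed, deficit) heterozygote proportions.
    """
    total = sum(weights)
    shares = [w / total for w in weights]
    # Joint allele frequency in the mixed sample
    p_joint = sum(s * p for s, p in zip(shares, freqs))
    # Hardy-Weinberg expectation from the joint frequency
    expected = hw_heterozygosity(p_joint)
    # Heterozygotes actually present: each population keeps its own 2pq
    observed = sum(s * hw_heterozygosity(p) for s, p in zip(shares, freqs))
    return expected, observed, expected - observed


# Two populations fixed for different alleles, mixed half and half
print(wahlund_deficit([1.0, 0.0], [0.5, 0.5]))  # (0.5, 0.0, 0.5)
```

With less extreme frequencies or skewed mixing proportions the same function returns a correspondingly smaller deficit.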
Observed heterozygosity is simply the proportion of heterozygotes in a sample.
Expected heterozygosity (H) at a locus, however, is calculated from the observed allele frequencies:
$$ H = 1 - \sum_i x_i^2 $$
where x_i is the frequency of the i-th allele. Mean expected H is written with a ‘bar’ above it and is the arithmetic
average of H over all the investigated loci (usually both monomorphic and polymorphic loci are included). Relevant
software: Hetzyg.exe (J. Mork)
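As an illustration of the calculation, here is a minimal sketch of expected heterozygosity per locus and its mean over loci. The allele frequencies in the example are hypothetical, and the code is not a description of how Hetzyg.exe works.

```python
def expected_heterozygosity(allele_freqs):
    """H = 1 - sum of squared allele frequencies at one locus."""
    if abs(sum(allele_freqs) - 1.0) > 1e-6:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(x * x for x in allele_freqs)


def mean_expected_heterozygosity(loci):
    """Arithmetic mean of H over all loci, monomorphic ones included."""
    return sum(expected_heterozygosity(f) for f in loci) / len(loci)


# Hypothetical frequencies at three loci; the last one is monomorphic
loci = [[0.5, 0.5], [0.9, 0.1], [1.0]]
for freqs in loci:
    print(freqs, round(expected_heterozygosity(freqs), 4))
print("mean H:", round(mean_expected_heterozygosity(loci), 4))
```

Including the monomorphic locus (H = 0) lowers the mean, which is why it matters whether such loci are counted.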
The rationale behind Fst is that at the ‘start’ of differentiation (i.e., when reproductive isolation occurs), all the
(sub)populations have the same allele frequencies and genotype frequencies at all loci. Assume that the allele
frequency at a two-allele polymorphic locus is 0.5. Over generations, the allele frequencies and thereby the genotype
frequencies will diverge between populations. The amount of divergence due to genetic drift will depend on
population sizes and the number of generations. If, at one point in time, all the genotypes in all the populations
are pooled in one large table, the proportion of heterozygotes will be smaller than that
calculated (the H-W expectation) from the ‘joint’ allele frequencies of that mixed group. The deficit will
increase with time (generations), until eventually all the (sub)populations are fixed for one or the other allele,
and no heterozygotes are observed at all. (The expected proportion, which is based on joint allele frequencies, is
however constant and hence the same as in the undivided start population).
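The process described above can be imitated with a small simulation. This is a rough sketch under simplifying assumptions that are not part of the text: drift only, non-overlapping generations, random sampling of 2N gene copies per generation, and equal-sized subpopulations. All names and parameter values are illustrative. With a finite number of subpopulations the joint allele frequency also fluctuates somewhat, so the expected proportion is only approximately constant in a single run.

```python
import random


def drift_generation(p, n_individuals, rng):
    """Sample 2N gene copies from the parental allele frequency p."""
    copies = 2 * n_individuals
    count = sum(1 for _ in range(copies) if rng.random() < p)
    return count / copies


def simulate_subpopulations(n_subpops, n_individuals, generations,
                            p0=0.5, seed=1):
    """Let isolated subpopulations drift apart from a common start."""
    rng = random.Random(seed)
    freqs = [p0] * n_subpops
    for _ in range(generations):
        freqs = [drift_generation(p, n_individuals, rng) for p in freqs]
    return freqs


def pooled_heterozygosities(freqs):
    """Return (mean within-subpopulation H, H expected from joint frequency)."""
    h_mean = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
    p_joint = sum(freqs) / len(freqs)
    h_total = 2 * p_joint * (1 - p_joint)
    return h_mean, h_total


# Hypothetical run: 5 subpopulations of 20 individuals, 50 generations
freqs = simulate_subpopulations(5, 20, 50)
print(freqs)
print(pooled_heterozygosities(freqs))
```

Running the simulation for many generations with small subpopulations leaves every subpopulation fixed for one allele or the other, at which point the mean within-subpopulation heterozygosity is zero.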
One way to look at this process is that the genetic variability, which at the start was entirely located within
populations, is more and more transformed to be between populations. Fst is actually a measure of the fraction of
the total genetic variation which can be attributed to differences between populations (the so-called ‘between’
component).
The formula for Fst is:
Fst = 1 - Hmean / Htotal
where Hmean is the arithmetic mean of the heterozygosities in all the subpopulations, while Htotal is the expected
heterozygosity based on the joint allele frequency in pooled subpopulations. It is evident from the formula that
Fst equals 0 when the subpopulations are identical in allele frequencies, and 1 when they are fixed for different
alleles.
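The two limiting cases can be verified directly from the formula. The sketch below assumes a two-allele locus and equally sized subpopulations, so that the joint allele frequency is the plain mean of the subpopulation frequencies; the function name is illustrative.

```python
def fst(subpop_freqs):
    """Fst = 1 - Hmean / Htotal for one two-allele locus.

    subpop_freqs -- frequency of one allele in each subpopulation
                    (subpopulations assumed to be of equal size)
    """
    n = len(subpop_freqs)
    # Mean of the within-subpopulation expected heterozygosities
    h_mean = sum(2 * p * (1 - p) for p in subpop_freqs) / n
    # Expected heterozygosity from the joint (pooled) allele frequency
    p_joint = sum(subpop_freqs) / n
    h_total = 2 * p_joint * (1 - p_joint)
    if h_total == 0:
        raise ValueError("locus is monomorphic in the pooled material")
    return 1.0 - h_mean / h_total


print(fst([0.5, 0.5]))  # identical subpopulations -> 0.0
print(fst([1.0, 0.0]))  # fixed for different alleles -> 1.0
```

Intermediate cases fall between these limits; for example, frequencies of 0.6 and 0.4 give Hmean = 0.48 and Htotal = 0.50, hence Fst = 0.04.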
Wright’s Fst is basically a measure for single loci. Masatoshi Nei has suggested an analogous statistic, called Gst,
which utilizes information from several loci simultaneously and is calculated from allele frequencies rather than
genotype frequencies (assuming Hardy-Weinberg equilibria in all subpopulations). Relevant software: GSA.exe (J. Mork)
NEI’S I AND D (ABSOLUTE MEASURES OF DIFFERENTIATION)
Masatoshi Nei has also suggested another measure, called D (genetic distance) which provides an estimate of the
absolute genetic differences between populations. («.. mean number of amino acid substitutions per locus»). This
statistic utilizes allele frequencies at multiple loci, and is calculated for each locus via the statistic I («genetic
Identity»). The formula is:
I = xiyi / SQR[(xi2)(yi2)]
where xi og yi are frequencies of the i-th allele in population X og Y, respectively (SQR means square root).
Furthermore,
D = - ln(I).
It is common to calculate the arithmetic mean D when dealing with more than one locus.
Relevant software: DG25.exe, DG50.exe, DG100.exe (J. Mork), BIOSYS (D. Swofford)
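For readers who want to reproduce the calculation by hand, the following is a minimal sketch of the single-locus identity I and the corresponding distance D. The allele frequencies are hypothetical, and the code makes no claim about how the programs listed above are implemented.

```python
import math


def nei_identity(x, y):
    """Genetic identity I at one locus.

    x, y -- allele frequencies in populations X and Y, alleles in the same order
    """
    if len(x) != len(y):
        raise ValueError("both populations need the same list of alleles")
    jxy = sum(xi * yi for xi, yi in zip(x, y))
    jx = sum(xi * xi for xi in x)
    jy = sum(yi * yi for yi in y)
    return jxy / math.sqrt(jx * jy)


def nei_distance(identity):
    """Genetic distance D = -ln(I)."""
    return -math.log(identity)


# Hypothetical two-allele locus in two populations
i_value = nei_identity([0.7, 0.3], [0.2, 0.8])
print(round(i_value, 4), round(nei_distance(i_value), 4))
```

Identical allele frequencies give I = 1 and therefore D = 0; the more the frequencies differ, the smaller I and the larger D.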
CLUSTER ANALYSIS AND DENDROGRAM CONSTRUCTION
In studies of intraspecific genetic structure it is advisable to have information on allele frequencies at
many polymorphic loci. An efficient way of illustrating the calculated similarities and differences between
groups is to perform cluster analysis. The method outlined below is the UPGMA (Unweighted Pair Group
Method with Arithmetic Mean), which is one way to present complex matrix data graphically. There are many
others.
First, the mean I or D between all pairwise combinations are calculated as explained above. The result can, e.g.,
be arranged in a matrix with the OTUs (Operational Taxonomic Units) along both axes.
The two OTUs with the smallest D (or largest I) between them are then fused into one OTU, and the I or D value
of this new OTU towards each of the others (the mean of the values that the two original OTUs had towards that
OTU) is recalculated in a new matrix. When a fused OTU already contains several original OTUs, the mean is taken
over all the original OTUs it contains. Then again the two OTUs with the smallest distance are fused, followed by
a new re-calculation. This procedure is repeated in a cyclical way until all the OTUs are parts of the same cluster.
The dendrogram which can be constructed from this gives a graphical presentation of the similarities between
OTUs in the total material.
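The cyclical procedure described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the algorithm of any of the programs named in this text: distances are supplied as a dictionary keyed by pairs of OTU names, fused OTUs are named by joining the names of their parts, and the four-OTU distances in the example are hypothetical.

```python
def upgma(distances):
    """Cluster OTUs by UPGMA.

    distances -- dict mapping a 2-tuple of OTU names to their distance;
                 every pair of OTUs must be present exactly once.
    Returns the fusions as (name_a, name_b, distance), in the order made.
    """
    d = {frozenset(pair): value for pair, value in distances.items()}
    sizes = {}  # OTU name -> number of original OTUs it contains
    for pair in d:
        for name in pair:
            sizes[name] = 1
    fusions = []
    while len(sizes) > 1:
        closest = min(d, key=d.get)          # pair with the smallest distance
        a, b = sorted(closest)
        fusions.append((a, b, d[closest]))
        merged = "(" + a + "+" + b + ")"
        new_d = {}
        for pair, value in d.items():        # keep distances not involving a or b
            if a not in pair and b not in pair:
                new_d[pair] = value
        for other in sizes:                  # distance from the fused OTU to the rest
            if other == a or other == b:
                continue
            to_a = d[frozenset((a, other))]
            to_b = d[frozenset((b, other))]
            # Weighting by size equals the mean over all original OTUs
            new_d[frozenset((merged, other))] = (
                (sizes[a] * to_a + sizes[b] * to_b) / (sizes[a] + sizes[b])
            )
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
        d = new_d
    return fusions


# Hypothetical distances between four OTUs
example = {
    ("A", "B"): 0.02, ("A", "C"): 0.10, ("B", "C"): 0.12,
    ("A", "D"): 0.30, ("B", "D"): 0.32, ("C", "D"): 0.28,
}
for a, b, dist in upgma(example):
    print(f"{a} + {b} at D = {dist:.3f}")
```

In this hypothetical case A and B are fused first (D = 0.020), C joins them next (D = 0.110), and D joins last (D = 0.300). The list of fusions and their distances is all that is needed to draw the dendrogram.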
Example (UPGMA cluster analysis of Nei’s genetic distances, and dendrogram construction):
BO3010 H05
POPULASJONSGENETIKK
Consider samples from 3 populations (OTUs). In each sample, genotypes at 3 loci (called HbI*, LDH-3* and
IDHP-1*) with 2 alleles at each are scored by electrophoresis, giving the following values (for the sake of simplicity,
the alleles are called S and F, and the genotypes thus SS, SF, and FF at all loci; qF and qS are the calculated
allele frequencies of F and S):
Population 1:
Locus      genotype SS   genotype SF   genotype FF   N     qF    qS
HbI*       25            50            25            100   0.5   0.5
LDH-3*     81            18            1             100   0.1   0.9
IDHP-1*    36            48            16            100   0.6   0.4
Population 2:
Locus      genotype SS   genotype SF   genotype FF   N     qF    qS
HbI*       81            18            1             100   0.9   0.1
LDH-3*     25            50            25            100   0.5   0.5
IDHP-1*    25            50            25            100   0.5   0.5
Population 3:
Locus      genotype SS   genotype SF   genotype FF   N     qF    qS
HbI*       64            32            4             100   0.8   0.2
LDH-3*     36            48            16            100   0.6   0.4
IDHP-1*    49            42            9             100   0.7   0.3
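Allele frequencies are obtained from genotype counts by gene counting: each homozygote contributes two copies of its allele and each heterozygote one copy of each. A minimal sketch, using the HbI* and LDH-3* counts of Population 1 as input (the function name is illustrative):

```python
def allele_frequencies(n_ss, n_sf, n_ff):
    """Return (qS, qF) from the genotype counts at a two-allele locus."""
    n = n_ss + n_sf + n_ff
    if n == 0:
        raise ValueError("no genotypes scored")
    q_s = (2 * n_ss + n_sf) / (2 * n)
    q_f = (2 * n_ff + n_sf) / (2 * n)
    return q_s, q_f


print(allele_frequencies(25, 50, 25))  # HbI*, Population 1   -> (0.5, 0.5)
print(allele_frequencies(81, 18, 1))   # LDH-3*, Population 1 -> (0.9, 0.1)
```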
Calculation of genetic distances:
Formulae:
$$ I = \frac{\sum_i x_i y_i}{\sqrt{\left(\sum_i x_i^2\right)\left(\sum_i y_i^2\right)}} $$
and
D = - ln(I)
Calculation of I-values and D-values from observed allele frequencies:
Population 1 versus population 2:
HbI*: I = (0.5*0.9 + 0.5*0.1) / SQR[(0.25+0.25)*(0.81+0.01)] = 0.7810
LDH-3*: I = (0.1*0.5 + 0.9*0.5) / SQR[(0.01+0.81)*(0.25+0.25)] = 0.7810
IDHP-1*: I = (0.6*0.5 + 0.4*0.5) / SQR[(0.36+0.16)*(0.25+0.25)] = 0.9803
Mean I = (0.7810+0.7810+0.9803) / 3 = 0.8474
Mean D = -ln(0.8474) = 0.1656
Population 1 versus population 3:
HbI*: I = 0.8575
LDH-3*: I = 0.8274
IDHP-1*: I = 0.9820
Mean I = 0.8890
Mean D = 0.1180
Population 2 versus population 3:
HbI*: I = 0.9906
LDH-3*: I = 0.9803
IDHP-1*: I = 0.9285
Mean I = 0.9665
Mean D = 0.034
Presenting the results of calculations in the first cycle in matrix form:
Matrix 1. Standard Genetic Distances (Nei 1972)
               Population 1   Population 2   Population 3
Population 1   -              0.1656         0.1180
Population 2                  -              0.034
Population 3                                 -
The smallest value of pairwise genetic distance in this matrix is between populations 2 and 3. Therefore, these
two populations are combined into one (and will be connected by the lowest level bifurcation in the dendrogram).
The genetic distance between this ‘combined’ population and population 1 is then calculated as the arithmetic
mean of the two distances that population 2 and population 3 originally had towards population 1, i.e. mean
D=(0.1656+0.1180)/2 = 0.1418, and a new matrix can be filled in:
Matrix 2. Standard Genetic Distances (Nei 1972)
                     Population 1   Population (2+3)
Population 1         -              0.1418
Population (2+3)                    -
This procedure of joining the nodes with the smallest D-value in each cycle and then recalculating the matrix
proceeds until all populations have been joined. In the current example with three populations, the dendrogram,
drawn on the basis of the values in matrices 1 and 2, has two nodes: the lower one joins populations 2 and 3, and
the upper one joins population 1 to the combined population (2+3).
Among relevant software for cluster analysis and dendrogram construction are e.g.:
DG25.exe, DG50.exe, DG100.exe (J. Mork), BIOSYS (D. Swofford), GNKDST (M. Nei)
oooooooooooooOOOOOOOOOooooooooooooo