BO3010 H05 POPULASJONSGENETIKK (POPULATION GENETICS)

ANALYSIS OF GENETIC DIFFERENTIATION AND STRUCTURE (J. Mork, TBS)

Evolution can be defined as any change in population gene frequency. Given sufficient time and degree of isolation, the evolutionary forces (mutation, genetic drift, gene flow and selection) will eventually result in different gene frequencies in different populations («time» may mean anything from a few to hundreds of thousands of generations, depending on population size).

There are many models for describing genetic differentiation. One of the best known and most frequently used is Sewall Wright’s «Mainland-Island model». It is based on a situation where a start population («Mainland») is split into many isolated subpopulations («Islands»), and it describes the genetic differentiation between these over generations using formulae which include, e.g., population size, migration rates, and the number of generations since the population split (and thereby since reproductive isolation began).

Wright used a specific statistic, Fst, as a measure of the degree of differentiation. The Fst value tells what proportion of the total genetic variability in the material is caused by genetic differences between populations (the rest is due to differences between individuals, i.e. within populations). For example, an Fst value of 0.10 means that 10% of the total variation can be attributed to differences between samples.

It is worth mentioning that because Fst is a relative measure (the between-population component relative to the total), its value is not expected to be affected by the type of genetic marker used (i.e., markers with different evolutionary rates, such as isozymes and microsatellites, would be expected to yield similar Fst estimates when applied to the same material). Measures of absolute genetic differences, on the other hand (like Nei’s genetic distance D), are expected to give different results depending on the evolutionary rate (mutation rate) of the marker in question.
For example, mini- and microsatellites are expected to give larger D values than isozymes, and this has been shown to be the case in practice as well.

WRIGHT’S FST (A RELATIVE MEASURE OF DIFFERENTIATION)

To understand the nature of Fst it is useful to have some knowledge of the Hardy-Weinberg theorem and the so-called Wahlund effect. The latter states that in a physical mixture (i.e., not an interbred group) of individuals from two or more populations with different gene frequencies, the mixture will show a deficit of heterozygotes compared to the expected (Hardy-Weinberg) proportion calculated from the joint gene frequency in the mixture. This effect is easy to understand by looking at the extreme situation where two populations with gene frequencies of 1.0 and 0.0, respectively, each provide one half of a mixed sample. The joint gene frequency in the mixed group will necessarily be 0.5, and from this the Hardy-Weinberg theorem gives an expected proportion of heterozygotes of 2*0.5*0.5 = 0.5, while the mixed group actually has no heterozygotes! Smaller differences in allele frequencies, or skewed proportions between the groups involved, will of course create correspondingly smaller deficits of heterozygotes, but whenever observed, a significant deficit of heterozygotes is an indication that our sample consists of a mixture of two or more populations with different gene frequencies.

Observed heterozygosity is simply the proportion of heterozygotes in a sample. Expected heterozygosity (H) at a locus, however, is calculated from the observed allele frequencies:

$H = 1 - \sum_i x_i^2$

where $x_i$ is the frequency of the i-th allele. Mean expected H is written with a ‘bar’ above it ($\bar{H}$) and is the arithmetic average of H over all the investigated loci (usually both monomorphic and polymorphic loci are included).

Relevant software: Hetzyg.exe (J. Mork)

The rationale behind Fst is that at the ‘start’ of differentiation (i.e., when reproductive isolation occurs), all the (sub)populations have the same allele frequencies and genotype frequencies at all loci. Assume that the allele frequency at a two-allele polymorphic locus is 0.5. Over generations, the allele frequencies, and thereby the genotype frequencies, will diverge between populations. The amount of divergence due to genetic drift will depend on population sizes and the number of generations. If, at one point in time, all the genotypes in all the populations are pooled in one large table, the proportion of heterozygotes will be lower than that calculated (the H-W expectation) from the ‘joint’ allele frequencies of that mixed group. The deficit will increase with time (generations), until eventually all the (sub)populations are fixed for one or the other allele and no heterozygotes are observed at all. (The expected proportion, which is based on joint allele frequencies, is however constant and hence the same as in the undivided start population.) One way to look at this process is that the genetic variability, which at the start was located entirely within populations, is progressively transformed into variability between populations.

Fst is a measure of the fraction of the total genetic variation which can be attributed to differences between populations (the so-called ‘between’ component). The formula for Fst is:

$F_{st} = 1 - H_{mean} / H_{total}$

where Hmean is the arithmetic mean of the heterozygosities in all the subpopulations, while Htotal is the expected heterozygosity based on the joint allele frequency in the pooled subpopulations. It is evident from the formula that Fst equals 0 when the subpopulations are identical in allele frequencies, and 1 when they are fixed for different alleles. Wright’s Fst is basically a measure for single loci.
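The Fst formula and the Wahlund effect described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical (they are not taken from Hetzyg.exe or any other program mentioned in this text), Hmean is taken as the mean of the expected (Hardy-Weinberg) heterozygosities within the subpopulations, and all subpopulations are assumed to contribute equally to the pooled group.

```python
def expected_heterozygosity(freqs):
    """H = 1 - sum(x_i^2) for the allele frequencies at one locus."""
    return 1.0 - sum(x * x for x in freqs)


def fst(subpop_freqs):
    """Fst = 1 - Hmean / Htotal for a single locus.

    subpop_freqs: one list of allele frequencies per subpopulation.
    Assumption: equal-sized subpopulations, so the joint allele
    frequencies are the plain average across subpopulations.
    """
    n = len(subpop_freqs)
    h_mean = sum(expected_heterozygosity(f) for f in subpop_freqs) / n
    joint = [sum(column) / n for column in zip(*subpop_freqs)]
    h_total = expected_heterozygosity(joint)
    if h_total == 0:
        # All subpopulations fixed for the same allele: nothing to partition.
        raise ValueError("locus is monomorphic in the pooled group")
    return 1.0 - h_mean / h_total


# Identical subpopulations: no differentiation.
print(fst([[0.5, 0.5], [0.5, 0.5]]))  # 0.0
# The extreme Wahlund case: fixed for different alleles.
print(fst([[1.0, 0.0], [0.0, 1.0]]))  # 1.0
```

With unequal sample sizes the joint frequencies would need to be weighted by sample size; that refinement is left out here to keep the sketch short.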
Masatoshi Nei has suggested another statistic which utilizes information from several loci simultaneously. The statistic, called Gst, is analogous to Wright’s Fst, but is calculated from allele frequencies rather than genotype frequencies (assuming Hardy-Weinberg equilibrium in all subpopulations).

Relevant software: GSA.exe (J. Mork)

NEI’S I AND D (ABSOLUTE MEASURES OF DIFFERENTIATION)

Masatoshi Nei has also suggested another measure, called D (genetic distance), which provides an estimate of the absolute genetic differences between populations («.. mean number of amino acid substitutions per locus»). This statistic utilizes allele frequencies at multiple loci, and is calculated for each locus via the statistic I («genetic identity»). The formula is:

$I = \frac{\sum_i x_i y_i}{\sqrt{(\sum_i x_i^2)(\sum_i y_i^2)}}$

where $x_i$ and $y_i$ are the frequencies of the i-th allele in populations X and Y, respectively (in the plain-text calculations further on, SQR means square root). Furthermore,

$D = -\ln(I)$

When dealing with more than one locus it is common to report a mean D; in the worked example further on, the per-locus I values are averaged and the mean D is taken as -ln of the mean I.

Relevant software: DG25.exe, DG50.exe, DG100.exe (J. Mork), BIOSYS (D. Swofford)

CLUSTER ANALYSIS AND DENDROGRAM CONSTRUCTION

In studies of intraspecific genetic structure it is advisable to have information on allele frequencies at many polymorphic loci. An efficient way of illustrating the calculated similarities and differences between groups is to perform cluster analysis. The method outlined below is UPGMA (Unweighted Pair Group Method with Arithmetic Mean), which is one way to present complex matrix data graphically. There are many others.

First, the mean I or D between all pairwise combinations is calculated as explained above. The result can, e.g., be arranged in a matrix with the OTUs (Operational Taxonomic Units) along both axes.
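The identity and distance formulae, and the pairwise matrix just mentioned, can be sketched as follows. This is a minimal illustration, not the method used by DG25.exe or BIOSYS: the function names and the allele frequencies for the three OTUs are hypothetical, and the multi-locus step follows the convention of this text's worked example (average the per-locus I values, then take D = -ln of that mean). Nei's standard multi-locus formulation instead averages the sums over loci before forming the ratio, which can give somewhat different values.

```python
import math
from itertools import combinations


def nei_identity(x, y):
    """Single-locus I = sum(x_i*y_i) / sqrt(sum(x_i^2) * sum(y_i^2))."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den


def mean_identity_and_distance(pop_x, pop_y):
    """Average per-locus I over loci, then return (mean I, D = -ln(mean I)).

    pop_x, pop_y: lists of per-locus allele-frequency lists, same locus order.
    """
    ids = [nei_identity(x, y) for x, y in zip(pop_x, pop_y)]
    mean_i = sum(ids) / len(ids)
    # No shared alleles at any locus gives I = 0 and an infinite distance.
    d = math.inf if mean_i == 0 else -math.log(mean_i)
    return mean_i, d


def distance_matrix(pops):
    """Pairwise D for a dict {OTU name: per-locus allele frequencies}."""
    return {(a, b): mean_identity_and_distance(pops[a], pops[b])[1]
            for a, b in combinations(pops, 2)}


# Hypothetical allele frequencies at two loci (illustration only).
pops = {
    "A": [[0.5, 0.5], [0.9, 0.1]],
    "B": [[0.9, 0.1], [0.5, 0.5]],
    "C": [[0.5, 0.5], [0.9, 0.1]],
}
for pair, d in distance_matrix(pops).items():
    print(pair, round(d, 4))
```

Only the upper triangle of the matrix is stored, since D is symmetric and the distance of an OTU to itself is zero.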
The two OTUs with the smallest D (or largest I) between them are then fused into one OTU, and the I or D value for this new OTU towards all the others is recalculated in a new matrix as the mean of the values of the two original OTUs (in later cycles, when the fused groups differ in size, UPGMA weights this mean by the number of original OTUs in each group). Then again the two OTUs with the smallest distance are fused, followed by a new recalculation. This procedure is repeated cyclically until all the OTUs are part of the same cluster. The dendrogram which can be constructed from this gives a graphical presentation of the similarities between OTUs in the total material.

Example (UPGMA cluster analysis of Nei’s genetic distances, and dendrogram construction):

Consider samples from 3 populations (OTUs). In each sample, genotypes at 3 loci (called HbI*, LDH-3* and IDHP-1*) with 2 alleles at each are scored by electrophoresis, giving the following values (for the sake of simplicity, the alleles are called S and F, and the genotypes thus SS, SF and FF at all loci; qF and qS are the allele frequencies of F and S calculated from the genotype counts):

Population 1:
Locus          HbI*   LDH-3*   IDHP-1*
genotype SS      25      81       36
genotype SF      50      18       48
genotype FF      25       1       16
N               100     100      100
qF              0.5     0.1      0.4
qS              0.5     0.9      0.6

Population 2:
Locus          HbI*   LDH-3*   IDHP-1*
genotype SS      81      25       25
genotype SF      18      50       50
genotype FF       1      25       25
N               100     100      100
qF              0.1     0.5      0.5
qS              0.9     0.5      0.5

Population 3:
Locus          HbI*   LDH-3*   IDHP-1*
genotype SS      64      36       49
genotype SF      32      48       42
genotype FF       4      16        9
N               100     100      100
qF              0.2     0.4      0.3
qS              0.8     0.6      0.7

Calculation of genetic distances:

Formulae: $I = \sum_i x_i y_i / \sqrt{(\sum_i x_i^2)(\sum_i y_i^2)}$, and $D = -\ln(I)$

Calculation of I values and D values from observed allele frequencies (I values are shown to four decimals; means and D values are computed from the unrounded I values):

Population 1 versus population 2:
HbI*: I = (0.5*0.9 + 0.5*0.1) / SQR[(0.25+0.25)*(0.81+0.01)] = 0.7809
LDH-3*: I = (0.9*0.5 + 0.1*0.5) / SQR[(0.81+0.01)*(0.25+0.25)] = 0.7809
IDHP-1*: I = (0.6*0.5 + 0.4*0.5) / SQR[(0.36+0.16)*(0.25+0.25)] = 0.9806
Mean I = 0.8474
Mean D = -ln(0.8474) = 0.1655

Population 1 versus population 3:
HbI*: I = 0.8575
LDH-3*: I = 0.8882
IDHP-1*: I = 0.9833
Mean I = 0.9097
Mean D = 0.0947

Population 2 versus population 3:
HbI*: I = 0.9910
LDH-3*: I = 0.9806
IDHP-1*: I = 0.9285
Mean I = 0.9667
Mean D = 0.0339

Presenting the results of calculations in the first cycle in matrix form:

Matrix 1. Standard Genetic Distances (Nei 1972)
                Population 1   Population 2   Population 3
Population 1         --           0.1655         0.0947
Population 2                        --           0.0339
Population 3                                       --

The smallest value of pairwise genetic distance in this matrix is between populations 2 and 3. Therefore, these two populations are combined into one (and will be connected by the lowest-level bifurcation in the dendrogram). The genetic distance between this ‘combined’ population and population 1 is then calculated as the arithmetic mean of the two distances that population 2 and population 3 originally had towards population 1, i.e. mean D = (0.1655+0.0947)/2 = 0.1301, and a new matrix can be filled in:

Matrix 2. Standard Genetic Distances (Nei 1972)
                     Population 1   Population (2+3)
Population 1              --            0.1301
Population (2+3)                          --

This procedure of joining the nodes with the smallest D value in each cycle and then recalculating the matrix proceeds until all populations have been joined. In the current example with three populations there will be two nodes (population 1 and the combined population 2/3) in the dendrogram, which can be drawn on the basis of the values in matrices 1 and 2: populations 2 and 3 are joined at D = 0.0339, and population 1 is joined to the combined population (2+3) at D = 0.1301.

[Figure: UPGMA dendrogram of populations 1, 2 and 3]

Among relevant software for cluster analysis and dendrogram construction are, e.g.: DG25.exe, DG50.exe, DG100.exe (J. Mork), BIOSYS (D. Swofford), GNKDST (M. Nei)
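The cyclical fuse-and-recalculate procedure described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the algorithm as implemented in DG25.exe, BIOSYS or GNKDST: the function name `upgma` and the four-OTU distance matrix are hypothetical. The recalculated distance is weighted by the number of original OTUs in each fused group, which reduces to the plain mean used in the three-population example whenever two single OTUs are fused.

```python
def upgma(dist):
    """Cluster OTUs by UPGMA.

    dist: {(a, b): D} with one entry per unordered pair of OTU labels.
    Returns the merges in order as (cluster_a, cluster_b, D), where a
    cluster is a tuple of the original OTU labels it contains.
    """
    # Every OTU starts as its own one-member cluster.
    d = {frozenset([(a,), (b,)]): v for (a, b), v in dist.items()}
    clusters = sorted({c for pair in d for c in pair})
    merges = []
    while len(clusters) > 1:
        # Fuse the two clusters with the smallest distance between them.
        pair = min(d, key=d.get)
        a, b = sorted(pair)
        merges.append((a, b, d[pair]))
        new = a + b
        clusters = [c for c in clusters if c not in (a, b)]
        # Distance from the fused cluster to each remaining cluster:
        # mean of the old distances, weighted by cluster size.
        for c in clusters:
            d[frozenset([new, c])] = (
                len(a) * d[frozenset([a, c])] + len(b) * d[frozenset([b, c])]
            ) / (len(a) + len(b))
        # Discard every entry that still refers to the two fused clusters.
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        clusters.append(new)
    return merges


# Hypothetical distances between four OTUs (illustration only).
example = {
    ("A", "B"): 0.10, ("A", "C"): 0.30, ("A", "D"): 0.50,
    ("B", "C"): 0.34, ("B", "D"): 0.54, ("C", "D"): 0.40,
}
for a, b, level in upgma(example):
    print("+".join(a), "joins", "+".join(b), "at D =", round(level, 4))
```

In this hypothetical matrix, A and B fuse first, then C joins (A+B), and finally D joins (A+B+C); in the last step the size weighting matters, because the group (A+B) counts twice and C once. The list of merges, read from first to last, gives the bifurcation levels of the dendrogram from the lowest to the highest.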