Signatures of Selection

Different types of selection leave different signatures on the genome:

* Negative selection: reduces variation at the affected site(s), and also at neighboring sites through background selection
* Positive selection through a recent selective sweep: reduces variation flanking the selected site (even if the flanking sites are neutral) due to hitchhiking
* Diversifying selection: can increase variation, since more than one extreme allele is selected
  e.g. selection for diverse viral antigens to evade the host immune system
* Balancing selection: can increase variation by maintaining more than one allele in the population
  e.g. maintained heterozygosity (sickle cell anemia) OR different alleles in different subpopulations due to fluctuating environments

Signatures of Selection

There are also different methods of looking for these signatures:

1. Evolutionary rate within species vs. between species
   e.g. Ka/Ks ratio & McDonald-Kreitman tests for coding sequences; HKA and multi-locus HKA tests for non-coding sequences
2. Frequency spectrum: frequency of different alleles in the population
   e.g. Tajima’s D … Fay & Wu’s H … Fu & Li’s D*
3. Linkage disequilibrium & haplotype structure

For all of these tests: compare REAL DATA to a MODEL of what the data should look like under neutral evolution … can also compare test results at specific loci vs. a scan across the genome

Methods based on the Allele Frequency Spectrum

1.
For each ‘derived’ (= non-ancestral) allele at a given locus, calculate its frequency. Some alleles will be at high frequency in the population, some at low frequency (i.e. very uncommon)
2. Make a histogram of the % of alleles at each frequency, looking for an excess of rare alleles or of common alleles

From Nielsen Nat Rev Gen 2005 review

Methods based on the Allele Frequency Spectrum

Tajima’s D (F. Tajima, 1989): takes the # of segregating sites within species ($S$) and also the average # of differences between each pair of sequences ($\pi$)

Example: $S = 3$
$\pi = \frac{(2 + 2 + 1 + 2) + (2 + 1 + 0) + (1 + 2) + (1)}{10} = \frac{14}{10} = 1.4$
(10 pairwise comparisons; $\pi$ = avg. # of differences between each pair of sequences)

Tajima’s D compares $S$ and $\pi$ to estimate the proportion of low/high-frequency alleles

$S$ versus $\pi$ reflects allele frequencies
Multiple ways to calculate $\theta$:
$\theta = \pi$
$\theta = S/a$, where $a = \sum_{i=1}^{n-1} 1/i$ for $n$ sequences

Negative Tajima’s D = excess of low-frequency alleles (= reduced variation) ($\pi < S/a$)
Indicates positive selection, OR recent deleterious alleles, OR population expansion**

Positive Tajima’s D = excess of intermediate-frequency alleles ($\pi > S/a$)
(low amounts of both high- and low-frequency alleles)
Indicates balancing selection OR partial sweep OR population bottleneck**

How can you get a p-value? Difficult to estimate; best to compare across loci

Empirical model for significance of Tajima’s D
Sliding window across a locus (From Nielsen Nat Rev Gen 2005 review)
OR
Compare to several other loci
From Will et al.
PLoS Genetics 2010

Genome-wide scans of FST

FST is a measure of population subdivision: the proportion of the total genetic variance (T) contained in a subpopulation (S) relative to the total genetic variance in the species

$F_{ST} = \frac{\pi_T - \pi_S}{\pi_T}$
where $\pi$ = average # of pairwise nucleotide differences per site

If $\pi_S = \pi_T$ (i.e. the amount of variation in the subpopulation is the same as in the total population): $F_{ST} = 0$ … NO population subdivision
If there’s variation in the total sample, but NO variation within each subpopulation: $\pi_S = 0$ and $F_{ST} = 1$ … COMPLETE differentiation between subpopulations

FST = 1: very strong population subdivision … there may be little gene flow between those populations

Genome-wide scans of FST

Difficult to interpret what a given FST means (FST = 0.15 means ???)
But, can use variation in FST across the genome to look for evidence of partial selective sweeps in specific subpopulations: i.e. little gene flow at specific loci only

From Akey et al. 2002: FST across each human chromosome

LD & Haplotype Structure

Linkage equilibrium: when the segregation of two different alleles is independent of one another
Linkage disequilibrium (LD): segregation of two alleles is NOT random
- two SNPs in close proximity are linked physically
- can measure the distance over which their association breaks down

LD breakdown depends on generation time and recombination rate
SNPs very close together will take many generations to get separated

Haplotype: block of linked SNPs
[Figure: Haplotype 1, Haplotype 2, and Haplotype 3 at Locus A]

LD & Haplotype Structure

Remember that a recent selective sweep can reduce variation flanking the advantageous site.
The strength of selection and the time since the sweep affect the degree and length of reduced variation.
This effectively creates an unusually long haplotype (compared to others in the genome)

EHH: Extended Haplotype Homozygosity
test for RECENT positive selection

Recent positive selection through a partial selective sweep:
* extended haplotype length
* high frequency in subpopulation
must account for regional differences in recombination rates
[Figure: population labels Yoruban, Beni, African, Shona, European, Asian]

EHH = % of individuals sharing the CORE haplotype that remain identical out to a distance of x
[Figure: defined core haplotype]

Relative EHH: normalize the EHH of one haplotype to the EHH of all others at that locus
internally controls for locus-specific effects
[Figure: African haplotype]

EHH: Extended Haplotype Homozygosity (& other methods)
test for RECENT positive selection

Related test from Jonathan Pritchard: iHS test

Benefits of EHH & iHS scans:
* Don’t have to know populations a priori … define by haplotypes
* More sensitive than traditional tests for selection

Remaining challenges:
* Often have no idea WHY - how to link to phenotypes of interest?
  Stinchcombe & Hoekstra review: combining scans with QTL mapping
* Often unclear what SNP was selected for … identifies huge regions

Science. February 12, 2010

CMS incorporates results of 5 different tests:
FST
iHS & XP-EHH
DAF (looking at derived allele frequencies)
iHH (looking at absolute haplotype length)

CMS outperforms single tests in simulated data
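
Worked sketch (not part of the original slides): the Tajima’s D and FST slides earlier in this deck describe the calculations in words, so a small Python example may help make them concrete. The five-sequence alignment `toy_alignment` is hypothetical; it was constructed only so that it reproduces the counts on the Tajima’s D slide (S = 3, 10 pairwise comparisons, average of 1.4 differences). The function names are illustrative, and the variance normalization used in `tajimas_d` is the standard one from Tajima (1989), which the slides do not spell out. `fst` simply evaluates the slide formula from two diversity values supplied by the caller.

```python
from itertools import combinations
from math import sqrt


def segregating_sites(seqs):
    """S: number of alignment columns where at least two sequences differ."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def mean_pairwise_differences(seqs):
    """pi: average number of differences over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total / len(pairs)


def tajimas_d(seqs):
    """Tajima's D = (pi - S/a1) / sqrt(estimated variance of that difference)."""
    n = len(seqs)
    s = segregating_sites(seqs)
    if s == 0:
        raise ValueError("Tajima's D is undefined when there are no segregating sites")
    pi = mean_pairwise_differences(seqs)
    # Constants from Tajima (1989); a1 is the 'a' in theta = S/a on the slide.
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


def fst(pi_total, pi_sub):
    """FST = (pi_T - pi_S) / pi_T, as written on the FST slide."""
    return (pi_total - pi_sub) / pi_total


# Hypothetical alignment: 5 sequences, 3 variable sites (0 = ancestral, 1 = derived).
toy_alignment = ["100", "010", "001", "000", "010"]

print(segregating_sites(toy_alignment))          # 3
print(mean_pairwise_differences(toy_alignment))  # 1.4
print(tajimas_d(toy_alignment))                  # slightly negative, about -0.17
print(fst(1.4, 1.4), fst(1.4, 0.0))              # the two limiting cases: 0.0 and 1.0
```

With so few sequences the D value carries no statistical weight; as the slides note, significance is better judged empirically, for example by comparing a locus against a sliding window or against several other loci.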