Comparing human populations to identify positive selection. Practical.

Simon Myers and Gil McVean

The Concept

As you saw in the first practical, mutations under positive selection can arise and then reach high frequency in a population (and eventually fix). This has important implications when we compare diversity between two populations, for example in humans. Although most diversity is shared among human groups, some genetic differences exist between human populations as a result of historical separation. One example is the Yoruba from Nigeria and Utah residents of European ancestry, two populations that have been separated for at least ~50,000 years, and for which extensive variation data have already been gathered by the HapMap project [1].

Two possible scenarios might make some positively selected mutations look more different between separated populations than “typical” mutations. First, the two groups A and B may live in different environments, so selection on an existing mutation may act in only one population, say A. In this case, the mutation might have reached high frequency in population A, where it is advantageous, but still be rare in population B. Second, selective forces may be the same in both groups, but a new advantageous mutation may arise in population A after the populations separated. In this case, selection may again allow the mutation to reach high frequency in population A (more rapidly than a typical mutation would), while it remains virtually absent in population B. Either case can result in the same signal: a mutation at very different frequencies in populations A and B. Looking for such unusually differentiated SNPs (Single Nucleotide Polymorphisms) can then help identify selection acting in one or both populations.

The key to finding such mutations is to use diversity surveys to identify particular SNPs that are unusually diverged between groups, relative to the distribution of divergence across SNPs in the genome.
Knowledge of the function of a particular mutation can add evidence and aid understanding of the selection operating.

The Data

The file “chrom15data.txt” contains the population frequencies of SNPs, based on the HapMap data [1], in a 5-million-base region of human chromosome 15. The file contains six columns: a unique name for the SNP (called the rs identifier), its position in bases along the chromosome, and then the frequency of the SNP observed in samples from each of four populations, originating from Europe (CEU), Africa (YRI), Japan (JPT) and China (CHB), respectively.

Tasks

1. Load in “chrom15data.txt”. To do this and store the result as “data”, you can use the following commands:

    fid = fopen('chrom15data.txt', 'r');
    data = textscan(fid, '%s %n %n %n %n %n');
    fclose(fid);

textscan returns a cell array with one cell per column, so data{1} holds the rs identifiers, data{2} the positions, and data{3} to data{6} the four frequency columns.

For each pair of populations A and B, plot the column of SNP frequencies in population A against the column of SNP frequencies in population B. (There will be 6 plots altogether.) Which pairs of groups show the most similar and the most different allele frequencies? Can you explain this?

2. Find the SNP(s) with the largest absolute difference in frequency between CEU and YRI, between CEU and JPT, and between CEU and CHB. Use the results to identify the single SNP that most strongly separates CEU from the other three groups.

3. A more formal measure of differentiation between two groups A and B at a particular SNP is given by $F_{ST}$. For a single SNP, this is calculated via the following formula:

$$F_{ST} = 1 - \frac{f_A (1 - f_A) + f_B (1 - f_B)}{2 f (1 - f)}$$

Here $f_A$ and $f_B$ are the observed SNP frequencies for populations A and B, and $f$ is the mean SNP frequency:

$$f = \frac{f_A + f_B}{2}$$

$F_{ST}$ measures the proportion of variation at the SNP explained by between-population variability, relative to the total variation when the populations are combined, and lies between 0 and 1. Larger $F_{ST}$ values indicate more differentiated SNPs.
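
As a quick sanity check on the single-SNP formula before applying it to the data, here is a worked example with hypothetical frequencies (chosen purely for illustration, not taken from “chrom15data.txt”). Suppose $f_A = 0.9$ and $f_B = 0.1$:

$$
f = \frac{0.9 + 0.1}{2} = 0.5, \qquad
F_{ST} = 1 - \frac{0.9 \times 0.1 + 0.1 \times 0.9}{2 \times 0.5 \times 0.5} = 1 - \frac{0.18}{0.5} = 0.64
$$

If instead the two frequencies are equal, say $f_A = f_B = 0.3$, the numerator and denominator are both $2 \times 0.3 \times 0.7 = 0.42$, so $F_{ST} = 0$. When both frequencies are 0, or both are 1, the denominator is zero and the formula is undefined, so such SNPs may need separate handling when you compute the values.
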
Calculate and store a vector of single-SNP $F_{ST}$ values between CEU and JPT for each of the 1889 SNPs in the region. Plot the $F_{ST}$ value against SNP position. Which SNP has the largest $F_{ST}$ between these two populations?

4. Look for functional evidence at the SNP identified in tasks 2 and 3 by going to the following URL, which is the genome browser for the HapMap:

http://hapmap.ncbi.nlm.nih.gov/cgi-perl/gbrowse/hapmap27_B36/

Type the rs identifier of the SNP you have found into the “Landmark or Region” box and press enter. What results have genome-wide association studies found for this SNP? By zooming out, find the gene nearest to the SNP you have found. Click on this gene to obtain additional information.

5. What trait do these results suggest selection may have acted on in human populations?

Reference

1. Frazer, K.A. et al. A second generation human haplotype map of over 3.1 million SNPs. Nature 449, 851-61 (2007).