Population structure analysis of microsatellite data using STRUCTURE

STRUCTURE uses multi-locus genotype data to investigate population structure. It can be applied to many commonly used genetic markers, including microsatellites, AFLPs and SNPs. STRUCTURE differs from GenAlEx in that it does not (necessarily) use predefined geographical populations; instead it tries to find the most parsimonious set of “genetic clusters” (each characterised by its multi-locus allele frequencies). While finding the genetic clusters, STRUCTURE estimates, for each individual, the probability that its multi-locus genotype belongs to each of the suggested clusters.

The program uses Bayesian (likelihood-based) methods and Markov chain Monte Carlo (MCMC) simulations. With MCMC, the search for the most likely genetic clusters is biased away from all combinations of allele frequencies (A below) and towards frequencies with higher likelihoods (B below). Because the search concentrates on the regions of allele frequency space with high likelihood, it can find the most likely solution for most data sets with moderate computing power.

Your analyses in GenAlEx suggested that there is some population genetic structure in great reed warblers (FST ≈ 0.04), and in particular that the populations in Kazakhstan and, to some extent, Sweden were differentiated from the other European populations. Can you confirm this pattern using STRUCTURE?

1. Start STRUCTURE (use version 2.3.3 or later).
2. Create a “New project” by importing the data file “Warbler-STRUCTURE.txt”. The data set contains 238 individuals and 6 loci; missing data are indicated by “-9”. There is a row with marker names, and each individual is represented by a single line. There are columns with individual ID, putative population origin, and sampling locality. Check the imported data set.
3. Set up the analyses.
Create a “New parameter set” with a “burn in” of 10,000 and set the number of MCMC repetitions to 40,000. Use an “Admixture model” (this means that individuals may have mixed ancestry, i.e. you allow some dispersal between populations) and do not use “sampling locations as prior” (this means that you will not use any information about where the birds were sampled). Use default settings for all other options. Call this parameter set “10k40kAdmixture”.
4. Then “Start a job” by choosing the “10k40kAdmixture” parameter set, and set K = 1–8 with 1 iteration. K is the number of “genetic clusters” you wish the program to define. If the program crashes, restart it, open the project you have created and start the job again.
5. The statistics of the results are given in the Summary Table. Which K has the highest “Estimated Ln Prob of Data”?

Estimated Ln Prob of Data:
K = 1 ________
K = 2 ________
K = 3 ________
K = 4 ________
K = 5 ________
K = 6 ________
K = 7 ________
K = 8 ________
Most likely K: _______

Select “show bar plot” for K = 2–5. When you have opened the graph, select “Group by POP Id” to show the 8 sampling populations. These graphs show the probability that each individual, grouped according to sampling location, belongs to each of the different clusters. Fill in the pattern in the boxes below (or use a screenshot):

K=2
K=3
K=4
K=5
Population:  1-Spa  2-Swe  3-Lat  4-Ger  5-Hun  6-Bel  7-Ukr  8-Kaz

Read this: The program can have problems finding population structure when the population differentiation is weak (FST < 0.05) and when there is isolation by distance. This seems to be the case in great reed warblers! In such situations, one can use information about where the samples were collected. You will now test this option.

6. Create a “New parameter set” with a “burn in” of 10,000 and 40,000 MCMC repetitions. Use an “Admixture model” and this time select “use sampling locations as prior”. Use default settings for all other options. Call this parameter set “10k40kAdmixLocation”.
7. “Start a job”.
Use “10k40kAdmixLocation”, K = 1–8, and 1 iteration.
8. Evaluate the results. Which K has the highest “Estimated Ln Prob of Data”?

Estimated Ln Prob of Data:
K = 1 ________
K = 2 ________
K = 3 ________
K = 4 ________
K = 5 ________
K = 6 ________
K = 7 ________
K = 8 ________
Most likely K: _______

Select “show bar plot” for each K. Select “Group by POP Id” to show the 8 sampling populations (see Figure 1). Again, the graphs show the probability that each individual, grouped according to sampling location, belongs to each of the different genetic clusters. Fill in the pattern in the boxes below (or use screenshots):

K=2
K=3
K=4
K=5
Population:  1-Spa  2-Swe  3-Lat  4-Ger  5-Hun  6-Bel  7-Ukr  8-Kaz

☺ Compare and discuss the output from the different runs, and compare it with the previous FST results: Is there any evidence for population structure? Does STRUCTURE conflict with the patterns detected with GenAlEx? Individuals can be assigned to more than one “genetic cluster”; why?

Useful link: http://taylor0.biology.ucla.edu/structureHarvester/#

This exercise was put together by Bengt Hansson March 19, 2014.
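Optional: picking the “Most likely K” in steps 5 and 8 amounts to finding the K with the highest (least negative) “Estimated Ln Prob of Data”. The Python sketch below is only an illustration of that comparison and is not part of STRUCTURE. The function name `most_likely_k` is hypothetical, and the numbers in `example_values` are made-up placeholders, not results from the warbler data; replace them with the values you wrote down from the Summary Table.

```python
def most_likely_k(ln_prob_by_k):
    """Return the K with the highest Estimated Ln Prob of Data.

    ln_prob_by_k maps each K (int) to the value read from the
    Summary Table (float, normally negative).
    """
    if not ln_prob_by_k:
        raise ValueError("no Ln Prob values were supplied")
    # The highest value is the one closest to zero when all are negative.
    return max(ln_prob_by_k, key=ln_prob_by_k.get)


# Placeholder numbers for illustration only (NOT real results).
example_values = {1: -5000.0, 2: -4950.0, 3: -4975.0, 4: -5100.0}

if __name__ == "__main__":
    print("Most likely K:", most_likely_k(example_values))
```

Because each K is run for only 1 iteration here, small differences in Ln Prob between neighbouring K values should be interpreted with caution.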