
Population structure analysis of microsatellite
data using STRUCTURE
STRUCTURE uses multi-locus genotype data to investigate population structure. It
can be applied to many commonly used genetic markers, including microsatellites, AFLPs
and SNPs. STRUCTURE differs from GenAlEx in the sense that it does not
(necessarily) use predefined geographical populations; instead it tries to find the most
parsimonious set of “genetic clusters” (each characterized by its multi-locus allele
frequencies). In the process of finding the genetic clusters, STRUCTURE estimates, for
each individual, the probability that its multi-locus genotype belongs to each of the
suggested clusters.
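To make the idea of assigning a genotype to clusters concrete, here is a minimal Python sketch. It is not STRUCTURE's algorithm: STRUCTURE estimates the cluster allele frequencies and the assignments jointly, whereas this sketch takes the frequencies as given and assumes Hardy-Weinberg equilibrium, independent loci, no admixture and equal prior weight for every cluster. The function names, allele sizes and frequencies are hypothetical illustrations.

```python
def genotype_likelihood(genotype, freqs):
    """Probability of a diploid multi-locus genotype in one cluster.

    genotype: one (allele_a, allele_b) tuple per locus; None marks
              missing data and is skipped.
    freqs:    one {allele: frequency} dict per locus for this cluster.
    Assumes Hardy-Weinberg equilibrium and independent loci.
    """
    likelihood = 1.0
    for locus, alleles in enumerate(genotype):
        if alleles is None:
            continue  # a missing locus carries no information
        a, b = alleles
        p = freqs[locus].get(a, 0.0)
        q = freqs[locus].get(b, 0.0)
        likelihood *= p * p if a == b else 2.0 * p * q
    return likelihood


def assignment_probabilities(genotype, clusters):
    """Posterior probability of each cluster, assuming equal priors."""
    likelihoods = [genotype_likelihood(genotype, f) for f in clusters]
    total = sum(likelihoods)
    if total == 0.0:
        raise ValueError("genotype is impossible under every cluster")
    return [value / total for value in likelihoods]


# Hypothetical allele frequencies for two clusters and two loci.
clusters = [
    [{120: 0.8, 124: 0.2}, {200: 0.5, 204: 0.5}],  # cluster 1
    [{120: 0.2, 124: 0.8}, {200: 0.5, 204: 0.5}],  # cluster 2
]
bird = [(120, 120), (200, 204)]
print(assignment_probabilities(bird, clusters))
```

With these made-up frequencies the bird is assigned mostly to cluster 1 (about 0.94), because allele 120 is common there and rare in cluster 2; the second locus does not discriminate between the clusters.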
The program uses Bayesian (likelihood-based) methods and Markov chain
Monte Carlo (MCMC) simulations. With MCMC, the search for the most likely
genetic clusters is biased away from all combinations of allele frequencies (A below)
and towards frequencies with higher likelihoods (B below).
By biasing the search of the allele frequency space towards positions with
high likelihood, the MCMC approach can find the most likely position in most data sets
with moderate computing power.
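The way MCMC biases the search can be illustrated with a toy Metropolis sampler. This is only a sketch under simplifying assumptions, not what STRUCTURE runs internally: it estimates the frequency of a single allele at a single locus from hypothetical allele counts, with a uniform prior. The function names, step size and counts are illustrative choices.

```python
import math
import random


def log_likelihood(p, k, n):
    """Log-likelihood of allele frequency p given k copies among n alleles."""
    return k * math.log(p) + (n - k) * math.log(1.0 - p)


def metropolis(k, n, burn_in=1000, n_samples=5000, step=0.1, seed=1):
    """Sample the allele frequency with a random-walk Metropolis chain."""
    rng = random.Random(seed)
    p = 0.5
    current = log_likelihood(p, k, n)
    samples = []
    for i in range(burn_in + n_samples):
        proposal = p + rng.uniform(-step, step)
        # Proposals outside (0, 1) are impossible frequencies: reject them.
        if 0.0 < proposal < 1.0:
            candidate = log_likelihood(proposal, k, n)
            # Moves to higher likelihood are always accepted; moves to lower
            # likelihood are accepted with probability equal to the ratio.
            if rng.random() < math.exp(min(0.0, candidate - current)):
                p, current = proposal, candidate
        if i >= burn_in:
            samples.append(p)  # keep only the steps after burn-in
    return samples


# Hypothetical data: 30 copies of the allele among 40 sampled alleles.
samples = metropolis(k=30, n=40)
print(sum(samples) / len(samples))
```

The chain spends most of its time near frequencies with high likelihood instead of visiting all frequencies equally. The “burn in” setting in STRUCTURE plays the same role as `burn_in` here: early steps, taken before the chain has reached the high-likelihood region, are discarded.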
Your analyses in GenAlEx suggested that there is some population genetic structure in
great reed warblers (FST ≈ 0.04), and in particular the populations in Kazakhstan and to
some extent Sweden were differentiated from the other European populations. Can you
confirm this pattern by using the program STRUCTURE?
1. Start STRUCTURE (use version 2.3.3 or later).
2. Create a “New project” by importing the data “Warbler-STRUCTURE.txt”. Number of
individuals in the data set is 238; there are 6 loci; missing data are indicated by “-9”.
There is a row with marker names; and each individual is represented by a single
line. There are columns with individual ID, putative population origin, and sampling
locality. Check the imported data set.
3. Set up the analyses. Create a “New parameter set” with a “burn in” of 10,000 and set
the number of MCMC repetitions after burn-in to 40,000. Use an “Admixture model” (this
means that you allow some dispersal between populations) and do not use “sampling
locations as prior” (this means that you will not use any information about where the
birds were sampled).
Use default settings for all other options. Call this parameter set “10k40kAdmixture”.
4. Then “Start a job” by choosing the “10k40kAdmixture” parameter set and setting
K = 1–8 and 1 iteration. K is the number of “genetic clusters” you wish the program to
define. If the program crashes, restart it, open the project you have
created and start the job again.
5. The statistics of the results are given in the Summary Table. Which K has the highest
“Estimated Ln Prob of Data”?
Estimated Ln Prob of Data:
K = 1 ________ K = 2 ________ K = 3 ________
K = 4 ________ K = 5 ________ K = 6 ________
K = 7 ________ K = 8 ________
Most likely K: _______
Select “show bar plot” for K = 2–5. When you have opened the graph, select “Group
by POP Id” to show the 8 sampling populations. These graphs show the probability
of each individual, grouped according to sampling locations, to belong to any of the
different clusters.
Fill in the pattern in the boxes below (or use a screenshot):
Population: 1-Spa
3-Lat 4-Ger
6-Bel 7-Ukr 8-Kaz
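For the Summary Table comparison in step 5, note that the “Estimated Ln Prob of Data” values are negative, so the highest value is the one closest to zero. A small Python helper makes this explicit. The function name and the numbers are hypothetical placeholders, not results from the warbler data; replace them with the values you wrote down.

```python
def most_likely_k(ln_prob_by_k):
    """Return the K whose estimated Ln Prob of Data is highest."""
    return max(ln_prob_by_k, key=ln_prob_by_k.get)


# Hypothetical placeholder values, NOT results from the warbler data set.
ln_prob = {1: -5000.0, 2: -4950.5, 3: -4975.2}
print(most_likely_k(ln_prob))
```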
Read this: The program can have problems finding population structure when the
population differentiation is weak (FST < 0.05) and when there is isolation by distance.
This seems to be the case in great reed warblers! In such situations, one can use
information about where the samples were collected. You will now test this option.
6. Create a “New parameter set” with 10,000 “burn in” and 40,000 MCMC. Use an
“Admixture model” and this time select “use sampling locations as prior”. Use default
settings for all other options. Call this parameter set “10k40kAdmixLocation”.
7. “Start a job”. Use “10k40kAdmixLocation”, K = 1–8, and 1 iteration.
8. Evaluate the results. Which K has the highest “Estimated Ln Prob of Data”?
K = 1 ________ K = 2 ________ K = 3 ________
K = 4 ________ K = 5 ________ K = 6 ________
K = 7 ________ K = 8 ________
Most likely K: _______
Select “show bar plot” for each K. Select “Group by POP Id” to show the 8 sampling
populations (see Figure 1). Again, the graphs show the probability of each individual,
grouped according to sampling locations, to belong to any of the different genetic
clusters.
Fill in the pattern in the boxes below (or use screenshots):
Population: 1-Spa
3-Lat 4-Ger
6-Bel 7-Ukr 8-Kaz
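When comparing the bar plots from the two runs, it can help to summarize each sampling population by the average cluster membership of its individuals. The sketch below assumes you have copied the per-individual membership values and population IDs out of the STRUCTURE results by hand; the function name and the example values are hypothetical, not output from the warbler data.

```python
from collections import defaultdict


def mean_membership_by_population(pop_ids, memberships):
    """Average the cluster membership values within each population.

    pop_ids:     population ID of each individual (for example 1 to 8).
    memberships: one list of K membership values per individual,
                 in the same order as pop_ids.
    """
    sums = {}
    counts = defaultdict(int)
    for pop, values in zip(pop_ids, memberships):
        if pop not in sums:
            sums[pop] = [0.0] * len(values)
        sums[pop] = [s + v for s, v in zip(sums[pop], values)]
        counts[pop] += 1
    return {pop: [s / counts[pop] for s in sums[pop]] for pop in sums}


# Hypothetical values for four individuals and K = 2.
pop_ids = [1, 1, 8, 8]
memberships = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8], [0.4, 0.6]]
print(mean_membership_by_population(pop_ids, memberships))
```

Populations whose averages differ clearly from the others are the ones that stand out in the bar plot; populations with near-equal averages across clusters show no clear structure at that K.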
☺ Compare and discuss the output from the different runs, and compare it with the
previous FST results from GenAlEx.
Is there any evidence for population structure?
Does STRUCTURE conflict with the patterns detected with GenAlEx?
Individuals can be assigned to more than one “genetic cluster”. Why?
This exercise was put together by Bengt Hansson March 19, 2014.