doc

advertisement
Using Geneland to Map Genetic Structure in R
Eve McCulloch (emccul1@lsu.edu)
12/04/09
This self-tutorial goes step-by-step to help you use Geneland to 1) analyze spatially referenced
genotypic data, and 2) create simulated populations. It uses two dataset that will be available on
the “R for Ecologists” website:
http://www.biology.lsu.edu/webfac/kharms/BIOL7901Fall2009.htm
The first dataset is a matrix of coordinates for geo-referenced individuals. The second dataset is a
matrix of genotypes for 180 diploid individuals at 10 loci. I created this dataset using a
simulation in SIMCOAL2 with the following conditions: Number of populations = 3, effective
population sizes of 500, generations since separation = 25, and no migration since separation.
The package Geneland v 3.1.5 (Guillet et al 2005b) uses multilocus genetic data to infer
population genetic structure, optionally in a spatially explicit context (geo-referenced individual
genotypes). Specifically, it’s main functions are as follows: 1) Estimating number of
subpopulations and locating their spatial boundaries, 2) calculating F-statistics, 3) creating
graphical output of the spatial distribution of the subdivided populations (based on posterior
probabilities), and 4) simulating population divergence under isolation by distance (IBD) and
barrier models.
The package performs Bayesian inference of all parameters involved using Markov Chain MontCarlo (MCMC) simulations. The overall population is assumed to consist of subpopulations in
Hardy-Weinberg and Linkage Equilibrium (HWLE), and allele frequencies assumed drawn from
a correlated frequency model, though uncorrelated frequencies can be handled. Geneland can
account for null alleles.
There are a number of functions that I am not reviewing; however, the documentation that comes
with Geneland is extensive and a good source for any follow-up questions. For example,
Geneland can be used with ArcGIS or other map-producing programs, and graphics can be
modified beyond what I go into here.
Documentation on this package is available both from CRAN (http://cran.rproject.org/web/packages/Geneland/index.html) and on the Geneland homepage
(http://www2.imm.dtu.dk/~gigu/Geneland/).
References:
G. Guillot. Inference of structure in subdivided populations at low levels of genetic
differentiation.The correlated allele frequencies model revisited. Bionformatics, 24:2222–2228,
2008.
G. Guillot and M. Foll. Accounting for the ascertainment bias in Markov chain Monte Carlo
inferences of population structure. Bioinformatics, 25(4):552–554, 2009a.
-1-
G. Guillot and F. Santos. A computer program to simulate multilocus genotype data with
spatially auto- correlated allele frequencies. Molecular Ecology Resources, 9(4):1112 – 1120,
2009b.
G. Guillot, A. Estoup, F. Mortier, and J.F. Cosson. A spatial statistical model for landscape
genetics. Genetics, 170(3):1261–1280, 2005a.
G. Guillot, F. Mortier, and A. Estoup. Geneland: A computer package for landscape genetics.
Molecular Ecology Notes, 5(3):708–711, 2005b.
G. Guillot, F. Santos, and A. Estoup. Analysing georeferenced population genetics data with
Geneland: a new algorithm to deal with null alleles and a friendly graphical user interface.
Bioinformatics, 24(11): 1406–1407, 2008.
GETTING STARTED:
Install and load the package “Geneland”:
> install.packages(“Geneland”)
Geneland also depends on the following packages,
so be sure they are installed: RandomFields, fields,
mapproj, maps, snow, tcltk
> library(Geneland)
Load the package
Graphical interface (optional):
Geneland has the option of using a graphical interface for all the analyses. To do so, simply use
the command below:
> Geneland.GUI()
Call the graphical interface
However, for the purpose of this tutorial I will focus on the underlying R commands.
Input files and create data tables for genotypic and coordinate data:
> coord<-read.table("c:/coord.txt", header=F)
Create coordinates data table where every row is an individual; coordinates should be planar
and spherical coordinates can be converted to planar in Geneland (see manual)
> msat1<-read.table("c:/genotypes.txt", header=F)
Create table of genetic data with one line per individual two columns per locus
You should check to make sure your data tables were created correctly:
-2-
> dim(coord)
> nrow(msat1)
> ncol(msat1)
Dimensions should be 180 rows x 2 columns
Number of rows should be 180 (individuals)
Number of columns should be 20 (diploid individuals
scored at 10 microsatellite markers)
To plot the geo-referenced individuals:
> plot(coord, xlab="Eastings", ylab="Northings", asp=1)
I manually and somewhat arbitrarily assigned coordinates to the simulated genotypic data, thus
the odd patterning of the points:
You can format the genetic data, microsatellite fragments in this case, as alleles coded by
positive integers (you can also use SNPs and sequence data, but they have to recoded as if they
were alleles). Calling the function FormatGenotypes() produces the output "genotypes" and
"allele.numbers" (number of possible alleles per locus):
> msat1_format<-FormatGenotypes(msat1)
> geno1<-msat1_format$genotypes
> allele.no1<-msat1_format$allele.numbers
If you want to calculate classical F-statistics you will need to label each individual as belonging
to a certain population. To do so you could read in a data containing this information from a .txt
-3-
file, or in this case I simply created a vector numerical values corresponding to population
membership:
> pop.mbrship1<-rep(c(1,2,3), each=60)
The command "each" means change the concatenated numbers (1,2,3) after 60 repetitions,
whereas using the command "times", for example, would mean that 123... repeats for 60 times
MCMC ANALYSIS:
MCMC inference:
> MCMC(coordinates=coord, genotypes=geno1, varnpop=TRUE,
npopmax=5,spatial=TRUE, freq.model="Correlated", nit=100000, thinning=100,
path.mcmc="c:/folder_name/")
The function MCMC() runs MCMC simulations to infer the number of populations and their
spatial boundaries. The command “varnpop” is TRUE if the number of HWLE populations is
unknown and hence are treated as simulated variables;“npopnmax” set the maximum possible
number of subpopulations in your system; “spatial” specifies whether you want to analyze
geo-referenced data with a spatial (TRUE) or non-spatial (FALSE) prior. If you select FALSE
then coordinates are not used in the inference algorithm, though they are used for the
graphical representations; “freq.model” specifies whether allele frequencies in the
subpopulations are correlated or uncorrelated; “nit” is the number of iterations in the MCMC
chain; “thinning” designates how often the results will be saved (in this case, every 100
iterations). Finally, the command "path.mcmc” designates a file for your output to save in, but
it has to be in your current working directory. More commands are possible than I use here.
Post-processing MCMC output(s):
> PostProcessChain(coordinates=coord, genotypes=geno1,
path.mcmc="c:/folder_name/", nxdom=100, nydom=100, burnin=200)
The function PostProcessChain() extracts information from the MCMC analysis that is stored
in the directory specified by the argument “path.mcmc” and creates files required for final
estimations and creating maps: “nxdom” specifies the number of pixels for the horizontal
domain (x coordinates) and “nydom” specifies the number of pixels for vertical area of study
domain.
Generating graphical and numerical output:
You can plot the number of population clusters predicted by MCMC, and save the output in a file
of your choosing (use command “printit=TRUE” and then designate the file and file format):
-4-
> Plotnpop(path.mcmc="c:/folder_name/", burnin=200, printit=TRUE,
file="c:/folder_name/No_Clusters1.pdf", format="pdf")
The plot displays a clear mode at populations K=3. This is what the plot looks like:
You can also create a map of the posterior probability of population membership:
> PosteriorMode(coordinates=coord,path.mcmc="c:/folder_name/", printit=TRUE,
file="c:/folder_name/map1.pdf", format="pdf")
This is what the map looks like: black dots are geo-referenced individual genotypes, and colors
correspond to population membership based on the mode of the posterior probabilities:
-5-
You can make more graphics; however, I do not know how yet. For example, using the graphical
interface (“Geneland. GUI()”) you can create a map of the posterior probability “topography” of
any given pixel belonging to “Cluster” 1, 2 or 3 (subpopulations). Here I show only
subpopulations 1 and 2.
-6-
Checking for MCMC CONVERGENCE:
You can make a loop calling for multiple runs of MCMC inference, and then sort runs by
decreasing average posterior probability:
> nrun<-5
> burnin<-10
I call for only 5 runs to save time
I set a small burnin because I’m using few
iterations, again to save time
> for(irun in 1:nrun)
{
# Define the path to the MCMC directory
path.mcmc<-paste("c:/ folder_name_",irun, "/", sep="")
The function “paste” concatenates the components contained in the (): in this case
you are setting path.mcmc (in which you will keep the output of these different runs),
to be in additional created folders with the name “folder_name_1”,
“folder_name_2” etc, for 1 through nrun.
system(paste("mkdir ", path.mcmc))
“mkdir” makes a new directory
MCMC(coordinates=coord, genotypes=geno1, varnpop=TRUE,
npopmax=5,spatial=TRUE, freq.model="Correlated", nit=10000,
thinning=100,path.mcmc=path.mcmc)
# MCMC post-processing:
PostProcessChain(coordinates=coord, genotypes=geno1,
path.mcmc=path.mcmc, nxdom=200,nydom=200,burnin=burnin)
}
Then you can compute the average posterior probability (with burnin of 10 out of the 1000 saved
iterations)
> lpd<-rep(nrun)
> for(irun in 1:nrun)
{
path.mcmc<-paste("c:/folder_name_”, irun, "/", sep="")
path.lpd<-paste(path.mcmc,"log.posterior.density.txt", sep="")
lpd[irun]<-mean(scan(path.lpd)[-(1:burnin)])
}
Finally, sort the runs by decreasing average posterior probability, and you can choose the best of
them:
> order(lpd, decreasing=TRUE)
-7-
With each of these runs you can examine the output, plot or graph it, as you already did above.
F-STATISTICS:
You can calculate F-statistics according to Weir & Cockerham’s estimators in the classic way
where each individual is pre-assigned to a population:
> Fstat(geno1, npop=3, pop.mbrship1)
The command “Fstat” returns FST and FIS, though I couldn’t figure out how to get a p-value. It
may not be possible.
You can also compute F-statistic based on the output of the MCMC inference:
> Fstat.output(genotypes=geno1,path.mcmc="c:/folder_name/")
In this case, population membership is not assigned based by you, but determined by the
posterior probabilities.
SIMULATIONS:
You can simulate a population of geo-referenced genotypes. Below I show an example from the
Geneland manual:
> simdata1<-simFmodel(nindiv=100,
coord.lim=c(0,1,0,1),
number.nuclei=15,
nall=rep(10,20),
npop=3,
freq.model="Correlated",
drift=rep(0.04,3),
dominance="Codominant")
To visualize your data and prepare it for analysis:
> summary(simdata1)
-8-
100 individuals in total
Sets simulation limits in the unit
square
Tessellation driven by 15
polygons
20 loci with 10 alleles each
Three subpopulations
Correlated frequency model
Sets rate of drift in the 3
populations?
Co-dominant genotypes (two
columns per locus)
> sim_geno1<-simdata1$genotypes
> sim_coord1<-simdata1$coordinates
Once you have this simulated dataset you can run MCMC inference and process the MCMC
output:
> MCMC(coordinates=sim_coord1, genotypes=sim_geno1, varnpop=TRUE, npopmax=5,
spatial=TRUE, freq.model="Correlated", nit=100000, thinning=100,
path.mcmc="c:/new_folder_sim/")
> PostProcessChain(coordinates=sim_coord1, genotypes=sim_geno1,
path.mcmc="c:/new_folder_sim/", nxdom=100, nydom=100, burnin=200)
You can generate graphical and numerical output:
>Plotnpop(path.mcmc="c:/new_folder_sim/", burnin=200, printit=TRUE, file="c:/
new_folder_sim/No_Clusters_sim.pdf", format="pdf")
> PosteriorMode(coordinates=sim_coord1, path.mcmc="c:/new_folder_sim/",
printit=TRUE, file="c:/ new_folder_sim/map_sim.pdf", format="pdf")
Here is the map:
-9-
Simulating population data under a model of IBD and barriers to gene flow:
This is a function for simulation under the model described in Guillot and Santos (2009b), and it
simulates genotypes that are structured by IBD and barriers. It has a lot of possible arguments, so
I suggest reading more about it. I took this example directly from the Geneland manual; it
simulates genotypes of 100 individuals at 3 loci with 5 alleles at each locus:
> sim_data2<-simdata(nindiv=100, number.nuclei=10, allele.numbers=rep(5,3),
model="stable", IBD=TRUE, alpha=1, beta=1, gamma=1, npop=3,
give.tess.grid=TRUE, give.freq.grid=TRUE, npix=c(100,100), comp.Fst=TRUE,
comp.diff=TRUE, width=0.1, plot.pairs.borders=FALSE)
> summary(sim_data2)
To visualize the values returned
> sim_geno2<-sim_data2$genotypes
> sim_coord2<-sim_data2$coord.indiv
> sim_color.nuclei<-sim_data2$color.nuclei[sim_data2$nearest.nucleus.indiv]
Shows population (cluster) membership of the individuals
- 10 -
You can check the dimensions of your data tables and take a look at them:
> dim(sim_coord2)
> dim(sim_geno2)
> sim_geno2[1:10,]
> sim_coord2[1:10,]
You can also visualize the simulated dataset with the function "show.simdata":
> show.simdata(sim_data2, plot.coord=TRUE, plot.tess=TRUE,
plot.freq.grid=TRUE, loc.grid=1, zlim.freq=c(0,1))
The argument “loc.grid” maps allele frequencies at the locus you specify, in this case, locus 1
(which has 5 alleles)
This function produced many graphs, of which I am including the maps of allele frequencies at
locus 1 for alleles 1 and 2:
- 11 -
Download