
Using Geneland to Map Genetic Structure in R
Eve McCulloch (emccul1@lsu.edu)
12/04/09

This self-tutorial goes step by step to help you use Geneland to 1) analyze spatially referenced genotypic data, and 2) create simulated populations. It uses two datasets that will be available on the “R for Ecologists” website: http://www.biology.lsu.edu/webfac/kharms/BIOL7901Fall2009.htm

The first dataset is a matrix of coordinates for geo-referenced individuals. The second dataset is a matrix of genotypes for 180 diploid individuals at 10 loci. I created this dataset using a simulation in SIMCOAL2 with the following conditions: number of populations = 3, effective population sizes of 500, generations since separation = 25, and no migration since separation.

The package Geneland v 3.1.5 (Guillot et al. 2005b) uses multilocus genetic data to infer population genetic structure, optionally in a spatially explicit context (geo-referenced individual genotypes). Specifically, its main functions are as follows: 1) estimating the number of subpopulations and locating their spatial boundaries, 2) calculating F-statistics, 3) creating graphical output of the spatial distribution of the subdivided populations (based on posterior probabilities), and 4) simulating population divergence under isolation by distance (IBD) and barrier models.

The package performs Bayesian inference of all parameters involved using Markov chain Monte Carlo (MCMC) simulations. The overall population is assumed to consist of subpopulations in Hardy-Weinberg and linkage equilibrium (HWLE), and allele frequencies are assumed to be drawn from a correlated frequency model, though uncorrelated frequencies can also be handled. Geneland can account for null alleles.

There are a number of functions that I am not reviewing; however, the documentation that comes with Geneland is extensive and a good source for any follow-up questions.
For example, Geneland can be used with ArcGIS or other map-producing programs, and graphics can be modified beyond what I go into here. Documentation on this package is available both from CRAN (http://cran.r-project.org/web/packages/Geneland/index.html) and on the Geneland homepage (http://www2.imm.dtu.dk/~gigu/Geneland/).

References:

G. Guillot. Inference of structure in subdivided populations at low levels of genetic differentiation. The correlated allele frequencies model revisited. Bioinformatics, 24:2222–2228, 2008.

G. Guillot and M. Foll. Accounting for the ascertainment bias in Markov chain Monte Carlo inferences of population structure. Bioinformatics, 25(4):552–554, 2009a.

G. Guillot and F. Santos. A computer program to simulate multilocus genotype data with spatially auto-correlated allele frequencies. Molecular Ecology Resources, 9(4):1112–1120, 2009b.

G. Guillot, A. Estoup, F. Mortier, and J.F. Cosson. A spatial statistical model for landscape genetics. Genetics, 170(3):1261–1280, 2005a.

G. Guillot, F. Mortier, and A. Estoup. Geneland: A computer package for landscape genetics. Molecular Ecology Notes, 5(3):708–711, 2005b.

G. Guillot, F. Santos, and A. Estoup. Analysing georeferenced population genetics data with Geneland: a new algorithm to deal with null alleles and a friendly graphical user interface. Bioinformatics, 24(11):1406–1407, 2008.

GETTING STARTED:

Install and load the package “Geneland”:

> install.packages("Geneland")

Geneland also depends on the following packages, so be sure they are installed: RandomFields, fields, mapproj, maps, snow, tcltk

> library(Geneland)    # Load the package

Graphical interface (optional): Geneland has the option of using a graphical interface for all the analyses. To do so, simply use the command below:

> Geneland.GUI()    # Call the graphical interface

However, for the purpose of this tutorial I will focus on the underlying R commands.
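Before loading Geneland, it can help to confirm that the dependencies listed under GETTING STARTED are present. The following is a minimal sketch using only base R; the variable names (deps, is_installed, missing_pkgs) are my own and are not part of Geneland, and the package list is simply the one given above.

```r
# Packages named in this tutorial as Geneland dependencies
deps <- c("RandomFields", "fields", "mapproj", "maps", "snow", "tcltk")

# system.file() returns "" when a package is not installed,
# so nzchar() gives TRUE for installed packages and FALSE otherwise
is_installed <- vapply(deps,
                       function(p) nzchar(system.file(package = p)),
                       logical(1))

# Names of the packages that still need to be installed
missing_pkgs <- deps[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All listed dependencies are installed.")
}
```

Any names reported as missing could then be passed to install.packages(), provided they are available from your repository.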
Read in files and create data tables for genotypic and coordinate data:

> coord<-read.table("c:/coord.txt", header=FALSE)

This creates the coordinates data table, where every row is an individual. Coordinates should be planar; spherical coordinates can be converted to planar in Geneland (see manual).

> msat1<-read.table("c:/genotypes.txt", header=FALSE)

This creates the table of genetic data, with one line per individual and two columns per locus.

You should check to make sure your data tables were created correctly:

> dim(coord)      # Dimensions should be 180 rows x 2 columns
> nrow(msat1)     # Number of rows should be 180 (individuals)
> ncol(msat1)     # Number of columns should be 20 (diploid individuals scored at 10 microsatellite markers)

To plot the geo-referenced individuals:

> plot(coord, xlab="Eastings", ylab="Northings", asp=1)

I manually and somewhat arbitrarily assigned coordinates to the simulated genotypic data, thus the odd patterning of the points:

[Figure: plot of the geo-referenced individuals]

You can format the genetic data, microsatellite fragments in this case, as alleles coded by positive integers (you can also use SNPs and sequence data, but they have to be recoded as if they were alleles). Calling the function FormatGenotypes() produces the output "genotypes" and "allele.numbers" (number of possible alleles per locus):

> msat1_format<-FormatGenotypes(msat1)
> geno1<-msat1_format$genotypes
> allele.no1<-msat1_format$allele.numbers

If you want to calculate classical F-statistics you will need to label each individual as belonging to a certain population. To do so you could read in a data file containing this information from a .txt file, or, as in this case, simply create a vector of numerical values corresponding to population membership:

> pop.mbrship1<-rep(c(1,2,3), each=60)

The argument "each" repeats each of the concatenated numbers (1,2,3) 60 times before moving on to the next, whereas using the argument "times", for example, would repeat the whole sequence 1, 2, 3 a total of 60 times.

MCMC ANALYSIS:

MCMC inference:

> MCMC(coordinates=coord, genotypes=geno1, varnpop=TRUE, npopmax=5, spatial=TRUE, freq.model="Correlated", nit=100000, thinning=100, path.mcmc="c:/folder_name/")

The function MCMC() runs MCMC simulations to infer the number of populations and their spatial boundaries. The argument “varnpop” is TRUE if the number of HWLE populations is unknown and hence is treated as a simulated variable; “npopmax” sets the maximum possible number of subpopulations in your system; “spatial” specifies whether you want to analyze geo-referenced data with a spatial (TRUE) or non-spatial (FALSE) prior. If you select FALSE then coordinates are not used in the inference algorithm, though they are used for the graphical representations; “freq.model” specifies whether allele frequencies in the subpopulations are correlated or uncorrelated; “nit” is the number of iterations in the MCMC chain; “thinning” designates how often the results will be saved (in this case, every 100 iterations). Finally, the argument “path.mcmc” designates a folder for your output to be saved in, but it has to be in your current working directory. More arguments are possible than I use here.

Post-processing MCMC output(s):

> PostProcessChain(coordinates=coord, genotypes=geno1, path.mcmc="c:/folder_name/", nxdom=100, nydom=100, burnin=200)

The function PostProcessChain() extracts information from the MCMC analysis that is stored in the directory specified by the argument “path.mcmc” and creates the files required for final estimations and for creating maps: “nxdom” specifies the number of pixels for the horizontal domain (x coordinates) and “nydom” specifies the number of pixels for the vertical domain (y coordinates) of the study area.
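The values of nit, thinning and burnin used above are linked by simple arithmetic, sketched here in base R. This assumes that burnin is counted in saved (thinned) iterations rather than raw iterations, which is how the convergence section of this tutorial describes it; the variable names n_saved and n_kept are mine.

```r
# Settings used in the MCMC() and PostProcessChain() calls above
nit      <- 100000   # total MCMC iterations
thinning <- 100      # one iteration saved every 100
burnin   <- 200      # saved iterations discarded (assumed unit: saved iterations)

# Number of iterations written to disk
n_saved <- nit / thinning

# Number of saved iterations left after discarding the burn-in
n_kept <- n_saved - burnin

cat("Saved iterations:", n_saved, "\n")
cat("Kept after burn-in:", n_kept, "\n")
```

Under that assumption, burnin must be smaller than nit/thinning, otherwise nothing is left to post-process.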
Generating graphical and numerical output:

You can plot the number of population clusters predicted by MCMC, and save the output in a file of your choosing (use the argument “printit=TRUE” and then designate the file and file format):

> Plotnpop(path.mcmc="c:/folder_name/", burnin=200, printit=TRUE, file="c:/folder_name/No_Clusters1.pdf", format="pdf")

The plot displays a clear mode at K=3 populations. This is what the plot looks like:

[Figure: output of Plotnpop()]

You can also create a map of the posterior probability of population membership:

> PosteriorMode(coordinates=coord, path.mcmc="c:/folder_name/", printit=TRUE, file="c:/folder_name/map1.pdf", format="pdf")

This is what the map looks like: black dots are geo-referenced individual genotypes, and colors correspond to population membership based on the mode of the posterior probabilities:

[Figure: output of PosteriorMode()]

You can make more graphics; however, I do not yet know how to do so with R commands. For example, using the graphical interface (“Geneland.GUI()”) you can create a map of the posterior probability “topography” of any given pixel belonging to “Cluster” 1, 2 or 3 (subpopulations). Here I show only subpopulations 1 and 2.

[Figure: posterior probability maps for clusters 1 and 2]

Checking for MCMC CONVERGENCE:

You can make a loop calling for multiple runs of MCMC inference, and then sort runs by decreasing average posterior probability:

> nrun<-5       # I call for only 5 runs to save time
> burnin<-10    # I set a small burnin because I’m using few iterations, again to save time
> for(irun in 1:nrun) {
    # Define the path to the MCMC directory
    path.mcmc<-paste("c:/folder_name_", irun, "/", sep="")
    # Make the new directory (dir.create() is used here in place of a shell call to "mkdir")
    dir.create(path.mcmc)
    MCMC(coordinates=coord, genotypes=geno1, varnpop=TRUE, npopmax=5, spatial=TRUE, freq.model="Correlated", nit=10000, thinning=100, path.mcmc=path.mcmc)
    # MCMC post-processing:
    PostProcessChain(coordinates=coord, genotypes=geno1, path.mcmc=path.mcmc, nxdom=200, nydom=200, burnin=burnin)
  }

The function “paste” concatenates the components contained in the (): in this case you are setting path.mcmc (in which you will keep the output of these different runs) to be in additional created folders with the names “folder_name_1”, “folder_name_2”, etc., for 1 through nrun.

Then you can compute the average posterior probability (with a burnin of 10 out of the 100 saved iterations, since nit=10000 and thinning=100):

> lpd<-rep(NA, nrun)
> for(irun in 1:nrun) {
    path.mcmc<-paste("c:/folder_name_", irun, "/", sep="")
    path.lpd<-paste(path.mcmc, "log.posterior.density.txt", sep="")
    lpd[irun]<-mean(scan(path.lpd)[-(1:burnin)])
  }

Finally, sort the runs by decreasing average posterior probability, and you can choose the best of them:

> order(lpd, decreasing=TRUE)

With each of these runs you can examine the output, plot or graph it, as you already did above.

F-STATISTICS:

You can calculate F-statistics according to Weir & Cockerham’s estimators in the classic way, where each individual is pre-assigned to a population:

> Fstat(geno1, npop=3, pop.mbrship1)

The function “Fstat” returns FST and FIS, though I couldn’t figure out how to get a p-value. It may not be possible.

You can also compute F-statistics based on the output of the MCMC inference:

> Fstat.output(genotypes=geno1, path.mcmc="c:/folder_name/")

In this case, population membership is not assigned by you, but determined by the posterior probabilities.

SIMULATIONS:

You can simulate a population of geo-referenced genotypes.
Below I show an example from the Geneland manual:

> simdata1<-simFmodel(nindiv=100, coord.lim=c(0,1,0,1), number.nuclei=15, nall=rep(10,20), npop=3, freq.model="Correlated", drift=rep(0.04,3), dominance="Codominant")

The arguments are:
nindiv=100: 100 individuals in total
coord.lim=c(0,1,0,1): sets simulation limits in the unit square
number.nuclei=15: tessellation driven by 15 polygons
nall=rep(10,20): 20 loci with 10 alleles each
npop=3: three subpopulations
freq.model="Correlated": correlated frequency model
drift=rep(0.04,3): appears to set the rate of drift in the 3 populations
dominance="Codominant": co-dominant genotypes (two columns per locus)

To visualize your data and prepare it for analysis:

> summary(simdata1)
> sim_geno1<-simdata1$genotypes
> sim_coord1<-simdata1$coordinates

Once you have this simulated dataset you can run MCMC inference and process the MCMC output:

> MCMC(coordinates=sim_coord1, genotypes=sim_geno1, varnpop=TRUE, npopmax=5, spatial=TRUE, freq.model="Correlated", nit=100000, thinning=100, path.mcmc="c:/new_folder_sim/")
> PostProcessChain(coordinates=sim_coord1, genotypes=sim_geno1, path.mcmc="c:/new_folder_sim/", nxdom=100, nydom=100, burnin=200)

You can generate graphical and numerical output:

> Plotnpop(path.mcmc="c:/new_folder_sim/", burnin=200, printit=TRUE, file="c:/new_folder_sim/No_Clusters_sim.pdf", format="pdf")
> PosteriorMode(coordinates=sim_coord1, path.mcmc="c:/new_folder_sim/", printit=TRUE, file="c:/new_folder_sim/map_sim.pdf", format="pdf")

Here is the map:

[Figure: output of PosteriorMode() for the simulated data]

Simulating population data under a model of IBD and barriers to gene flow:

This is a function for simulation under the model described in Guillot and Santos (2009b), and it simulates genotypes that are structured by IBD and barriers. It has a lot of possible arguments, so I suggest reading more about it.
I took this example directly from the Geneland manual; it simulates genotypes of 100 individuals at 3 loci with 5 alleles at each locus:

> sim_data2<-simdata(nindiv=100, number.nuclei=10, allele.numbers=rep(5,3), model="stable", IBD=TRUE, alpha=1, beta=1, gamma=1, npop=3, give.tess.grid=TRUE, give.freq.grid=TRUE, npix=c(100,100), comp.Fst=TRUE, comp.diff=TRUE, width=0.1, plot.pairs.borders=FALSE)
> summary(sim_data2)    # To visualize the values returned
> sim_geno2<-sim_data2$genotypes
> sim_coord2<-sim_data2$coord.indiv
> sim_color.nuclei<-sim_data2$color.nuclei[sim_data2$nearest.nucleus.indiv]    # Shows population (cluster) membership of the individuals

You can check the dimensions of your data tables and take a look at them:

> dim(sim_coord2)
> dim(sim_geno2)
> sim_geno2[1:10,]
> sim_coord2[1:10,]

You can also visualize the simulated dataset with the function "show.simdata":

> show.simdata(sim_data2, plot.coord=TRUE, plot.tess=TRUE, plot.freq.grid=TRUE, loc.grid=1, zlim.freq=c(0,1))

The argument “loc.grid” maps allele frequencies at the locus you specify, in this case, locus 1 (which has 5 alleles).

This function produced many graphs, of which I am including the maps of allele frequencies at locus 1 for alleles 1 and 2:

[Figure: maps of allele frequencies at locus 1 for alleles 1 and 2]
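To make the idea of allele frequencies at a single locus concrete, here is a minimal base-R sketch that tabulates them from a genotype table laid out with two columns per locus, the layout used throughout this tutorial. It is not a Geneland function: the toy matrix toy_geno and the helper allele_freq are made up for illustration, and the sketch gives overall frequencies rather than the spatial frequency maps that show.simdata draws.

```r
# Toy genotype matrix: 4 diploid individuals, 2 loci, two columns per locus.
# Allele codes are positive integers; the values are invented for illustration.
toy_geno <- matrix(c(1, 2,  3, 3,
                     1, 1,  3, 4,
                     2, 2,  4, 4,
                     1, 1,  3, 4),
                   nrow = 4, byrow = TRUE)

# Hypothetical helper: frequency of each allele at one locus
allele_freq <- function(geno, locus) {
  cols <- c(2 * locus - 1, 2 * locus)    # the two columns holding this locus
  alleles <- as.vector(geno[, cols])     # pool both allele copies of every individual
  alleles <- alleles[!is.na(alleles)]    # drop missing genotypes, if any
  table(alleles) / length(alleles)       # proportion of each allele code
}

freq_locus1 <- allele_freq(toy_geno, locus = 1)
print(freq_locus1)
```

The same helper could be applied to a real genotype table such as sim_geno2, assuming it follows the same two-columns-per-locus layout.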
