Lecture 14: Population Structure and Population Assignment
February 28, 2014

Last Time
- Sample calculation of FST
- Defining populations on genetic criteria: introduction to Structure

Today
- Interpretation of F-statistics
- More on the Structure program
- Principal Components Analysis
- Population assignment

FST: What does it tell us?
- Degree of differentiation of subpopulations
- Rules of thumb:
  - 0.05 to 0.15 is weak to moderate differentiation
  - 0.15 to 0.25 is strong differentiation
  - >0.25 is very strong differentiation
- Related to the historical level of gene exchange between populations
- May not represent current conditions

FST is related to life history

Trait                 Category            FST
Seed dispersal        Gravity             0.446
                      Explosive/capsule   0.262
                      Winged/plumose      0.079
Successional stage    Early               0.411
                      Middle              0.184
                      Late                0.105
Life cycle            Annual              0.430
                      Short-lived         0.262
                      Long-lived          0.077
(Loveless and Hamrick, 1984)

Structure Program
- One of the most widely used programs in population genetics (original paper cited >11,000 times since 2000)
- Very flexible model that can determine:
  - The most likely number of uniform groups (populations, K)
  - The genomic composition of each individual (admixture coefficients)
  - Possible population of origin

A simple model of population structure
- Individuals in our sample represent a mixture of K (unknown) ancestral populations.
- Each population is characterized by (unknown) allele frequencies at each locus.
- Within populations, markers are in Hardy-Weinberg equilibrium (HWE) and linkage equilibrium.
- Roughly speaking, the model sorts individuals into K clusters so as to minimize departures from HWE and linkage equilibrium.
Slide adapted from Jonathan Pritchard, 2007 presentation to Conservation Genetics meeting

More on the model...
- Let A_1, A_2, …, A_K represent the (unknown) allele frequencies in each subpopulation.
- Let Z_1, Z_2, …, Z_m represent the (unknown) subpopulation of origin of the sampled individuals.
- Assuming Hardy-Weinberg and linkage equilibrium within subpopulations, the likelihood of an individual's genotype in subpopulation k is given by the product of the relevant genotype probabilities across loci:

$$\Pr(G_i \mid Z_i = k, A_k) = \prod_{\text{loci}} P_l$$

where P_l is the probability of observing the individual's genotype at locus l in subpopulation k.

Probability of observing a genotype in a subpopulation
- The probability of observing a genotype at locus l by chance in a population is a function of the allele frequencies:

$$P_l = p_i^2 \quad \text{(homozygote)}$$
$$P_l = 2 p_i p_j \quad \text{(heterozygote)}$$
$$P = \prod_{l=1}^{m} P_l \quad \text{for } m \text{ loci}$$

- Assumes unlinked (independent) loci and Hardy-Weinberg equilibrium.

If we knew the population allele frequencies in advance, it would be easy to assign individuals. If we knew the individual assignments, it would be easy to estimate frequencies. In practice, we know neither, but the following MCMC algorithm converges to sensible joint estimates of both.

MCMC algorithms provide a way of efficiently exploring parameter space to find the most probable combination of values
(Figure source: http://www.frankfurt-consulting.de/English/optimierung_us.htm)
Take Stat 745 Data Mining with Dr. Culp for gory details.

MCMC algorithm (for fixed K)
- Start with a random assignment of individuals to populations.
- Step 1: Gene frequencies in each population are estimated based on the individuals that are assigned to it.
- Step 2: Individuals are assigned to populations based on gene frequencies in each population.
- Continue this process many times to maximize the likelihood of the arrangement.
- Estimation of K is performed separately.
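The two alternating steps above can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the Structure program's implementation: it assumes biallelic loci (genotypes coded as 0, 1 or 2 copies of one allele), the no-admixture model, a flat Beta(1, 1) prior on allele frequencies, and equal prior probability for every cluster. All function names (`genotype_prob`, `run_mcmc`, `simulate`, and so on) are hypothetical.

```python
import math
import random

def genotype_prob(g, p):
    """HWE probability of genotype g (0, 1 or 2 copies of an allele with frequency p)."""
    if g == 2:
        return p * p                  # homozygote: p^2
    if g == 1:
        return 2.0 * p * (1.0 - p)    # heterozygote: 2pq
    return (1.0 - p) * (1.0 - p)      # other homozygote: q^2

def log_lik(genotypes, freqs):
    """log Pr(G_i | Z_i = k, A_k): sum of per-locus log probabilities (independent loci)."""
    return sum(math.log(genotype_prob(g, p)) for g, p in zip(genotypes, freqs))

def sample_freqs(data, z, K, rng):
    """Step 1: draw allele frequencies for each cluster from its current members."""
    n_loci = len(data[0])
    freqs = []
    for k in range(K):
        members = [ind for ind, zk in zip(data, z) if zk == k]
        row = []
        for l in range(n_loci):
            ones = sum(ind[l] for ind in members)
            zeros = 2 * len(members) - ones
            p = rng.betavariate(1 + ones, 1 + zeros)   # Beta(1, 1) prior is an assumption
            row.append(min(max(p, 1e-9), 1 - 1e-9))    # keep logs finite
        freqs.append(row)
    return freqs

def sample_assignments(data, freqs, rng):
    """Step 2: reassign each individual with probability proportional to its likelihood."""
    z = []
    for ind in data:
        ll = [log_lik(ind, f) for f in freqs]
        top = max(ll)
        weights = [math.exp(x - top) for x in ll]      # subtract max to avoid underflow
        z.append(rng.choices(range(len(freqs)), weights=weights)[0])
    return z

def run_mcmc(data, K, n_iter=200, seed=1):
    """Alternate Step 1 and Step 2, starting from a random assignment."""
    rng = random.Random(seed)
    z = [rng.randrange(K) for _ in data]
    for _ in range(n_iter):
        freqs = sample_freqs(data, z, K, rng)
        z = sample_assignments(data, freqs, rng)
    return z, freqs

def simulate(n_per_pop, pop_freqs, seed=0):
    """Simulate diploid genotypes under HWE for each population (toy data)."""
    rng = random.Random(seed)
    data, truth = [], []
    for k, freqs in enumerate(pop_freqs):
        for _ in range(n_per_pop):
            data.append([(rng.random() < p) + (rng.random() < p) for p in freqs])
            truth.append(k)
    return data, truth

# Two strongly differentiated toy populations, 20 loci each (made-up frequencies).
data, truth = simulate(20, [[0.9] * 20, [0.1] * 20])
z, freqs = run_mcmc(data, K=2)
agree = sum(a == b for a, b in zip(z, truth)) / len(truth)
print("agreement with true labels (up to label swap):", max(agree, 1 - agree))
```

Cluster labels are arbitrary, so agreement is reported up to a swap of the two labels. A real analysis would also discard burn-in iterations and average over many sampled states rather than reading off the final one.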
Admixed individuals are mosaics of ancestry from the original populations
(Figure label: Ancestral Populations)

The two basic ancestry models used by Structure
- No Admixture: each individual is derived completely from a single subpopulation.
- Admixture: individuals may have mixed ancestry; some fraction q_k of the genome of individual i is derived from subpopulation k.
The admixture model allows for hybrids. It is also more flexible and often provides a better fit for complicated structure. This is what we used in lab.

Notes on Estimating the Number of Subpopulations (K)
- The likelihood-based method is the simplest, but likelihood often increases continuously with K
- More variability in likelihood at values of K beyond the "natural" value
- The Evanno et al. (2005) method measures the change in likelihood and discounts for variation
- Use biological reasoning in arriving at the final value
- Can also incorporate prior expectations based on population locations and other information (e.g., Geneland package)
- Often need to do hierarchical analyses: break into subregions and run Structure separately for each

Estimating K
Structure is run separately at different values of K. The program computes a statistic that measures the fit of each value of K (sort of a penalized likelihood); this can be used to help select K.

Assumed value of K    Ln Pr(D | K)
1                     -71500
2                     -69200
3                     -70500

Convert to posterior probability using Bayes' Theorem, where each p_i corresponds to one of the three candidate values of K:

$$\Pr(p \mid \text{Data}) = \frac{\Pr(\text{Data} \mid p)\,\Pr(p)}{\sum_{i=1}^{3} \Pr(\text{Data} \mid p_i)\,\Pr(p_i)}$$

With equal priors, K = 2 receives essentially all of the posterior probability in this example, because its log-likelihood exceeds the others by more than 1,000 log units.

Another method for inference of K
The ΔK method of Evanno et al. (2005, Mol. Ecol.
14: 2611-2620):
(Source: Eckert, Population Structure, 5-Aug-2008)

Inferred human population structure
(Figure group labels: Africans, Europeans, MidEast, Cent/S Asia, Asia, Oceania, America)
Each individual is a thin vertical line that is partitioned into K colored segments according to its membership coefficients in K clusters.
Rosenberg et al. 2002 Science 298: 2381-2385

Structure is Hierarchical: Groups reveal more substructure when examined separately
Rosenberg et al. 2002 Science 298: 2381-2385

Alternative clustering method: Principal Components Analysis
- Structure is very computationally intensive
- Often there is no clear best-supported value of K
- An alternative is to use traditional multivariate statistics to find uniform groups
- Principal Components Analysis is the most commonly used algorithm
- EIGENSOFT (PCA, Patterson et al., 2006; PLoS Genetics 2:e190)

Principal Components Analysis
- Efficient way to summarize multivariate data like genotypes
- Each axis passes through the maximum variation in the data and explains a component of the variation
(Figure source: http://www.mech.uq.edu.au/courses/mech4710/pca/s1.htm)

How do we identify population of origin?
Once you have populations defined, can you assign a migrant individual to its population of origin?

Human Population Assignment with SNPs
- Assayed 500,000 SNP genotypes for 3,192 Europeans
- Used Principal Components Analysis to ordinate samples in space
- High correspondence between sample ordination and geographic origin of samples
- Individuals assigned to populations of origin with high accuracy
Novembre et al.
2008 Nature 456:98

Using Structure to Show Populations of Origin: Taita Thrush data
- Three main sampling locations in Kenya
- Low migration rates (radio-tagging study)
- 155 individuals, genotyped at 7 microsatellite loci
Slide courtesy of Jonathan Pritchard

Likelihood Approaches
- Allow evaluation of alternative hypotheses by comparing their relative likelihoods given the evidence:

$$L(H_1, H_2 \mid E) = \frac{P(E \mid H_1)}{P(E \mid H_2)}$$

- In a population assignment or forensic context, definition of the competing hypotheses is the most essential component

Population Assignment: Likelihood
- Assume you find skin cells and blood under the fingernails of a murder victim
- The victim had major debts with the Sicilian mafia as well as the Chinese mafia
- Can population assignment help to focus the investigation?

$$L(H_1, H_2 \mid G) = LR = \frac{P(G \mid H_1)}{P(G \mid H_2)}$$

- What is H1 and what is H2?

Population Assignment: Likelihood
- "Assignment Tests" are based on allele frequencies in source populations and the genetic composition of individuals
- Likelihood-Based Approaches
  - Calculate the likelihood that an individual genotype originated in a particular population
  - Assume Hardy-Weinberg and linkage equilibria
  - Genotype frequencies are corrected for the presence of the sampled individual
  - Usually reported as the log10 likelihood for origin in a given population relative to another population
  - Implemented in the 'GENECLASS' program (http://www.montpellier.inra.fr/URLB/geneclass/geneclass.html)

$$P_{kl} = p_{il}^2 \quad \text{for homozygote } A_iA_i \text{ in population } l \text{ at locus } k$$
$$P_{kl} = 2 p_{il} p_{jl} \quad \text{for heterozygote } A_iA_j \text{ in population } l \text{ at locus } k$$
$$P = \prod_{k=1}^{m} P_{kl} \quad \text{for } m \text{ loci}$$

Power of Population Assignment using Likelihood
Assignment success depends on:
- Number of markers used
- Polymorphism of markers
- Number of possible source populations
- Differentiation of populations
- Accuracy of allele frequency estimates

Rules of Thumb (Cornuet et al.
1999) for 100% assignment success with 10 reference populations, need:
- 30 to 50 reference individuals per population
- 10 microsatellite loci
- HE > 0.6
- FST > 0.1

Population Assignment Example: Wolf Populations in Northwest Territories
- Wolves sampled from island and mainland populations in the Canadian Northwest Territories
- Immigrants detected on mainland (black circles) from Banks Island (white circles)
Carmichael et al. 2001 Mol Ecol 10:2787

Population Assignment Example: Fish Stories
- Fishing competition on Lake Saimaa in Southeast Finland
- Contestant allegedly caught a 5.5 kg salmon, much larger than usual for the lake
- Compared fish from the lake to fish from local markets (originating from Norway and the Baltic Sea)
- 7 microsatellites
- Based on likelihood analysis, the fish was purchased rather than caught in the lake
(Figure labels: Lake Saimaa, Market)
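The likelihood-based assignment test described in this lecture can be sketched as follows. This is a minimal illustration under stated assumptions, not the GENECLASS implementation: the allele frequencies, allele names, and population names ("lake", "market") are made up for the example, the function names are hypothetical, and alleles missing from a reference population are given an arbitrary floor frequency of 0.01. It also omits the correction for the presence of the sampled individual in the reference sample.

```python
import math

def locus_prob(a1, a2, freqs, floor=0.01):
    """HWE genotype probability at one locus: p_i^2 or 2 * p_i * p_j.

    freqs maps allele name -> frequency in one reference population.
    Alleles not seen in that population get `floor` (an assumption).
    """
    p1 = freqs.get(a1, floor)
    p2 = freqs.get(a2, floor)
    return p1 * p1 if a1 == a2 else 2.0 * p1 * p2

def log10_likelihood(genotype, pop_freqs):
    """log10 of the multilocus likelihood: product over loci, assuming independence."""
    return sum(math.log10(locus_prob(a1, a2, f))
               for (a1, a2), f in zip(genotype, pop_freqs))

def assign(genotype, populations):
    """Return the best-supported population and the log10 likelihood for each one."""
    scores = {name: log10_likelihood(genotype, f) for name, f in populations.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical reference frequencies at two loci (illustrative values only).
populations = {
    "lake":   [{"A": 0.8, "B": 0.2}, {"C": 0.5, "D": 0.5}],
    "market": [{"A": 0.1, "B": 0.9}, {"C": 0.5, "D": 0.5}],
}
genotype = [("A", "A"), ("C", "D")]   # one individual: alleles at locus 1 and locus 2

best, scores = assign(genotype, populations)
log10_lr = scores["lake"] - scores["market"]   # log10 LR for H1 = lake vs. H2 = market
print(best, round(log10_lr, 3))
```

In this toy case the genotype likelihood is 0.64 × 0.5 under "lake" and 0.01 × 0.5 under "market", so the likelihood ratio is 64 in favor of "lake". With real data, the number of loci and the differentiation between populations determine how decisive such a ratio can be.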