Advanced Population and Medical Genetics
EPI511, Spring 1, 2016
Suggested topics for short research paper

Please send a single .pdf file, containing your short research paper, by email to Alkes Price (aprice@hsph.harvard.edu) by 5:00pm on Fri Mar 11.

An aggregate list of suggested topics is provided below. You may either choose one topic from this list, or choose your own topic. Each student is strongly encouraged to schedule a 15-minute appointment with Alkes during the week of Feb 29 – Mar 4, and should choose and begin work on their topic prior to this meeting. The meeting can be scheduled by writing to Jill McDonald (jrmcdona@hsph.harvard.edu).

The short research paper should be 1,000-1,500 words long, and should include an abstract, plus one figure and one table and at least 10 references (including journal name and year). Additional subdivision into Introduction, Results, Discussion and Methods sections is optional. For an example of a short research paper, see Lindstrom et al. 2011 Nat Genet.

Policy on group work: OK to discuss experiences with your colleagues, but each piece of code that you write and each piece of text that you write should be your own.

[From Week 1]

• Design a metric to evaluate how similar the LD patterns are in 2 populations pop1 and pop2. How closely do differences in LD patterns correspond to genetic distances as quantified by FST, either within or across continents? Evaluate and discuss using all 11 HapMap3 populations and data. Are differences in LD patterns between populations more closely related to drift or to divergence? See Sved et al. 2008 Am J Hum Genet.

[From Week 2]

• Design and implement a strategy for choosing MT ancestry-informative markers (AIMs) from a set of M total SNPs for which training data from 2 populations are available. The strategy should be designed to maximize FST,AIM (the FST between the 2 populations at that set of AIMs), e.g.
as estimated using independent validation samples from the same (or different but closely related) populations. Provide theoretical derivations of how FST,AIM is expected to vary with MT, M, the training sample size N, and the FST between the 2 populations at the set of all M SNPs. Discuss how empirical FST,AIM results vary as a function of these parameters, and whether they agree with your derivations. Include both FST<0.01 and FST>0.10 pairs of populations in your analyses.

• Design a metric to determine the extent to which a population is recently admixed by looking for evidence of LD (beyond what would be expected by chance) at a distance between markers that is beyond the typical range of LD within a homogeneous population, but within the typical range of admixture-LD. Apply this test to each of the 11 HapMap3 populations, and also to simulated admixtures of HapMap3 populations. Include analyses at different sample sizes to evaluate how sample size affects this statistic. Is this metric an effective way to test if a population is recently admixed? Discuss.

[From Week 3]

• Design a metric to determine the extent to which a population is structured by looking for inflation in the distribution of associations between SNPs located on different chromosomes. Provide theoretical derivations of how this metric, when applied to data from the union of a pair of populations, is expected to vary with FST and sample size N. Apply this metric to the union of a pair of distantly related HapMap3 populations, to the union of a pair of closely related HapMap3 populations, and to individual HapMap3 populations. Discuss how empirical results vary as a function of FST and sample size N, and whether they agree with your derivations.

• Consider a quantitative trait that is 100% heritable (i.e.
100% determined by genetic factors) with the phenotype of individual j equal to π_j = Σ_i α_i g_ij, where i indexes a set of causal SNPs that affect phenotype, α_i is the effect size of causal SNP i, and g_ij are normalized genotypes. Consider a population with 50% of individuals from POP1 and 50% of individuals from POP2. Let FST denote FST(POP1,POP2) at non-causal SNPs. Let FST,causal denote FST(POP1,POP2) at causal SNPs, which may be different from FST (see e.g. Chen et al. 2012 PLoS Genet). Derive a formula for how population stratification (quantified by λGC at the set of non-causal SNPs) in a disease study of sample size N in this population depends on N, FST and FST,causal. (Note that FST determines how genotype at non-causal SNPs varies with ancestry and FST,causal determines how phenotype varies with ancestry.) Compare your derivation to empirical results from simulations involving HapMap3 data. This should include simulations in which FST = FST,causal, as well as simulations in which the set of causal SNPs is chosen so that FST ≠ FST,causal.

[From Week 4]

• Use theory and simulations (HapMap3 genotypes, simulated phenotypes) to predict and assess the extent to which the top associated SNP at a locus in a GWAS in one continental population would replicate in a study in a different population (e.g. via slope of log odds ratio regression), under each of the following scenarios: (1) the causal SNP is a HapMap3 SNP, and all HapMap3 SNPs are typed, (2) the causal SNP is a HapMap3 SNP and is not typed or imputed, but all other HapMap3 SNPs are typed, (3) the causal SNP is a HapMap3 SNP, and this SNP along with a random subset of half of all HapMap3 SNPs is not typed or imputed, but all other HapMap3 SNPs are typed. Investigate the answer for different choices of GWAS population, replication population, causal variant allele frequency, and sample sizes.
Note: it is appropriate to combine different populations with the same continental ancestry in order to increase the sample size for this problem.

• Use theory and simulations (HapMap3 genotypes, simulated phenotypes) to develop and compare 3 methods for conducting fine-mapping in the ASW population: (1) a method that does not use local ancestry information, (2) a method that assesses evidence for causality separately for each local ancestry (0, 1 or 2 European copies) and then aggregates evidence for causality, (3) a method that assesses evidence for causality separately for each local ancestry, and also makes use of information about average local ancestry in disease cases, and then aggregates evidence for causality. Include simulations of (a) randomly chosen causal SNPs, (b) causal SNPs with unusually different LD patterns between Europeans and Africans, (c) causal SNPs with unusually large allele frequency differences between Europeans and Africans.

[From Week 5]

• Using CEU and TSI HapMap3 genotypes, simulate a phenotype in which the effect size is systematically correlated to the allele frequency difference between CEU and TSI, as would be expected under a scenario of selection for different phenotypes in different environments (see Turchin et al. 2012 Nat Genet). Use theory and simulations to evaluate the power to detect such an effect by analyzing association results (after correction for population stratification) at top associated SNPs (as in Turchin et al.), at a range of parameter settings. Then, extend the method to use association results at a larger set of SNPs (possibly even all SNPs) instead of just the top associated SNPs, and evaluate how much this improves power to detect selection. Note: it is OK to optimistically assume in this problem that correction for population stratification (e.g. using explicit CEU and TSI ancestry labels) is fully effective in removing spurious signals.

• Suppose that you are analyzing data from 2 populations that admixed g generations ago. Consider a SNP that had allele frequency p1 in POP1 and p2 in POP2 at the time of admixture. Suppose that the reference allele is selectively advantageous in the admixed population. Define selection coefficient s as the relative fitness of the reference vs. variant allele per generation in the admixed population. Let N be the sample size analyzed from the admixed population, and let θ and 1−θ denote the ancestry proportions from POP1 and POP2 in the admixed population. Use theory and simulations to investigate the power of an approach for detecting the action of natural selection via searching for unusual deviations in local ancestry in the set of N samples. Provide quantifications of (1) power to detect selection against the sickle-cell allele in African Americans with European admixture 6 generations ago, and (2) power to detect selection on a beneficial pigmentation allele in Europeans with Neanderthal admixture 1,500 generations ago.

[From Week 6]

• Using simulated data (HapMap3 genotypes, simulated phenotypes), evaluate the effectiveness (bias and standard error) of LD Score regression as a means to estimate h_g^2. Include a comparison of both weighted and unweighted LD Score regression, and include a comparison to at least two other methods as well. Include simulations of both infinitesimal (all SNPs causal) and non-infinitesimal (only a subset of SNPs causal) genetic architectures.

• Or, feel free to design your own research topic.
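
Several of the topics above rely on two summary quantities: FST between a pair of populations, and λGC as a measure of test-statistic inflation. The following is a minimal illustrative sketch in Python, not a prescribed method. It assumes the Hudson ratio-of-averages estimator for FST (one common choice; the handout does not specify an estimator) and the usual definition of λGC as the median chi-square (1 d.o.f.) statistic divided by the median of the null distribution. The function names, argument names, and input layout (plain lists of per-SNP allele frequencies and chi-square statistics) are hypothetical.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom (approx.).
CHI2_1DF_MEDIAN = 0.4549364


def hudson_fst(p1, p2, n1, n2):
    """Hudson FST (ratio of averages) between two populations.

    p1, p2: per-SNP sample allele frequencies in POP1 and POP2 (same SNP order).
    n1, n2: number of sampled chromosomes (2 x individuals) in each population;
            assumed constant across SNPs for simplicity.
    """
    if len(p1) != len(p2):
        raise ValueError("p1 and p2 must list the same SNPs")
    num = 0.0
    den = 0.0
    for a, b in zip(p1, p2):
        # Squared frequency difference, corrected for finite sample size.
        num += (a - b) ** 2 - a * (1 - a) / (n1 - 1) - b * (1 - b) / (n2 - 1)
        # Probability that two alleles drawn from different populations differ.
        den += a * (1 - b) + b * (1 - a)
    if den == 0:
        raise ValueError("no SNP is polymorphic across the two populations")
    return num / den


def lambda_gc(chisq):
    """Genomic control inflation factor from 1-d.o.f. chi-square statistics."""
    return median(chisq) / CHI2_1DF_MEDIAN


# Example: SNPs with fixed differences between populations give FST = 1.
print(hudson_fst([1.0, 1.0], [0.0, 0.0], 120, 120))  # → 1.0
```

Summing numerator and denominator across SNPs before dividing (rather than averaging per-SNP ratios) keeps the estimate stable when many SNPs are rare; whether that choice suits a given topic is left to the author of the paper.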