Advanced Population and Medical Genetics EPI511, Spring 1, 2016
Experience 5

Please submit Python code (iPython notebook format preferred) or PERL code, and its output for each of (1)-(4), on the course website by 8:00am on Tue Mar 1. Please indicate in your submission the number of hours you spent working on this experience. This information will not affect your grade (only the average value across all students will be shared with the instructor) but will help inform the design of future experiences.

Policy on group work: OK to discuss experiences with your colleagues, but each piece of code that you write should be your own.

(1) Conduct a genome-wide scan for selection based on unusual population differentiation for
(a) CEU vs. TSI (assume genome-wide FST = 0.004),
(b) CHB vs. JPT (assume FST = 0.007), and
(c) CEU vs. CHB (assume FST = 0.11).
In each case, print output only for suggestive SNPs attaining a χ² (1 dof) statistic > 20, printing at most one SNP (the most significant) per chromosome. Print the allele frequencies in each population as well as the χ² (1 dof) statistic, and indicate which signals are genome-wide significant (P-value < 5 × 10⁻⁸). Discuss results of (a) vs. (b) vs. (c).

(2) (a) For each SNP printed as output of (1) (a), repeat the computation assuming that the same CEU and TSI allele frequencies were observed in large sample size (N ≫ 1/FST). Discuss.
(b) For each SNP printed as output of (1) (a), repeat the computation using CEU vs. YRI (assume FST = 0.16). Discuss.
(c) Repeat (b) assuming that the same CEU and YRI allele frequencies were observed in large sample size (N ≫ 1/FST). Discuss.

(3) How far does LD (r² > 0.5) with LCT SNP rs13404551 on chr 2 span in the CEU population (in either chromosomal direction)? Repeat the computation for the populations TSI, CHB, YRI. Why does LD vary across these populations?

(4) (a) Negative selection due to cystic fibrosis: what will the average local ancestry (all 3 ancestries) of Puerto Ricans be at the CFTR locus after many generations of admixture, based on Puerto Rican continental ancestry proportions from Week 2 slides and CFTR allele frequency of 2% in European populations?
(b) Negative selection due to sickle-cell anemia: what will the average local ancestry (all 3 ancestries) of Mexican Americans be at the HBB locus after many generations of admixture, based on Mexican American continental ancestry proportions from Week 2 slides and HBB allele frequency of 5% in African populations?

Possible topics for short Research Paper (an aggregate list of suggested topics will be provided on Feb 23. At that time, each student should choose one topic from the aggregate list.):

• Using CEU and TSI HapMap3 genotypes, simulate a phenotype in which the effect size is systematically correlated to the allele frequency difference between CEU and TSI, as would be expected under a scenario of selection for different phenotypes in different environments (see Turchin et al. 2012 Nat Genet). Use theory and simulations to evaluate the power to detect such an effect by analyzing association results (after correction for population stratification) at top associated SNPs (as in Turchin et al.), at a range of parameter settings. Then, extend the method to use association results at a larger set of SNPs (possibly even all SNPs) instead of just the top associated SNPs, and evaluate how much this improves power to detect selection. Note: it is OK to optimistically assume in this problem that correction for population stratification (e.g. using explicit CEU and TSI ancestry labels) is fully effective in removing spurious signals.

• Suppose that you are analyzing data from 2 populations that admixed g generations ago. Consider a SNP that had allele frequency p1 in POP1 and p2 in POP2 at the time of admixture. Suppose that the reference allele is selectively advantageous in the admixed population. Define selection coefficient s as the relative fitness of the reference vs. variant allele per generation in the admixed population. Let N be the sample size analyzed from the admixed population, and let θ and 1−θ denote the ancestry proportions from POP1 and POP2 in the admixed population. Use theory and simulations to investigate the power of an approach for detecting the action of natural selection via searching for unusual deviations in local ancestry in the set of N samples. Provide quantifications of (i) power to detect selection against the sickle-cell allele in African Americans with European admixture 6 generations ago, and (ii) power to detect selection on a beneficial pigmentation allele in Europeans with Neanderthal admixture 1,500 generations ago.

• Or, feel free to design your own research topic.
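
For the admixture topic above, it may help to see how the expected local ancestry at the selected SNP moves away from θ before doing any power simulation. The following is a minimal sketch under assumptions that are not part of this handout: selection is treated as haploid (multiplicative), the reference allele has fitness 1 + s relative to 1 for the variant allele, admixture is a single pulse, drift is ignored, and the ancestry label is read at the SNP itself (no recombination between them). The function name expected_local_ancestry and the example parameter values are hypothetical.

```python
def expected_local_ancestry(theta, p1, p2, s, g):
    """Expected POP1 local ancestry at the selected SNP after g generations.

    Assumptions (illustrative, not from the handout):
      - haploid (multiplicative) selection, reference allele fitness 1 + s
        versus 1 for the variant allele;
      - a single admixture pulse g generations ago, no drift;
      - local ancestry is measured at the SNP itself.
    """
    # Cumulative advantage of haplotypes carrying the reference allele.
    growth = (1.0 + s) ** g

    # Unnormalized weight of POP1-ancestry haplotypes: those carrying the
    # reference allele (initial frequency p1) grow, the rest stay at weight 1.
    pop1 = theta * (p1 * growth + (1.0 - p1))
    # Same for POP2-ancestry haplotypes (initial reference frequency p2).
    pop2 = (1.0 - theta) * (p2 * growth + (1.0 - p2))

    return pop1 / (pop1 + pop2)


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the call.
    theta, p1, p2, s, g = 0.8, 0.9, 0.1, 0.05, 6
    shift = expected_local_ancestry(theta, p1, p2, s, g) - theta
    print(f"expected deviation in POP1 local ancestry: {shift:+.4f}")
```

Under this model the deviation is zero when s = 0 or when p1 = p2, and grows with both g and |p1 − p2|. Power then depends on how this deviation compares with the genome-wide variability of local ancestry estimated from the N samples, which this sketch does not model.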