Advanced Population and Medical Genetics
EPI511, Spring 1, 2016
Experience 4

Please submit Python code (IPython notebook format preferred) or Perl code, and its output for each of (1)-(4), on the course website by 8:00am on Tue Feb 23. Please indicate in your submission the number of hours you spent working on this experience. This information will not affect your grade (only the average value across all students will be shared with the instructor), but will help inform the design of future experiences.

Policy on group work: OK to discuss experiences with your colleagues, but each piece of code that you write should be your own.

(1) Consider a hypothetical disease for which rs2394043 (chromosome 10) is the causal SNP. Consider a case-control association study in ASW. Define ASW individuals with genotype=0 at rs2394043 to be Cases and remaining ASW individuals to be Controls. Compute a χ²(1 dof) association statistic for rs2394043 as well as each SNP located >100kb (100,000 bp) and <10Mb (10,000,000 bp) from rs10761835. For each such SNP with χ²(1 dof) > 20, report the SNP, physical position, CEU allele frequency, YRI allele frequency, χ²(1 dof) statistic, and χ²(1 dof) statistic corrected for genome-wide % European ancestry. (Be sure to subtract the mean off of ancestry before introducing ancestry as a covariate.) Discuss.

(2) Define phenotypes of ASW individuals as in (1). Using the local ancestry (0, 1 or 2 European copies) of each ASW individual at the genomic location near rs2394043, compute (a) a case-control admixture association statistic for this locus, (b) a case-control admixture association statistic for this locus with correction for genome-wide % European ancestry, and (c) a case-only admixture association statistic for this locus, using a single value of theta (genome-wide % European ancestry) averaged across Cases. For (c), also report #Cases, average theta in Cases, and average gamma (local ancestry) in Cases.
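
As an illustration of the mean-centering note in (1), the following is a minimal Python sketch of one possible pair of statistics. It is an assumption, not the definition used in the course: the uncorrected statistic is taken to be the Armitage trend statistic N·r², and the corrected statistic regresses mean-centered ancestry out of both genotype and phenotype and uses (N − 2)·r² on the residuals. The function names `trend_chi2`, `residualize`, and `corrected_chi2` are hypothetical.

```python
import numpy as np

TOL = 1e-12  # treat variances below this as zero


def trend_chi2(genotype, phenotype):
    """Assumed form: Armitage trend statistic N * r^2 (1 dof)."""
    g = np.asarray(genotype, dtype=float)
    y = np.asarray(phenotype, dtype=float)
    if g.var() < TOL or y.var() < TOL:
        return 0.0  # monomorphic SNP or constant phenotype
    r = np.corrcoef(g, y)[0, 1]
    return len(g) * r ** 2


def residualize(x, covariate):
    """Center x, then remove the linear effect of the mean-centered covariate."""
    x = np.asarray(x, dtype=float)
    c = np.asarray(covariate, dtype=float)
    c = c - c.mean()  # subtract the mean off of ancestry first
    x = x - x.mean()
    cc = np.dot(c, c)
    if cc < TOL:
        return x  # covariate is constant: nothing to remove
    return x - (np.dot(x, c) / cc) * c


def corrected_chi2(genotype, phenotype, ancestry):
    """Assumed form: (N - 2) * r^2 between ancestry-adjusted genotype and phenotype."""
    g_adj = residualize(genotype, ancestry)
    y_adj = residualize(phenotype, ancestry)
    denom = np.dot(g_adj, g_adj) * np.dot(y_adj, y_adj)
    if denom < TOL:
        return 0.0  # nothing left after adjusting for ancestry
    r2 = np.dot(g_adj, y_adj) ** 2 / denom
    return (len(g_adj) - 2) * r2  # N - K - 1 with K = 1 covariate
```

In this sketch `genotype`, `phenotype` (1 = Case, 0 = Control), and `ancestry` (genome-wide proportion of European ancestry) are equal-length arrays over the ASW individuals. A SNP whose signal is fully explained by ancestry has a large uncorrected statistic and a corrected statistic near zero.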

(3) Consider the SNP rs3131972 on chromosome 1. Suppose that this is a causal SNP, and define 0 (of 2) CEU individuals with genotype 0 at this SNP + the first 9 (of 33) CEU individuals with genotype 1 at this SNP + the first 51 (of 77) CEU individuals with genotype 2 at this SNP to be Cases, and the other 52 CEU individuals to be Controls. Define the first 27 (of 69) YRI individuals with genotype 0 at this SNP + the first 29 (of 39) YRI individuals with genotype 1 at this SNP + all 5 (of 5) YRI individuals with genotype 2 at this SNP to be Cases, and the remaining 52 YRI individuals to be Controls. Now pretend that you don’t know which SNP is the causal SNP, but assume that there is exactly 1 causal SNP in this data.

(a) Using CEU data only, conduct a fine-mapping study at the locus. What is the posterior probability of each nearby SNP (e.g. within 50kb of rs3131972, including rs3131972 itself) being causal?

(b) Using YRI data only, conduct a fine-mapping study at the locus. What is the posterior probability of each nearby SNP being causal? Compare to (a), and discuss.

(c) Using CEU + YRI data, conduct a fine-mapping study at the locus (you can just multiply the Bayes factors). What is the posterior probability of each nearby SNP being causal? Compare to (a) and (b), and discuss.

(4) For each SNP analyzed in (3) (e.g. within 50kb of rs3131972, including rs3131972 itself) compute the odds ratio in CEU and the odds ratio in YRI. Do SNPs with large effect sizes in CEU have large effect sizes in YRI? Regress log(odds ratio in YRI) vs. log(odds ratio in CEU) (without affine term) to provide a quantitative answer to this question. Discuss.

Possible topics for short Research Paper (an aggregate list of suggested topics will be provided on Feb 23. At that time, each student should choose one topic from the aggregate list.):

• Use theory and simulations (HapMap3 genotypes, simulated phenotypes) to predict and assess the extent to which the top associated SNP at a locus in a GWAS in one continental population would replicate in a study in a different population (e.g. via slope of log odds ratio regression), under each of the following scenarios: (1) the causal SNP is a HapMap3 SNP, and all HapMap3 SNPs are typed, (2) the causal SNP is a HapMap3 SNP and is not typed or imputed, but all other HapMap3 SNPs are typed, (3) the causal SNP is a HapMap3 SNP, and this SNP along with a random subset of half of all HapMap3 SNPs is not typed or imputed, but all other HapMap3 SNPs are typed. Investigate the answer for different choices of GWAS population, replication population, causal variant allele frequency, and sample sizes. Note: it is appropriate to combine different populations with the same continental ancestry in order to increase the sample size for this problem.

• Use theory and simulations (HapMap3 genotypes, simulated phenotypes) to develop and compare 3 methods for conducting fine-mapping in the ASW population: (1) a method that does not use local ancestry information, (2) a method that assesses evidence for causality separately for each local ancestry (0, 1 or 2 European copies) and then aggregates evidence for causality, (3) a method that assesses evidence for causality separately for each local ancestry, and also makes use of information about average local ancestry in disease cases, and then aggregates evidence for causality. Include simulations of (a) randomly chosen causal SNPs, (b) causal SNPs with unusually different LD patterns between Europeans and Africans, (c) causal SNPs with unusually large allele frequency differences between Europeans and Africans.

• Or, feel free to design your own research topic.
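
Two pieces of arithmetic recur in (3), (4), and the topics above: turning per-SNP Bayes factors into posterior probabilities, and the slope of a regression without affine term. The Python sketch below illustrates both under stated assumptions. It assumes exactly one causal SNP among those analyzed (as in (3)) and, as an added assumption, an equal prior probability for every SNP. How the Bayes factors themselves are computed is not specified here; the functions take log Bayes factors as given input. All function names are hypothetical.

```python
import numpy as np


def posteriors_from_log_bf(log_bf):
    """Posterior probability that each SNP is the single causal SNP.

    Assumes equal priors, so the posterior is BF_i / sum_j BF_j.
    """
    log_bf = np.asarray(log_bf, dtype=float)
    weights = np.exp(log_bf - log_bf.max())  # subtract max to avoid overflow
    return weights / weights.sum()


def combine_log_bf(*per_population_log_bf):
    """Multiply Bayes factors across populations by summing their logs.

    Each argument lists log Bayes factors for the same SNPs in the same order.
    """
    arrays = [np.asarray(lb, dtype=float) for lb in per_population_log_bf]
    return np.sum(arrays, axis=0)


def slope_through_origin(x, y):
    """Least-squares slope of y on x with no intercept: sum(x*y) / sum(x*x)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.dot(x, y) / np.dot(x, x)
```

For example, `posteriors_from_log_bf(combine_log_bf(log_bf_ceu, log_bf_yri))` would give combined posteriors from two hypothetical arrays of per-population log Bayes factors, and `slope_through_origin(log_or_ceu, log_or_yri)` would give the slope of log(odds ratio in YRI) on log(odds ratio in CEU).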