Advanced Population and Medical Genetics
EPI511, Spring 1, 2016
Experience 3

Please submit Python code (iPython notebook format preferred) or Perl code, together with its output, for each of (1)-(4) on the course website by 8:00am on Tue Feb 16. Please indicate in your submission the number of hours you spent working on this experience. This information will not affect your grade (only the average value across all students will be shared with the instructor), but it will help inform the design of future experiences.

All source code should be written from scratch. Please do not use built-in functions such as linear algebra functions, functions in the numpy package in Python, etc.

Policy on group work: OK to discuss experiences with your colleagues, but each piece of code that you write should be your own.

(1) Consider a hypothetical case-control association study involving the first 80 CEU samples and the first 80 GIH samples. Label the first 60 CEU samples and the first 40 GIH samples as Cases, and the remaining samples as Controls. Compute case-control association statistics for every SNP on chromosome 22 using the Armitage Trend Test. Are association statistics inflated? What is λGC? How does this compare to what would be expected given the FST between Case and Control populations and the number of samples? Apply Genomic Control to correct for stratification. Are corrected association statistics inflated? What is λGC for the corrected statistics? Repeat the above computations using only the first 40 CEU samples (first 30 are Cases, rest are Controls) and the first 40 GIH samples (first 20 are Cases, rest are Controls). How do the results change?

(2) Consider a hypothetical eigenvector for the set of 160 samples from (1), which has value -1/sqrt(160) for each CEU sample and +1/sqrt(160) for each GIH sample. Recompute the chr22 association statistics by correcting for this eigenvector, instead of applying Genomic Control. Are corrected association statistics inflated?
What is λGC for the corrected statistics?

(3) Consider a hypothetical case-control association study of the lactase persistence phenotype involving all 112 CEU samples and 88 TSI samples. Although this phenotype was not reported, it can be approximated by the genotype at SNP rs13404551, which is strongly correlated with the SNP rs4988235 that is known to perfectly predict lactase phenotype. Specifically, define CEU or TSI individuals with genotype=0 or 1 at rs13404551 to be Cases, and the remaining CEU or TSI individuals to be Controls. Compute case-control association statistics for every SNP on chromosome 22 using the Armitage Trend Test. Are association statistics inflated? What is λGC? How does this compare to what would be expected given the FST between Case and Control populations and the number of samples? Describe and apply 2 different strategies to correct for stratification. Report the λGC for the corrected statistics in each case.

(4) Use values of p_case (allele frequency in cases) and p_control (allele frequency in controls) to randomly assign diploid genotypes to a set of 56 cases + 56 controls, with p_case and p_control chosen so that the Armitage Trend Test χ² statistic is between 20 and 40. Compare it to association statistics adjusted for 5, 10, 20, 50, or 100 random eigenvectors, respectively. Discuss. (Each random eigenvector should have mean 0 and sum of squares equal to 1, and should be orthogonal to every other eigenvector.)

Possible topics for the short Research Paper (an aggregate list of suggested topics will be provided on Feb 23; at that time, each student should choose one topic from the aggregate list):

• Design a metric to determine the extent to which a population is structured by looking for inflation in the distribution of associations between SNPs located on different chromosomes. Provide theoretical derivations of how this metric, when applied to data from the union of a pair of populations, is expected to vary with FST and sample size N.
Apply this metric to the union of a pair of distantly related HapMap3 populations, to the union of a pair of closely related HapMap3 populations, and to individual HapMap3 populations. Discuss how empirical results vary as a function of FST and sample size N, and whether they agree with your derivations.

• Consider a quantitative trait that is 100% heritable (i.e., 100% determined by genetic factors) with the phenotype of individual j equal to π_j = Σ_i α_i g_ij, where i indexes a set of causal SNPs that affect phenotype, α_i is the effect size of causal SNP i, and g_ij are normalized genotypes. Consider a population with 50% of individuals from POP1 and 50% of individuals from POP2. Let FST denote FST(POP1,POP2) at non-causal SNPs. Let FST,causal denote FST(POP1,POP2) at causal SNPs, which may be different from FST (see e.g. Chen et al. 2012 PLoS Genet). Derive a formula for how population stratification (quantified by λGC at the set of non-causal SNPs) in a disease study of sample size N in this population depends on N, FST and FST,causal. (Note that FST determines how genotype at non-causal SNPs varies with ancestry and FST,causal determines how phenotype varies with ancestry.) Compare your derivation to empirical results from simulations involving HapMap3 data. This should include simulations in which FST = FST,causal, as well as simulations in which the set of causal SNPs is chosen so that FST ≠ FST,causal.

• Or, feel free to design your own research topic.
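
For orientation, the quantities that recur in (1)-(3), namely the Armitage Trend Test statistic, λGC, and Genomic Control, can be sketched in plain Python with no numpy, in keeping with the from-scratch rule above. This is a minimal sketch under stated assumptions, not a reference solution: it assumes genotypes coded 0/1/2 with no missing values, phenotypes coded 1 for Cases and 0 for Controls, and the common N·r² form of the trend statistic (r being the genotype-phenotype correlation); the course may intend a slightly different normalization. All function names are hypothetical.

```python
# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def armitage_trend_chisq(genotypes, phenotypes):
    """Trend statistic for one SNP, computed as N * r^2 (assumed form).

    genotypes:  list of 0/1/2 allele counts, one per sample (no missing data)
    phenotypes: list of 1 (Case) / 0 (Control), same order as genotypes
    Returns None when the SNP or the phenotype has no variance.
    """
    n = len(genotypes)
    if n == 0 or n != len(phenotypes):
        raise ValueError("genotypes and phenotypes must be non-empty and equal length")
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    s_gg = sum((g - mean_g) ** 2 for g in genotypes)
    s_yy = sum((y - mean_y) ** 2 for y in phenotypes)
    s_gy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, phenotypes))
    if s_gg == 0 or s_yy == 0:
        return None  # monomorphic SNP or all-Case/all-Control sample
    return n * s_gy * s_gy / (s_gg * s_yy)


def median(values):
    """Median of a non-empty list, written out by hand."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return 0.5 * (ordered[mid - 1] + ordered[mid])


def lambda_gc(chisq_stats):
    """Observed median statistic divided by the expected 1-d.o.f. median."""
    return median(chisq_stats) / CHI2_1DF_MEDIAN


def genomic_control(chisq_stats):
    """Divide every statistic by lambda_GC (no flooring of lambda at 1 here)."""
    lam = lambda_gc(chisq_stats)
    return [s / lam for s in chisq_stats]
```

A typical use would be to call armitage_trend_chisq once per chromosome 22 SNP, drop the None results, and pass the remaining list to lambda_gc and genomic_control. Because Genomic Control rescales all statistics by the same factor, λGC of the corrected statistics is 1 by construction in this sketch.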