Advanced Population and Medical Genetics
EPI511, Spring 1, 2014
Experience 3

Please send a single file, containing Python or Perl code and the output it produces for each of (1)-(5), by email to Tristan Hayeck (thayeck@hsph.harvard.edu) by 8:00am on Tue Feb 18. Please indicate in your email the number of hours you spent working on this experience. This information will not affect your grade (only the average value across all students will be shared with the instructor) but will help inform the design of future experiences.

All source code should be written from scratch. Please do not use built-in functions such as linear algebra functions, functions in the numpy package in Python, etc.

Policy on group work: OK to discuss experiences with your colleagues, but each piece of code that you write should be your own.

(1) Consider a hypothetical case-control association study involving the first 80 CEU samples and the first 80 GIH samples. Label the first 60 CEU samples and the first 40 GIH samples as Cases, and the remaining samples as Controls. Compute case-control association statistics for every SNP on chromosome 14 using the Armitage Trend Test. Are association statistics inflated? What is λGC? How does this compare to what would be expected given the FST between Case and Control populations and the number of samples? Apply Genomic Control to correct for stratification. Are corrected association statistics inflated? What is λGC for the corrected statistics? Repeat the above computations using only the first 40 CEU samples (first 30 are Cases, rest are Controls) and the first 40 GIH samples (first 20 are Cases, rest are Controls). How do the results change?

(2) Consider a hypothetical eigenvector for the set of 40 CEU + 40 GIH samples from (1), which has value -1/sqrt(80) for each CEU sample and +1/sqrt(80) for each GIH sample. Recompute association statistics by correcting for this eigenvector, instead of applying Genomic Control. Are corrected association statistics inflated?
What is λGC for the corrected statistics?

(3) Consider a hypothetical case-control association study of the lactase persistence phenotype involving all 112 CEU samples and 88 TSI samples. Although this phenotype was not reported, it can be approximated by the genotype at SNP rs13404551, which is strongly correlated with the SNP rs4988235 that is known to perfectly predict lactase phenotype. Specifically, define CEU or TSI individuals with genotype=0 or 1 at rs13404551 to be Cases, and remaining CEU or TSI individuals to be Controls. Compute case-control association statistics for every SNP on chromosome 14 using the Armitage Trend Test. Are association statistics inflated? What is λGC? How does this compare to what would be expected given the FST between Case and Control populations and the number of samples? Apply Genomic Control to correct for stratification. Are corrected association statistics inflated? What is λGC for the corrected statistics?

(4) Consider a hypothetical eigenvector for the set of 200 samples from (3), which has value -1/112 for each CEU sample and +1/88 for each TSI sample, normalized to sum of squares = 1. Recompute association statistics by correcting for this eigenvector, instead of applying Genomic Control. Are corrected association statistics inflated? What is λGC for the corrected statistics?

(5) Use values of pcase (allele frequency in cases) and pcontrol (allele frequency in controls) to randomly assign diploid genotypes to a set of 56 cases + 56 controls so that the Armitage Trend Test χ² statistic is between 20 and 40. Compare it to association statistics adjusted for 5, 10, 20, 50, or 100 random eigenvectors, respectively. Discuss. (Each random eigenvector should have mean equal to 0 and sum of squares equal to 1, and should be orthogonal to every other eigenvector.)

Possible topics for short Research Paper (an aggregate list of suggested topics will be provided on Feb 25.
At that time, each student should choose one topic from the aggregate list.):

• Design a metric to determine the extent to which a population is structured by looking for inflation in the distribution of associations between SNPs located on different chromosomes, analogous to λGC. Note that SNPs on different chromosomes would be expected to be uncorrelated in an unstructured population, but potentially correlated in a structured population. Apply this metric to the union of a pair of distantly related HapMap3 populations, to the union of a pair of closely related HapMap3 populations, and to individual HapMap3 populations. Include analyses at different sample sizes to evaluate how sample size affects this statistic. Is this metric an effective way to test if a population is structured? Discuss.

• Suppose that a disease study is being conducted in a union of distinct populations, with differences in population membership between cases and controls. Compare the effectiveness of PCA correction vs. Structured Association in correcting for stratification, both genome-wide and at selected highly differentiated SNPs. For PCA correction, use a set of eigenvector(s) specified according to known population membership to appropriately model the structure. For Structured Association, define clusters based on known population membership and define association statistics as weighted z-scores (note: this is simpler than the approach described in Pritchard et al. 2000 Am J Hum Genet), e.g. z = [sqrt(N1)*z1 + sqrt(N2)*z2] / sqrt(N1+N2), where N1, N2 are sample sizes and z1, z2 are signed (normally distributed) z-scores whose square is a χ² statistic. Conduct this comparison for a union of 2 closely related populations, a union of 2 distantly related populations, and a union of 3 closely and distantly related populations.

• Consider a quantitative trait which is 100% heritable (i.e.
100% determined by genetic factors) with the phenotype of individual j equal to π_j = Σ_i α_i g_ij, where i indexes a set of causal SNPs that affect phenotype, α_i is the effect size of causal SNP i, and g_ij are normalized genotypes. Consider a population with 50% of individuals from POP1 and 50% of individuals from POP2. Let FST denote FST(POP1,POP2) at non-causal SNPs. Let FST,causal denote FST(POP1,POP2) at causal SNPs, which may be different from FST (see e.g. Chen et al. 2012 PLoS Genet). Derive a formula for how population stratification (λGC at the set of non-causal SNPs) in a disease study of sample size N in this population depends on N, FST and FST,causal. (Note that FST determines how genotype at non-causal SNPs varies with ancestry and FST,causal determines how phenotype varies with ancestry.) Demonstrate that your derivation is correct via simulations involving HapMap3 data. This should include simulations in which FST = FST,causal, as well as simulations in which the set of causal SNPs is chosen so that FST ≠ FST,causal.

• Or, feel free to design your own research topic.
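
As an illustration of the computations asked for in (1) and (3), here is a minimal sketch in standard-library Python. It assumes genotypes are coded 0/1/2 with no missing values, phenotypes are coded 1 for Cases and 0 for Controls, and the Armitage Trend Test statistic is taken in its N × r² form (r being the correlation between genotype and phenotype). The function names and the rounded median constant are illustrative choices, not part of the assignment.

```python
import math

# Approximate median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549


def armitage_trend(genotypes, phenotypes):
    """Armitage Trend Test statistic computed as N * r^2 (assumed form)."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_p = sum(phenotypes) / n
    s_gp = sum((g - mean_g) * (p - mean_p) for g, p in zip(genotypes, phenotypes))
    s_gg = sum((g - mean_g) ** 2 for g in genotypes)
    s_pp = sum((p - mean_p) ** 2 for p in phenotypes)
    if s_gg == 0 or s_pp == 0:
        return 0.0  # monomorphic SNP or constant phenotype: no signal
    return n * s_gp * s_gp / (s_gg * s_pp)


def median(values):
    """Median of a non-empty list, without any library helpers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return 0.5 * (ordered[mid - 1] + ordered[mid])


def lambda_gc(stats):
    """Median statistic divided by the 1-d.f. chi-square median."""
    return median(stats) / CHI2_1DF_MEDIAN


def genomic_control(stats):
    """Divide every statistic by lambda_GC (no cap at 1; an assumption)."""
    lam = lambda_gc(stats)
    return [s / lam for s in stats]
```

Calling armitage_trend once per SNP and passing the resulting list to lambda_gc gives the inflation factor; by construction, lambda_gc(genomic_control(stats)) is 1 whenever the median statistic is nonzero.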
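
For the eigenvector corrections in (2) and (4), one possible approach is sketched below in standard-library Python. It assumes an EIGENSTRAT-style statistic: genotype and phenotype are each mean-centered, their projection onto every eigenvector is removed, and the statistic is (N - K - 1) times the squared correlation of the residuals, with K the number of eigenvectors. The eigenvectors are assumed to have mean 0 and sum of squares 1 and to be mutually orthogonal, as specified in the exercises. The function names are hypothetical.

```python
def residualize(values, eigenvectors):
    """Mean-center values, then subtract the projection onto each eigenvector.

    Assumes each eigenvector has mean 0 and sum of squares 1, and that the
    eigenvectors are orthogonal to one another.
    """
    mean = sum(values) / len(values)
    resid = [v - mean for v in values]
    for vec in eigenvectors:
        dot = sum(r * e for r, e in zip(resid, vec))
        resid = [r - dot * e for r, e in zip(resid, vec)]
    return resid


def eigen_adjusted_stat(genotypes, phenotypes, eigenvectors):
    """(N - K - 1) * squared correlation of the adjusted genotype and phenotype."""
    g = residualize(genotypes, eigenvectors)
    p = residualize(phenotypes, eigenvectors)
    s_gp = sum(a * b for a, b in zip(g, p))
    s_gg = sum(a * a for a in g)
    s_pp = sum(b * b for b in p)
    if s_gg < 1e-12 or s_pp < 1e-12:
        return 0.0  # nothing left after adjustment
    n, k = len(genotypes), len(eigenvectors)
    return (n - k - 1) * s_gp * s_gp / (s_gg * s_pp)
```

A SNP whose genotypes are fully explained by population membership has zero residual and therefore receives a statistic of 0 under this sketch, which is the intended behavior of an ancestry correction.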
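
The random eigenvectors required in (5) can be generated in several ways; the sketch below uses Gram-Schmidt orthogonalization of Gaussian random vectors, written with the standard library only. The function name, the seed parameter, and the choice of Gaussian draws are assumptions made for illustration.

```python
import math
import random


def random_eigenvectors(n, k, seed=0):
    """Return k vectors of length n with mean 0, sum of squares 1,
    and zero dot product with one another (requires k <= n - 1)."""
    if k > n - 1:
        raise ValueError("at most n - 1 mean-zero orthogonal vectors exist")
    rng = random.Random(seed)
    basis = []
    while len(basis) < k:
        vec = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Remove the mean so the vector is orthogonal to the all-ones vector.
        mean = sum(vec) / n
        vec = [x - mean for x in vec]
        # Remove the component along each vector accepted so far.
        for b in basis:
            dot = sum(x * y for x, y in zip(vec, b))
            vec = [x - dot * y for x, y in zip(vec, b)]
        norm = math.sqrt(sum(x * x for x in vec))
        if norm < 1e-8:
            continue  # degenerate draw; try again
        basis.append([x / norm for x in vec])
    return basis
```

For the 112 samples in (5), random_eigenvectors(112, 100) returns the largest set requested. Because each accepted vector already has mean 0, subtracting multiples of earlier vectors keeps the mean at 0.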