experience3 - Broad Institute

Advanced Population and Medical Genetics
EPI511, Spring 1, 2014
Experience 3
Please send a single file, containing Python or PERL code and the output it produces for each
of (1)-(5), by email to Tristan Hayeck (thayeck@hsph.harvard.edu) by 8:00am on Tue Feb 18.
Please indicate in your email the number of hours you spent working on this experience. This
information will not affect your grade—only the average value across all students will be shared
with the instructor—but will help inform the design of future experiences.
All source code should be written from scratch. Please do not use built-in functions such as
linear algebra functions, functions in the numpy package in Python, etc.
Policy on group work: OK to discuss experiences with your colleagues, but each piece of code
that you write should be your own.
(1) Consider a hypothetical case-control association study involving the first 80 CEU samples
and the first 80 GIH samples. Label the first 60 CEU samples and 40 GIH samples as Cases,
and the remaining samples as Controls. Compute case-control association statistics for every
SNP on chromosome 14 using the Armitage Trend Test. Are association statistics inflated?
What is λGC? How does this compare to what would be expected given the FST between Case
and Control populations and the number of samples? Apply Genomic Control to correct for stratification.
Are corrected association statistics inflated? What is λGC for the corrected statistics? Repeat the
above computations using only the first 40 CEU samples (first 30 are Cases, rest are Controls)
and the first 40 GIH samples (first 20 are Cases, rest are Controls). How do the results change?
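As a starting point for (1), the following is a minimal sketch in plain Python (no numpy, in keeping with the rules above). It relies on two standard facts: the Armitage Trend Test statistic equals N times the squared correlation between genotype and phenotype, and λGC is the median χ² statistic divided by 0.4549364, the median of a χ² distribution with 1 degree of freedom. The function names, the 0/1/2 genotype coding, and the 1 = Case / 0 = Control phenotype coding are assumptions made for illustration. Handling of missing genotypes is not shown.

```python
CHISQ1_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 d.o.f.


def armitage_trend_chisq(genotypes, phenotypes):
    """Return N * r^2, where r is the genotype-phenotype correlation.

    genotypes: allele counts coded 0/1/2 (assumed coding).
    phenotypes: 1 for Case, 0 for Control (assumed coding).
    Returns None when the statistic is undefined (no variation).
    """
    n = len(genotypes)
    g_mean = sum(genotypes) / n
    p_mean = sum(phenotypes) / n
    s_gp = sum((g - g_mean) * (p - p_mean)
               for g, p in zip(genotypes, phenotypes))
    s_gg = sum((g - g_mean) ** 2 for g in genotypes)
    s_pp = sum((p - p_mean) ** 2 for p in phenotypes)
    if s_gg == 0 or s_pp == 0:
        return None  # monomorphic SNP, or all samples share one phenotype
    return n * s_gp * s_gp / (s_gg * s_pp)


def median(values):
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return 0.5 * (ordered[mid - 1] + ordered[mid])


def lambda_gc(chisq_values):
    """Median chi-square statistic divided by its expected value under the null."""
    return median(chisq_values) / CHISQ1_MEDIAN


def genomic_control(chisq_values):
    """Divide every statistic by lambda_GC (no cap at 1 is applied here)."""
    lam = lambda_gc(chisq_values)
    return [c / lam for c in chisq_values]
```

In use, one would compute `armitage_trend_chisq` for each SNP, drop the `None` results, and pass the remaining list to `lambda_gc` and `genomic_control`. Whether to leave statistics unchanged when λGC is below 1 is a design choice this sketch does not make.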
(2) Consider a hypothetical eigenvector for the set of 40 CEU + 40 GIH samples from (1), which
has value -1/sqrt(80) for each CEU sample and +1/sqrt(80) for each GIH sample. Recompute
association statistics by correcting for this eigenvector, instead of applying Genomic
Control. Are corrected association statistics inflated? What is λGC for the corrected statistics?
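One possible way to correct for an eigenvector, sketched below, follows the EIGENSTRAT-style approach: remove the component of both genotype and phenotype that lies along the eigenvector, then compute (N - K - 1) times the squared correlation of the residuals, where K is the number of eigenvectors. Whether this exact statistic is the one intended for this experience is an assumption, and the function names are illustrative. The sketch also assumes each eigenvector has mean 0 and sum of squares 1, and that multiple eigenvectors are mutually orthogonal.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def adjust(values, eigenvectors):
    """Mean-center the values, then subtract their projection on each eigenvector.

    Assumes every eigenvector has mean 0 and unit sum of squares, and that
    the eigenvectors are orthogonal to one another.
    """
    m = sum(values) / len(values)
    resid = [v - m for v in values]
    for vec in eigenvectors:
        coef = dot(resid, vec)
        resid = [r - coef * v for r, v in zip(resid, vec)]
    return resid


def adjusted_chisq(genotypes, phenotypes, eigenvectors):
    """(N - K - 1) * r^2 between ancestry-adjusted genotype and phenotype."""
    n = len(genotypes)
    k = len(eigenvectors)
    g = adjust(genotypes, eigenvectors)
    p = adjust(phenotypes, eigenvectors)
    s_gg = dot(g, g)
    s_pp = dot(p, p)
    if s_gg < 1e-12 or s_pp < 1e-12:
        return None  # nothing left after adjustment: statistic undefined
    return (n - k - 1) * dot(g, p) ** 2 / (s_gg * s_pp)
```

A SNP whose genotypes are fully explained by the eigenvector (for example, fixed for different alleles in the two populations) leaves a zero residual, which is why the function returns `None` in that case.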
(3) Consider a hypothetical case-control association study of the lactase persistence phenotype
involving all 112 CEU samples and 88 TSI samples. Although this phenotype was not reported,
it can be approximated by the genotype at SNP rs13404551, which is strongly correlated to the
SNP rs4988235 that is known to perfectly predict lactase phenotype. Specifically, define CEU
or TSI individuals with genotype=0 or 1 at rs13404551 to be Cases, and remaining CEU or TSI
individuals to be Controls. Compute case-control association statistics for every SNP on
chromosome 14 using the Armitage Trend Test. Are association statistics inflated? What is
λGC? How does this compare to what would be expected given the FST between Case and
Control populations and the number of samples? Apply Genomic Control to correct for stratification. Are
corrected association statistics inflated? What is λGC for the corrected statistics?
(4) Consider a hypothetical eigenvector for the set of 200 samples from (3), which has value
-1/112 for each CEU sample and +1/88 for each TSI sample, normalized to sum of squares = 1.
Recompute association statistics by correcting for this eigenvector, instead of applying Genomic
Control. Are corrected association statistics inflated? What is λGC for the corrected statistics?
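The eigenvector described in (4) can be built as sketched below. The function name is illustrative, and the sketch assumes the samples are ordered with all members of the first population before all members of the second. Because the raw values are -1/n1 and +1/n2, the vector sums to zero before and after normalization.

```python
import math


def population_eigenvector(n_first, n_second):
    """Vector with value -1/n_first for the first population and +1/n_second
    for the second, rescaled so that its sum of squares equals 1."""
    raw = [-1.0 / n_first] * n_first + [1.0 / n_second] * n_second
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


# 112 CEU samples followed by 88 TSI samples, as in (3) and (4).
ceu_tsi = population_eigenvector(112, 88)
```

With equal population sizes this reduces to the eigenvector of (2): `population_eigenvector(40, 40)` gives -1/sqrt(80) for the first 40 entries and +1/sqrt(80) for the rest.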
(5) Use values of pcase (allele frequency in cases) and pcontrol (allele frequency in controls) to
randomly assign diploid genotypes to a set of 56 cases + 56 controls so that the Armitage Trend
Test χ² statistic is between 20 and 40. Compare it to association statistics adjusted for 5, 10, 20,
50, or 100 random eigenvectors, respectively. Discuss. (The random eigenvectors should have
mean equal to 0 and sum of squares equal to 1, and each should be orthogonal to every other eigenvector.)
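For (5), one way to produce the inputs is sketched below: genotypes are drawn as two independent alleles per individual at a given frequency, and random eigenvectors are built by Gram-Schmidt orthogonalization of mean-centered Gaussian vectors. This is only one of several valid constructions. The function names, the use of Gaussian draws, and the seed are assumptions for illustration.

```python
import math
import random


def sample_genotypes(n, freq, rng):
    """Diploid genotypes (0/1/2): two independent allele draws per individual."""
    return [(rng.random() < freq) + (rng.random() < freq) for _ in range(n)]


def random_eigenvectors(n_samples, k, rng):
    """Return k vectors with mean 0, sum of squares 1, mutually orthogonal.

    Requires k <= n_samples - 1, since the vectors live in the subspace
    of mean-zero vectors.
    """
    vectors = []
    while len(vectors) < k:
        vec = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        m = sum(vec) / n_samples
        vec = [v - m for v in vec]  # enforce mean 0
        for prev in vectors:  # remove components along earlier vectors
            coef = sum(a * b for a, b in zip(vec, prev))
            vec = [a - coef * b for a, b in zip(vec, prev)]
        norm = math.sqrt(sum(v * v for v in vec))
        if norm > 1e-8:  # skip the (unlikely) degenerate draw
            vectors.append([v / norm for v in vec])
    return vectors


rng = random.Random(511)  # fixed seed so results can be reproduced
cases = sample_genotypes(56, 0.5, rng)     # 0.5 stands in for pcase
controls = sample_genotypes(56, 0.2, rng)  # 0.2 stands in for pcontrol
eigenvectors = random_eigenvectors(112, 100, rng)
```

The frequencies 0.5 and 0.2 are placeholders. To satisfy the requirement that the χ² statistic falls between 20 and 40, one option is to redraw the genotypes (or adjust pcase and pcontrol) until the computed statistic lands in that range. The nested lists for 5, 10, 20, and 50 eigenvectors can be taken as prefixes of the 100 generated here.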
Possible topics for the short Research Paper (an aggregate list of suggested topics will be provided
on Feb 25; at that time, each student should choose one topic from the aggregate list):
• Design a metric to determine the extent to which a population is structured by looking for
inflation in the distribution of associations between SNPs located on different chromosomes,
analogous to λGC. Note that SNPs on different chromosomes would be expected to be
uncorrelated in an unstructured population, but potentially correlated in a structured population.
Apply this metric to the union of a pair of distantly related HapMap3 populations, to the union of
a pair of closely related HapMap3 populations, and to individual HapMap3 populations. Include
analyses at different sample sizes to evaluate how sample size affects this statistic. Is this
metric an effective way to test if a population is structured? Discuss.
• Suppose that a disease study is being conducted in a union of distinct populations, with
differences in population membership between cases and controls. Compare the effectiveness
of PCA correction vs. Structured Association in correcting for stratification, both genome-wide
and at selected highly differentiated SNPs. For PCA correction, use a set of eigenvector(s)
specified according to known population membership to appropriately model the structure. For
Structured Association, define clusters based on known population membership and define
association statistics as weighted z-scores (note: this is simpler than the approach described in
Pritchard et al. 2000 Am J Hum Genet), e.g. z = [sqrt(N1)*z1 + sqrt(N2)*z2] / sqrt(N1+N2) where
N1, N2 are sample sizes and z1, z2 are signed (normally distributed) z-scores whose square is
a chisq statistic. Conduct this comparison for a union of 2 closely related populations, a union
of 2 distantly related populations, and a union of 3 closely and distantly related populations.
• Consider a quantitative trait which is 100% heritable (i.e. 100% determined by genetic factors)
with the phenotype of individual j equal to π_j = Σ_i α_i g_ij, where i indexes a set of causal SNPs that
affect phenotype, α_i is the effect size of causal SNP i, and g_ij are normalized genotypes.
Consider a population with 50% of individuals from POP1 and 50% of individuals from POP2.
Let FST denote FST(POP1,POP2) at non-causal SNPs. Let FST,causal denote FST(POP1,POP2) at
causal SNPs, which may be different from FST (see e.g. Chen et al. 2012 PLoS Genet). Derive a
formula for how population stratification (lambda_GC at the set of non-causal SNPs) in a
disease study of sample size N in this population depends on N, FST and FST,causal. (Note that
FST determines how genotype at non-causal SNPs varies with ancestry and FST,causal determines
how phenotype varies with ancestry.) Demonstrate that your derivation is correct via
simulations involving HapMap3 data. This should include simulations in which FST = FST,causal, as
well as simulations in which the set of causal SNPs is chosen so that FST ≠ FST,causal.
• Or, feel free to design your own research topic.
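For the Structured Association topic above, the weighted z-score formula z = [sqrt(N1)*z1 + sqrt(N2)*z2] / sqrt(N1+N2) can be sketched as follows. The function names are illustrative. Extending the formula to three or more clusters by summing sqrt(Nk)*zk over clusters and dividing by the square root of the total sample size is an assumption, since the topic description gives only the two-cluster case.

```python
import math


def signed_z(chisq, effect_sign):
    """Signed z-score whose square is the given 1-d.o.f. chi-square statistic.

    effect_sign should be +1 or -1, e.g. the sign of the genotype-phenotype
    covariance within the cluster.
    """
    return effect_sign * math.sqrt(chisq)


def combined_z(z_scores, sample_sizes):
    """Sample-size-weighted combination of per-cluster z-scores."""
    numerator = sum(math.sqrt(n) * z for z, n in zip(z_scores, sample_sizes))
    return numerator / math.sqrt(sum(sample_sizes))
```

Squaring the combined z gives a statistic that can be compared with the PCA-corrected χ² statistics. Opposite-signed effects in the two clusters cancel, which is the reason the z-scores must carry a sign.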