Advanced Population and Medical Genetics
EPI511, Spring 1, 2016
Suggested topics for short research paper
Please send a single .pdf file, containing your short research paper, by email to Alkes Price
(aprice@hsph.harvard.edu) by 5:00pm on Fri Mar 11. An aggregate list of suggested topics is
provided below. You may either choose one topic from this list, or choose your own topic.
Each student is strongly encouraged to schedule a 15-minute appointment with Alkes during the
week of Feb 29 – Mar 4, and should choose and begin work on their topic prior to this meeting.
The meeting can be scheduled by writing to Jill McDonald (jrmcdona@hsph.harvard.edu).
The short research paper should be 1,000-1,500 words long, and should include an abstract,
plus one figure and one table and at least 10 references (including journal name and year).
Additional subdivision into Introduction, Results, Discussion and Methods sections is optional.
For an example of a short research paper, see Lindstrom et al. 2011 Nat Genet.
Policy on group work: it is OK to discuss your experiences with colleagues, but every piece of
code and every piece of text that you write should be your own.
[From Week 1]
• Design a metric to evaluate how similar the LD patterns are in 2 populations pop1 and pop2.
How closely do differences in LD patterns correspond to genetic distances as quantified by FST,
either within or across continents? Evaluate and discuss using data from all 11 HapMap3
populations. Are differences in LD patterns between populations more closely related to drift or to
divergence? See Sved et al. 2008 Am J Hum Genet.
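As a starting point for the LD-comparison topic above, here is a minimal Python sketch of one possible metric: the correlation, across SNP pairs, of the signed genotype correlation r in pop1 with the signed r in pop2. The metric itself, the function names, and the toy genotypes are illustrative assumptions, not a prescribed method. A real analysis would restrict the SNP pairs to nearby markers and use HapMap3 genotypes.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def ld_r(geno, i, j):
    """Signed genotype correlation r between SNP columns i and j.

    geno is a list of individuals, each a list of 0/1/2 allele counts.
    """
    return pearson([row[i] for row in geno], [row[j] for row in geno])


def ld_similarity(geno1, geno2, pairs):
    """Correlation, across SNP pairs, of signed r in pop1 with signed r in pop2."""
    r1 = [ld_r(geno1, i, j) for i, j in pairs]
    r2 = [ld_r(geno2, i, j) for i, j in pairs]
    return pearson(r1, r2)


# Toy data: 4 individuals x 3 SNPs (hypothetical values, for illustration only).
pop1 = [[0, 0, 2], [1, 1, 1], [2, 2, 0], [1, 0, 1]]
# In pop2 the third SNP is associated with the other two in the opposite direction.
pop2 = [[a, b, 2 - c] for a, b, c in pop1]
pairs = list(combinations(range(3), 2))  # (0, 1), (0, 2), (1, 2)

print("pop1 vs pop1:", ld_similarity(pop1, pop1, pairs))
print("pop1 vs pop2:", ld_similarity(pop1, pop2, pairs))
```

Using signed r rather than r^2 makes the metric sensitive to reversals in the direction of LD, which is one design choice among several that could be compared.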
[From Week 2]
• Design and implement a strategy for choosing M_T ancestry-informative markers (AIMs) from a
set of M total SNPs for which training data from 2 populations is available, which is designed to
maximize FST,AIM (the FST between the 2 populations at that set of AIMs), e.g. as estimated using
independent validation samples from the same (or different but closely related) populations.
Provide theoretical derivations of how FST,AIM is expected to vary with M_T, M, the training sample
size N, and the FST between the 2 populations at the set of all M SNPs. Discuss how empirical
FST,AIM results vary as a function of these parameters, and whether they agree with your
derivations. Include both FST<0.01 and FST>0.10 pairs of populations in your analyses.
• Design a metric to determine the extent to which a population is recently admixed by looking
for evidence of LD (beyond what would be expected by chance) at a distance between markers
that is beyond the typical range of LD within a homogeneous population, but within the typical
range of admixture-LD. Apply this test to each of the 11 HapMap3 populations, and also to
simulated admixtures of HapMap3 populations. Include analyses at different sample sizes to
evaluate how sample size affects this statistic. Is this metric an effective way to test if a
population is recently admixed? Discuss.
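For the AIM-selection topic above (the first Week 2 topic), the following Python sketch shows a naive baseline: rank SNPs by per-SNP FST in the training data, keep the top markers, and then re-estimate FST at those markers in validation data. It assumes a commonly used form of Hudson's FST estimator with a ratio-of-sums combination across SNPs, with n1 and n2 taken as the number of sampled chromosomes; the estimator choice, function names, and toy allele frequencies are all assumptions made for illustration.

```python
def hudson_terms(p1, p2, n1, n2):
    """Per-SNP numerator and denominator of Hudson's FST estimator (assumed form).

    p1, p2: sample allele frequencies; n1, n2: sampled chromosomes per population.
    """
    num = (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return num, den


def fst_ratio_of_sums(freqs1, freqs2, n1, n2, snps=None):
    """FST over a SNP set as sum(numerators) / sum(denominators)."""
    idx = range(len(freqs1)) if snps is None else snps
    terms = [hudson_terms(freqs1[i], freqs2[i], n1, n2) for i in idx]
    den = sum(d for _, d in terms)
    return sum(n for n, _ in terms) / den if den > 0 else 0.0


def select_aims(freqs1, freqs2, n1, n2, m_t):
    """Indices of the m_t SNPs with the largest per-SNP training FST."""
    def per_snp(i):
        num, den = hudson_terms(freqs1[i], freqs2[i], n1, n2)
        return num / den if den > 0 else 0.0

    ranked = sorted(range(len(freqs1)), key=per_snp, reverse=True)
    return sorted(ranked[:m_t])


# Hypothetical allele frequencies at M = 4 SNPs (illustration only).
train1, train2 = [0.1, 0.5, 0.9, 0.5], [0.9, 0.5, 0.1, 0.4]
valid1, valid2 = [0.2, 0.5, 0.8, 0.45], [0.8, 0.5, 0.2, 0.45]
n_chrom = 100  # sampled chromosomes per population, in both data sets

aims = select_aims(train1, train2, n_chrom, n_chrom, m_t=2)
fst_train = fst_ratio_of_sums(train1, train2, n_chrom, n_chrom, aims)
fst_valid = fst_ratio_of_sums(valid1, valid2, n_chrom, n_chrom, aims)
print("selected AIMs:", aims)
print("FST,AIM in training data:", fst_train)
print("FST,AIM in validation data:", fst_valid)
```

Because the markers are chosen on the training data, the training estimate of FST,AIM tends to be inflated relative to the validation estimate, which is why the topic asks for independent validation samples.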
[From Week 3]
• Design a metric to determine the extent to which a population is structured by looking for
inflation in the distribution of associations between SNPs located on different chromosomes.
Provide theoretical derivations of how this metric, when applied to data from the union of a pair
of populations, is expected to vary with FST and sample size N. Apply this metric to the union of
a pair of distantly related HapMap3 populations, to the union of a pair of closely related
HapMap3 populations, and to individual HapMap3 populations. Discuss how empirical results
vary as a function of FST and sample size N, and whether they agree with your derivations.
• Consider a quantitative trait that is 100% heritable (i.e. 100% determined by genetic factors)
with the phenotype of individual j equal to π_j = Σ_i α_i g_ij, where i indexes a set of causal SNPs that
affect phenotype, α_i is the effect size of causal SNP i, and g_ij are normalized genotypes.
Consider a population with 50% of individuals from POP1 and 50% of individuals from POP2.
Let FST denote FST(POP1,POP2) at non-causal SNPs. Let FST,causal denote FST(POP1,POP2) at
causal SNPs, which may be different from FST (see e.g. Chen et al. 2012 PLoS Genet). Derive a
formula for how population stratification (quantified by λGC at the set of non-causal SNPs) in a
disease study of sample size N in this population depends on N, FST and FST,causal. (Note that
FST determines how genotype at non-causal SNPs varies with ancestry and FST,causal determines
how phenotype varies with ancestry.) Compare your derivation to empirical results from
simulations involving HapMap3 data. This should include simulations in which FST = FST,causal, as
well as simulations in which the set of causal SNPs is chosen so that FST ≠ FST,causal.
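For the stratification topic above, empirical λGC can be computed from simulated association statistics as the median statistic divided by the median of a chi-square distribution with 1 degree of freedom. The Python sketch below assumes a simple N * r^2 trend-style statistic for each non-causal SNP; the function names and toy inputs are hypothetical.

```python
from statistics import median

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df


def trend_chisq(genotypes, phenotypes):
    """Association statistic N * r^2 between one SNP and a phenotype."""
    n = len(genotypes)
    mg, mp = sum(genotypes) / n, sum(phenotypes) / n
    sgp = sum((g - mg) * (p - mp) for g, p in zip(genotypes, phenotypes))
    sgg = sum((g - mg) ** 2 for g in genotypes)
    spp = sum((p - mp) ** 2 for p in phenotypes)
    if sgg == 0 or spp == 0:
        return 0.0  # monomorphic SNP or constant phenotype carries no signal
    return n * sgp * sgp / (sgg * spp)


def lambda_gc(chisq_values):
    """Genomic control inflation factor: median statistic / null median."""
    return median(chisq_values) / CHI2_1DF_MEDIAN


# Toy example: statistics at three hypothetical non-causal SNPs.
stats = [0.1, CHI2_1DF_MEDIAN, 2.0]
print("lambda_GC:", lambda_gc(stats))
```

In a simulation, `trend_chisq` would be applied to every non-causal SNP in the merged POP1 and POP2 sample, and the resulting λGC compared with the derived formula across values of N, FST and FST,causal.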
[From Week 4]
• Use theory and simulations (HapMap3 genotypes, simulated phenotypes) to predict and assess
the extent to which the top associated SNP at a locus in a GWAS in one continental population
would replicate in a study in a different population (e.g. via slope of log odds ratio regression),
under each of the following scenarios: (1) the causal SNP is a HapMap3 SNP, and all HapMap3
SNPs are typed, (2) the causal SNP is a HapMap3 SNP and is not typed or imputed, but all
other HapMap3 SNPs are typed, (3) the causal SNP is a HapMap3 SNP, and this SNP along
with a random subset of half of all HapMap3 SNPs is not typed or imputed, but all other
HapMap3 SNPs are typed. Investigate the answer for different choices of GWAS population,
replication population, causal variant allele frequency, and sample sizes. Note: it is appropriate
to combine different populations with the same continental ancestry in order to increase the
sample size for this problem.
• Use theory and simulations (HapMap3 genotypes, simulated phenotypes) to develop and
compare 3 methods for conducting fine-mapping in the ASW population: (1) a method that does
not use local ancestry information, (2) a method that assesses evidence for causality separately
for each local ancestry (0, 1 or 2 European copies) and then aggregates evidence for causality,
(3) a method that assesses evidence for causality separately for each local ancestry, and also
makes use of information about average local ancestry in disease cases, and then aggregates
evidence for causality. Include simulations of (a) randomly chosen causal SNPs, (b) causal
SNPs with unusually different LD patterns between Europeans and Africans, (c) causal SNPs
with unusually large allele frequency differences between Europeans and Africans.
[From Week 5]
• Using CEU and TSI HapMap3 genotypes, simulate a phenotype in which the effect size is
systematically correlated to the allele frequency difference between CEU and TSI, as would be
expected under a scenario of selection for different phenotypes in different environments (see
Turchin et al. 2012 Nat Genet). Use theory and simulations to evaluate the power to detect
such an effect by analyzing association results (after correction for population stratification) at
top associated SNPs (as in Turchin et al.), at a range of parameter settings. Then, extend the
method to use association results at a larger set of SNPs (possibly even all SNPs) instead of
just the top associated SNPs, and evaluate how much this improves power to detect selection.
Note: it is ok to optimistically assume in this problem that correction for population stratification
(e.g. using explicit CEU and TSI ancestry labels) is fully effective in removing spurious signals.
• Suppose that you are analyzing data from 2 populations that admixed g generations ago.
Consider a SNP that had allele frequency p1 in POP1 and p2 in POP2 at the time of admixture.
Suppose that the reference allele is selectively advantageous in the admixed population. Define
selection coefficient s as the relative fitness of the reference vs. variant allele per generation in
the admixed population. Let N be the sample size analyzed from the admixed population, and
let θ and 1−θ denote the ancestry proportions from POP1 and POP2 in the admixed population.
Use theory and simulations to investigate the power of an approach for detecting the action of
natural selection via searching for unusual deviations in local ancestry in the set of N samples.
Provide quantifications of (1) power to detect selection against the sickle-cell allele in African
Americans with European admixture 6 generations ago, and (2) power to detect selection on a
beneficial pigmentation allele in Europeans with Neanderthal admixture 1,500 generations ago.
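For the admixture-selection topic above, a deterministic first approximation can be sketched in Python. The sketch assumes a haploid (multiplicative) model in which the reference allele has relative fitness 1+s per generation, ignores drift and errors in local ancestry inference, and uses a two-sided z-test with binomial sampling variance over 2N chromosomes; each of these is a simplifying assumption to be checked against simulations, and the function names and parameter values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def expected_local_ancestry(theta, p1, p2, s, g):
    """Expected POP1 ancestry proportion at the selected SNP after g generations.

    Haplotypes carrying the reference allele grow by a factor (1 + s) per
    generation relative to those carrying the variant allele (assumed model).
    """
    w = (1.0 + s) ** g
    p0 = theta * p1 + (1.0 - theta) * p2  # reference allele frequency at admixture
    return theta * (p1 * w + (1.0 - p1)) / (p0 * w + (1.0 - p0))


def power_local_ancestry(theta, p1, p2, s, g, n, alpha=0.05):
    """Approximate power of a two-sided z-test for a local ancestry deviation.

    The standard error uses only binomial sampling of 2n chromosomes at the
    genome-wide ancestry theta; drift since admixture would add variance.
    """
    delta = expected_local_ancestry(theta, p1, p2, s, g) - theta
    se = sqrt(theta * (1.0 - theta) / (2 * n))
    norm = NormalDist()
    z_crit = norm.inv_cdf(1.0 - alpha / 2.0)
    shift = delta / se
    return (1.0 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)


# Hypothetical parameter values, for illustration only.
for s in (0.0, 0.01, 0.05):
    pw = power_local_ancestry(theta=0.5, p1=0.9, p2=0.1, s=s, g=10, n=1000)
    print(f"s={s}: approximate power {pw:.3f}")
```

A genome-wide scan would need a much stricter alpha than the single-test default used here, and forward simulations are needed to see how far drift moves the results away from this approximation.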
[From Week 6]
• Using simulated data (HapMap3 genotypes, simulated phenotypes), evaluate the effectiveness
(bias and standard error) of LD Score regression as a means to estimate h_g^2. Include a
comparison of both weighted and unweighted LD Score regression, and include a comparison
to at least two other methods as well. Include simulations of both infinitesimal (all SNPs causal)
and non-infinitesimal (only a subset of SNPs causal) genetic architectures.
• Or, feel free to design your own research topic.
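For the LD Score regression topic above, the unweighted version can be sketched in a few lines of Python. The sketch assumes the standard model E[chi^2_j] = N * h_g^2 * l_j / M + intercept, where l_j is the LD score of SNP j, N is the sample size and M is the number of SNPs; the function names and the toy r^2 matrix are hypothetical, and the weighted version (which accounts for heteroskedasticity and correlated statistics) is not shown.

```python
def ld_scores(r2):
    """LD score of each SNP: sum of r^2 with all SNPs, including itself."""
    return [sum(row) for row in r2]


def ols(x, y):
    """Slope and intercept of an unweighted least-squares fit of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def ldsc_h2(chisq, scores, n, m):
    """Unweighted LD Score regression: returns (h_g^2 estimate, intercept)."""
    slope, intercept = ols(scores, chisq)
    return slope * m / n, intercept


# Toy check with M = 3 SNPs and statistics generated exactly from the model.
r2 = [[1.0, 0.5, 0.0],
      [0.5, 1.0, 0.2],
      [0.0, 0.2, 1.0]]
scores = ld_scores(r2)
n_samples, m_snps, true_h2 = 1000, 3, 0.3
chisq = [1.0 + n_samples * true_h2 * l / m_snps for l in scores]

h2_hat, intercept = ldsc_h2(chisq, scores, n_samples, m_snps)
print("estimated h_g^2:", h2_hat, "intercept:", intercept)
```

With noiseless input the regression recovers the simulated value; bias and standard error only become visible once the chi-square statistics come from simulated phenotypes, which is the comparison the topic asks for.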