Analysis of HapMap Phase II

Raw Data
Our data comes from two sources:

• the phase II (rel #21) phased HapMap “all” and “chimp” datasets
(http://www.hapmap.org/downloads/phasing/2006-07_phaseII/; the chimp dataset is not
public yet, so we use the files at /fg/wgas2/hapmap/rel21/phased/)
• the preliminary cM distance maps derived from this data, given to us by Simon Myers
(located at
/cvar/research/smyers/HapMapPhase2/HapMapphaseII/genetic_map/rate_all.tar.gz)
The “all” dataset has phased data for all three populations, including SNPs that were not
genotyped in all populations. The SNP positions are given with respect to NCBI build
35.1 (hg17, May 2004), and the phased haplotypes are all relative to that build’s forward
strand [the unphased data is available both with respect to the hg17 forward strand and
the rs record of dbSNP, but we don’t use the unphased data]. The SNPs are coded as 0
and 1; the associated “legend” file can be used to translate a 0/1 into an A/C/G/T base.
The “chimp” dataset contains only SNPs that were genotyped in all populations, encoded as
0 = ancestral and 1 = derived [check how they came up with this, something along the
lines of “alignment with chimp genome”].
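The 0/1 coding described above can be illustrated with a minimal Python sketch. It assumes the legend has already been read into a list of (allele for 0, allele for 1) pairs, one per SNP, and that a haplotype is a plain string of codes; both are illustrative in-memory forms, not the actual on-disk layout of the HapMap files, and decode_haplotype is a hypothetical helper name.

```python
def decode_haplotype(haplotype, legend):
    """Translate a 0/1 haplotype string into bases.

    `legend` is a list of (allele_for_0, allele_for_1) pairs, one per SNP,
    in the same order as the haplotype. This in-memory form is an
    assumption made for illustration only.
    """
    if len(haplotype) != len(legend):
        raise ValueError("haplotype and legend must cover the same SNPs")
    bases = []
    for code, (allele0, allele1) in zip(haplotype, legend):
        if code == "0":
            bases.append(allele0)
        elif code == "1":
            bases.append(allele1)
        else:
            # Pass through anything else unchanged, e.g. a '-' placeholder.
            bases.append(code)
    return "".join(bases)


legend = [("A", "G"), ("C", "T"), ("G", "T")]
print(decode_haplotype("010", legend))  # prints ATG
```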
The cM distance maps specify the recombination rate between successive genotyped
SNPs (as far as I can tell, only SNPs that are polymorphic in at least one population
are included in this list). The first column lists the SNPs used in the analysis. The next
three columns give the recombination rate for each population between the SNP listed on
that line and the one on the next line. The next column is the mean of the three
population-specific recombination rates. The final column lists the genetic distance from
the first SNP in the file to the current one.
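As a rough illustration of the six-column layout just described, the following Python sketch parses one whitespace-separated line and checks that the fifth column is the mean of the three population rates. The field names, the order of the populations and the sample line are all assumptions; the real files may have a header or use different separators.

```python
from collections import namedtuple

# Field names are hypothetical; the population order is not known from the text.
RateRow = namedtuple(
    "RateRow", "snp rate_pop1 rate_pop2 rate_pop3 mean_rate distance"
)


def parse_rate_line(line):
    """Split one map line into its six columns."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}")
    numbers = [float(x) for x in fields[1:]]
    return RateRow(fields[0], *numbers)


def mean_is_consistent(row, tol=1e-6):
    """Check that the mean column matches the three population rates."""
    expected = (row.rate_pop1 + row.rate_pop2 + row.rate_pop3) / 3.0
    return abs(expected - row.mean_rate) <= tol


row = parse_rate_line("snp_a 0.30 0.60 0.90 0.60 0.0")
print(row.snp, mean_is_consistent(row))
```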
Three populations are included in the HapMap, with equal numbers of males and
females:
• CEU (CEPH): Utah residents of European descent, 90 people genotyped in
mother-father-child trios for phasing, yielding 60 independent samples (mother &
father) = 120 chromosomes.
• YRI: Yorubans from Western Africa, 90 people in trios = 120 chromosomes.
• JPT+CHB: 45 completely independent samples each from Japan (JPT) and Han
Chinese (CHB) populations = 180 chromosomes. Phasing is a lot more error-prone in
this population, since there are no trios to constrain it.
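The chromosome counts above follow from simple arithmetic, sketched here in Python. The function names are illustrative only.

```python
def chromosomes_from_trios(people):
    """Independent chromosomes from mother-father-child trios.

    Only the two parents of each trio are independent samples, and each
    parent contributes two chromosomes.
    """
    if people % 3 != 0:
        raise ValueError("trio samples must come in multiples of three")
    parents = (people // 3) * 2
    return parents * 2


def chromosomes_from_unrelated(people):
    """Every unrelated individual contributes two chromosomes."""
    return people * 2


print(chromosomes_from_trios(90))            # CEU or YRI: prints 120
print(chromosomes_from_unrelated(45 + 45))   # JPT+CHB: prints 180
```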
There are 22 autosomes for which the data layout is straightforward. The data for the X
and Y chromosomes is more complicated. Two portions of the Y chromosome (the
pseudoautosomal regions, par1 and par2) are homologous with the X chromosome, and
so are diploid in all the samples. The bulk of the X chromosome, however, is haploid in
all the males. Thus, the X chromosome is divided into three segments:
• par1: bases 90,118-2,688,701 (~1500 SNPs)
• non-par: bases 2,693,518-154,493,116 (~110,000 SNPs). In the non-par files,
the haplotype for the males is a series of dashes ('-'), not 0s and 1s.
• par2: bases 154,532,542-154,821,623 (~230 SNPs)
The non-homologous portion of the Y chromosome is not genotyped in the HapMap:
since it doesn’t recombine, it’s not useful for analyses that depend on LD breakdown.
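The three X segments can be captured in a small lookup, sketched below in Python using the base ranges listed above. Treating the ranges as inclusive and returning None for positions in the gaps between segments are assumptions of this sketch.

```python
# Segment boundaries on X (hg17 coordinates), as listed in the text.
X_SEGMENTS = [
    ("par1", 90_118, 2_688_701),
    ("non-par", 2_693_518, 154_493_116),
    ("par2", 154_532_542, 154_821_623),
]


def x_segment(position):
    """Return the name of the X segment containing `position`, or None."""
    for name, start, end in X_SEGMENTS:
        if start <= position <= end:
            return name
    return None


print(x_segment(100_000_000))  # prints non-par
```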
General framework
The script “do_analysis.pl” in the hapmap_phaseII/scripts folder guides the analysis of
the HapMap data. It breaks the analysis down into 16 stages, many of which depend on
earlier stages having completed. When possible, jobs are submitted to LSF instead of
being run at the command line.
You call the script from within the sweep2 folder as follows:
hapmap_phaseII/scripts/do_analysis.sh <stage_num> [--pops <pop list>]
[--chroms <chrom list>] [--X] [--to-X | --to-genome]
The first 10-20 lines of the script specify the options for all the different tests to be
performed, and are described in more detail below. The --pops option specifies which
populations are to be analyzed: by default, the CEU, YRI and JPT+CHB populations are
analyzed. The --chroms option does the same for chromosomes: its default value is “1 2 3
... 21 22 23”.
The --X option is a shortcut for "--pops 'CEU_X YRI_X JPT+CHB_X' --chroms 23".
It’s unclear whether the X chromosome should be compared against the autosomal
regions of the genome or against itself (a lot more selection has occurred on X, which
argues that the distribution of scores in X is not neutral). The option '--to-X' compares
the X chromosome against itself, whereas the '--to-genome' option compares X against
the genome.
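The defaults and the --X shortcut described above amount to the following resolution logic, shown as a Python sketch. This is not the script's actual code; resolve_targets is a hypothetical helper, and letting --X override any explicit --pops or --chroms value is an assumption.

```python
DEFAULT_POPS = ["CEU", "YRI", "JPT+CHB"]
DEFAULT_CHROMS = [str(c) for c in range(1, 24)]  # "1" .. "23"
X_POPS = ["CEU_X", "YRI_X", "JPT+CHB_X"]


def resolve_targets(pops=None, chroms=None, use_x=False):
    """Return the (populations, chromosomes) a run would cover.

    Assumption: --X takes precedence over explicit --pops/--chroms.
    """
    if use_x:
        return X_POPS[:], ["23"]
    chosen_pops = list(pops) if pops else DEFAULT_POPS[:]
    chosen_chroms = list(chroms) if chroms else DEFAULT_CHROMS[:]
    return chosen_pops, chosen_chroms


print(resolve_targets(use_x=True))
```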
Stage 1: Importing the data
[--pops option ignored]
hapmap_phaseII/scripts/do_analysis.sh 1 [--chroms <chrom list>]
Options:
• phased_file_root: The folder where the HapMap phase II "all" files are located.
• chimp_file_root: The folder where the HapMap phase II "chimp" files are located.
• recomb_rates_root: The folder where the HapMap phase II "recombination rates"
files are located.
Stage 1 creates a Sweep project for each chromosome in the genome inside
hapmap_phaseII/data. For the 22 autosomes and the par1/par2 regions of X (chrom 23),
the CEU data is imported into population “CEU”, YRI into “YRI” and JPT+CHB into
“JPT+CHB”. For the non-par region of X, the CEU data is imported into population
“CEU_X”, YRI into “YRI_X” and JPT+CHB into “JPT+CHB_X”. The distinction is
important because Sweep requires all genotyped SNPs in the same population to have the
same number of samples. Most stages of the analysis thus have to be run twice, as
follows:

Run stage N for autosomal regions of HapMap:
hapmap_phaseII/scripts/do_analysis.sh N

Run stage N for X:
hapmap_phaseII/scripts/do_analysis.sh N --pops "CEU_X YRI_X JPT+CHB_X" --chroms 23
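When driving both runs from another program, the population list must reach the script as a single quoted argument. The Python sketch below builds the two argument lists for a given stage and prints them in shell-safe form; it only constructs the commands and does not run them, and stage_commands is a hypothetical helper.

```python
import shlex

SCRIPT = "hapmap_phaseII/scripts/do_analysis.sh"


def stage_commands(stage):
    """Return the autosomal and X argument lists for one stage."""
    autosomal = [SCRIPT, str(stage)]
    x_run = [
        SCRIPT, str(stage),
        "--pops", "CEU_X YRI_X JPT+CHB_X",  # one argument, spaces included
        "--chroms", "23",
    ]
    return autosomal, x_run


for argv in stage_commands(6):
    # shlex.join quotes the multi-word population list for the shell.
    print(shlex.join(argv))
```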
Stages 2-5: XPop Analysis
hapmap_phaseII/scripts/do_analysis.sh <stage> --pops "<pop1> <pop2>"
[--chroms <chrom list>]
Options:
• xpop_options: options for the CrossPopAllEHH command (see Sweep docs)
• compare_X_to_genome: "true" if the background to use for X is the XPop
distribution for chromosomes 1-22 (and the par1/par2 segments of X), "false" if it's the
non-par segment of X only. If "true", the final analysis file name will contain
"vs1-22".
This test considers only SNPs common to <pop1> and <pop2>.
Stage 2 calculates the quantity (Integral AllEHH_<pop1> dx) / (Integral
AllEHH_<pop2> dx) starting at every SNP, and extending either to the left or to the right
up to the point where AllEHH drops to a certain level. The results are stored in the files
hapmap_phaseII/analysis/chr<chr>/xpop_allehh_<pop1>_<pop2>.tsv
Table columns in the output:
“genes within 50 kb…”: derived from RefSeq data, which is automatically downloaded
from the UCSC website and placed in the data directory under the species. See
GeneInfoTable.java.
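To make the stage 2 quantity concrete, here is a minimal Python sketch of a ratio of integrated EHH curves. It is not Sweep's CrossPopAllEHH implementation: the trapezoidal rule, the cutoff value, truncating each population's integral on its own curve, and taking the logarithm of the ratio are all assumptions made for illustration.

```python
import math


def integrate_ehh(distances, ehh, cutoff):
    """Trapezoidal area under an EHH curve, moving outward from the core
    SNP and stopping at the first point where EHH falls below `cutoff`."""
    area = 0.0
    for i in range(1, len(distances)):
        if ehh[i] < cutoff:
            break
        width = abs(distances[i] - distances[i - 1])
        area += 0.5 * (ehh[i] + ehh[i - 1]) * width
    return area


def xpop_log_ratio(distances, ehh_pop1, ehh_pop2, cutoff=0.05):
    """ln of (integral for pop1) / (integral for pop2); cutoff is assumed."""
    area1 = integrate_ehh(distances, ehh_pop1, cutoff)
    area2 = integrate_ehh(distances, ehh_pop2, cutoff)
    if area1 <= 0 or area2 <= 0:
        raise ValueError("both integrals must be positive")
    return math.log(area1 / area2)


distances = [0.0, 1.0, 2.0]
slow_decay = [1.0, 0.8, 0.6]   # haplotypes stay long in pop1
fast_decay = [1.0, 0.5, 0.2]   # haplotypes break down quickly in pop2
print(xpop_log_ratio(distances, slow_decay, fast_decay))
```

A positive value indicates longer haplotypes in the first population, a negative value in the second.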
Stage 3 calculates the background XPop distribution, grouping together all the
chromosomes specified by the --chroms option (1-22 + par1/par2 of X by default), and
stores the result in hapmap_phaseII/analysis/xpop_background_<pop1>_<pop2>.tsv.
Stage 4 annotates the analysis files from stage 2 with the number of standard
deviations that each AllEHH log-ratio value is from the background mean. The results
are stored in
hapmap_phaseII/chr<chr>/xpop_significance_<pop1>_<pop2>.tsv.
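Stages 3 and 4 together amount to standardizing each log-ratio against a pooled background. A minimal Python sketch of that idea follows; using the population standard deviation rather than the sample one is an assumption, and the helper names are hypothetical.

```python
import statistics


def background_stats(log_ratios):
    """Mean and standard deviation of the pooled background values."""
    return statistics.mean(log_ratios), statistics.pstdev(log_ratios)


def deviations(log_ratios, mean, sd):
    """Number of standard deviations each value lies from the mean."""
    if sd == 0:
        raise ValueError("background has zero spread")
    return [(x - mean) / sd for x in log_ratios]


background = [-1.0, 0.0, 1.0]
mean, sd = background_stats(background)
print(deviations([mean + 2 * sd], mean, sd))
```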
Stage 5 identifies the top significance scores in every chromosome, collects them into one
file and sorts them, most significant score first. The most significant scores in every
chromosome (abs(logratio) > 3.5) are stored [unsorted] in the files
hapmap_phaseII/chr<chr>/xpop_significance_<pop1>_<pop2>.tsv. The highest scores
among all the chromosomes are written out (sorted) into
hapmap_phaseII/xpop_sig_scores_<pop1>_<pop2>.tsv. A list of all the significance
scores among all chromosomes is written out to
hapmap_phaseII/xpop_sig_scores_<pop1>_<pop2>.tsv; this list is useful for looking at
the genomewide distribution of scores, to see whether or not there's a skew towards one
population.
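The selection and sorting step can be sketched in Python as below, using the 3.5 threshold quoted above. The row layout (chromosome, SNP, score) is an assumption, and ranking by absolute value is one reading of "most significant score first".

```python
def top_scores(rows, threshold=3.5):
    """Keep rows whose |score| exceeds the threshold, most extreme first.

    `rows` are (chrom, snp, score) tuples; this layout is assumed.
    """
    hits = [row for row in rows if abs(row[2]) > threshold]
    return sorted(hits, key=lambda row: abs(row[2]), reverse=True)


rows = [("1", "snp_a", 3.6), ("2", "snp_b", -4.2), ("3", "snp_c", 1.0)]
print(top_scores(rows))
```

Keeping the sign in the output makes it possible to see which population each extreme score favours.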
Stages 6-9, 9.5, 9.75, 9.825 and 9.9375: LRH Analysis
hapmap_phaseII/scripts/do_analysis.sh <stage>
Options:
• lrh_options: options for the AnalyzeCores command (see Sweep docs)
• num_freq_bins: number of bins to use when calculating the frequency-dependent
ln(REHH) background distribution.
• lrh_window_size: size of windows in which to group LRH scores (for stages 9.5
and 9.75)
• lrh_high_fraction: fraction of SNPs that must be significant in a window to
declare selection in that window (for stages 9.5 and 9.75)
• lrh_windowed_threshold_lnrehh_dev: the minimum number of standard
deviations above the background distribution mean that an ln(REHH) score has to
be to be declared significant.
• lrh_tag: the prefix to attach to the filenames of windowed LRH analysis results.
• lrh_sliding_tag: the prefix to attach to the filenames of sliding window LRH
analysis results.
Stage 6 runs the LRH analysis on every SNP of every chromosome of every population,
and stores the results in hapmap_phaseII/analysis/chr<chr>/lrh_<pop>.tsv.
Stage 7 calculates the mean and standard deviation of ln(EHH) and ln(REHH) scores in
every frequency bin (using the populations and chromosomes specified by the --chroms
and --pops options), and stores the results in
hapmap_phaseII/analysis/lrh_background_<pop>.tsv.
Stage 8 annotates the analysis files from stage 6 with ln(EHH) and ln(REHH)
significance scores and the associated p-value and -log_10(p-value). The results are
stored in the files hapmap_phaseII/analysis/chr<chr>/lrh_significance_<pop>.tsv.
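The frequency-binned background of stage 7 and the annotation of stage 8 can be sketched in Python as follows. Equal-width bins, the population standard deviation, and a one-sided p-value from a normal approximation are all assumptions of this sketch; the text does not say how the actual p-values are derived.

```python
import math
import statistics
from collections import defaultdict


def freq_bin(freq, num_bins):
    """Index of the equal-width frequency bin; 1.0 falls in the last bin."""
    return min(int(freq * num_bins), num_bins - 1)


def bin_background(records, num_bins):
    """records: (frequency, ln_rehh) pairs. Returns {bin: (mean, sd)}."""
    by_bin = defaultdict(list)
    for freq, value in records:
        by_bin[freq_bin(freq, num_bins)].append(value)
    return {
        b: (statistics.mean(v), statistics.pstdev(v)) for b, v in by_bin.items()
    }


def upper_tail_p(z):
    """One-sided p-value under a standard normal (an assumed model)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def annotate(freq, value, background, num_bins):
    """Return (deviation, p-value, -log10 p) for one score."""
    mean, sd = background[freq_bin(freq, num_bins)]
    if sd == 0:
        raise ValueError("bin has zero spread")
    z = (value - mean) / sd
    p = upper_tail_p(z)
    return z, p, -math.log10(p)


records = [(0.12, 1.0), (0.13, 3.0), (0.56, 2.0), (0.58, 4.0)]
background = bin_background(records, num_bins=20)
print(annotate(0.12, 4.0, background, num_bins=20))
```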
Stage 9 identifies the top significance scores in every chromosome, collects them into one
file and sorts them, most significant score first. The most significant scores in every
chromosome (ln(REHH) deviation > 2.5) are stored [unsorted] in the files
hapmap_phaseII/chr<chr>/lrh_significance_<pop>.tsv. The highest scores among all the
chromosomes are written out (sorted) into
hapmap_phaseII/lrh_all_significance_<pop>.tsv. A list of all the significance scores
among all chromosomes is written out to hapmap_phaseII/lrh_sig_scores_<pop>.tsv;
this list is useful for looking at the genomewide distribution of scores, to see whether or
not there's a skew in one direction.
Stage 9.5 partitions each chromosome into windows of size lrh_window_size, and counts
the fraction of significance scores in each window that are above
lrh_windowed_threshold_lnrehh_dev. The results are stored in
hapmap_phaseII/analysis/chr<chr>/<lrh_tag>_multi_regions_<pop>.tsv.
Stage 9.75 collects the per-chromosome results of stage 9.5 into a genomewide dataset,
sorts it with highest fraction of significant SNPs first, then outputs the results into
hapmap_phaseII/analysis/<lrh_tag>_multi_regions_<pop>.tsv
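The windowing in stages 9.5 and 9.75 can be sketched in Python as below. Aligning windows to position 0, the (position, deviation) input layout and the helper names are assumptions for illustration.

```python
from collections import defaultdict


def window_fractions(scores, window_size, threshold):
    """scores: (position, deviation) pairs for one chromosome.

    Returns {window_start: (n_high, n_total, fraction)} for
    non-overlapping windows aligned to position 0 (an assumption).
    """
    counts = defaultdict(lambda: [0, 0])
    for position, deviation in scores:
        start = (position // window_size) * window_size
        counts[start][1] += 1
        if deviation > threshold:
            counts[start][0] += 1
    return {
        start: (high, total, high / total)
        for start, (high, total) in sorted(counts.items())
    }


def rank_windows(per_chrom):
    """Merge per-chromosome windows and sort by fraction, highest first."""
    rows = [
        (chrom, start, *stats)
        for chrom, windows in per_chrom.items()
        for start, stats in windows.items()
    ]
    return sorted(rows, key=lambda row: row[4], reverse=True)


scores = [(100, 3.0), (900, 1.0), (1500, 3.0), (1700, 2.9)]
windows = window_fractions(scores, window_size=1000, threshold=2.5)
print(rank_windows({"1": windows}))
```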
Stage 9.825 partitions each chromosome into windows of size 2*lrh_window_size,
overlapping by lrh_window_size, and counts the fraction of significance scores in each
window that are above lrh_windowed_threshold_lnrehh_dev. This stage depends on
stage 9.5 having been run. The results are stored in
hapmap_phaseII/analysis/chr<chr>/<lrh_sliding_tag>_multi_regions_<pop>.tsv.
Stage 9.9375 collects the per-chromosome results of stage 9.825 into a genomewide
dataset, sorts it with highest fraction of significant SNPs first, then outputs the results into
hapmap_phaseII/analysis/<lrh_sliding_tag>_multi_regions_<pop>.tsv
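One plausible reason for the dependency on stage 9.5 is that a window of size 2*lrh_window_size stepping by lrh_window_size is just two adjacent non-overlapping windows added together. The Python sketch below shows that construction; whether the script actually works this way is an assumption, and sliding_from_fixed is a hypothetical helper.

```python
def sliding_from_fixed(fixed, window_size):
    """Combine adjacent fixed windows into overlapping double-width windows.

    fixed: {window_start: (n_high, n_total)} for non-overlapping windows.
    Returns {start: fraction} for windows [start, start + 2 * window_size).
    """
    # A sliding window is worth reporting if either of its halves has data.
    starts = set(fixed) | {s - window_size for s in fixed if s >= window_size}
    sliding = {}
    for start in sorted(starts):
        high_a, total_a = fixed.get(start, (0, 0))
        high_b, total_b = fixed.get(start + window_size, (0, 0))
        total = total_a + total_b
        if total:
            sliding[start] = (high_a + high_b) / total
    return sliding


fixed = {0: (1, 2), 1000: (2, 2), 3000: (0, 4)}
print(sliding_from_fixed(fixed, window_size=1000))
```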
Stages 10-15: iHS Analysis
hapmap_phaseII/scripts/do_analysis.sh <stage>
Options:
• ihs_options: options for the iHS command (see Sweep docs)
• num_freq_bins: number of bins to use when calculating the frequency-dependent
iHH logratio background distribution.
• ihs_window_size: size of windows in which to group iHS scores (for stages 13
and 14)
• ihs_high_fraction: fraction of SNPs that must be significant in a window to
declare selection in that window (for stages 13 and 14)
• ihs_windowed_abs_iHS_threshold: the minimum absolute value that an iHS score
must have to be declared significant.
• ihs_tag: the prefix to attach to the filenames of windowed iHS analysis results.
• ihs_sliding_tag: the prefix to attach to the filenames of sliding window iHS
analysis results.
Stage 10 runs the iHS analysis on every SNP that has ancestral allele information, for
every chromosome and every population, and stores the results in
hapmap_phaseII/analysis/chr<chr>/ihs_<pop>.tsv.
Stage 11 calculates the mean and standard deviation of the unstandardised iHS scores in
every frequency bin (using the populations and chromosomes specified by the --chroms
and --pops options), and stores the results in
hapmap_phaseII/analysis/ihs_background_<pop>.tsv.
Stage 12 annotates the analysis files from stage 10 with iHS scores and the associated
p-value and -log_10(p-value). The results are stored in the files
hapmap_phaseII/analysis/chr<chr>/ihs_significance_<pop>.tsv.
Stage 13 partitions each chromosome into windows of size ihs_window_size, and counts
the fraction of iHS scores in each window whose absolute value is above
ihs_windowed_abs_iHS_threshold. The results are stored in
hapmap_phaseII/analysis/chr<chr>/<ihs_tag>_multi_regions_<pop>.tsv.
Stage 14 collects the per-chromosome results of stage 13 into a genomewide dataset,
sorts it with highest fraction of significant SNPs first, then outputs the results into
hapmap_phaseII/analysis/<ihs_tag>_multi_regions_<pop>.tsv
Stage 14.5 partitions each chromosome into windows of size 2*ihs_window_size,
overlapping by ihs_window_size, and counts the fraction of significance scores in each
window that are above ihs_windowed_abs_iHS_threshold. This stage depends on stage
13 having been run. The results are stored in
hapmap_phaseII/analysis/chr<chr>/<ihs_sliding_tag>_multi_regions_<pop>.tsv.
Stage 14.75 collects the per-chromosome results of stage 14.5 into a genomewide dataset,
sorts it with highest fraction of significant SNPs first, then outputs the results into
hapmap_phaseII/analysis/<ihs_sliding_tag>_multi_regions_<pop>.tsv
Stage 15 identifies the top significance scores in every chromosome, collects them into
one file and sorts them, most significant score first. The most significant scores in every
chromosome (| iHS | > 2.5) are stored [unsorted] in the files
hapmap_phaseII/chr<chr>/ihs_significance_<pop>.tsv. The highest scores among all the
chromosomes are written out (sorted) into hapmap_phaseII/ihs_sig_scores_<pop>.tsv. A
list of all the significance scores among all chromosomes is written out to
hapmap_phaseII/ihs_all_significance_<pop>.tsv; this list is useful for looking at the
genomewide distribution of scores, to see whether or not there's a skew in one direction.
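A quick way to look for a skew in the genomewide list is to compare the shares of positive and negative scores, as in this Python sketch. The summary fields are illustrative choices, not output of the pipeline.

```python
import statistics


def skew_summary(scores):
    """Summarize the direction of a list of signed scores."""
    if not scores:
        raise ValueError("no scores given")
    n = len(scores)
    positive = sum(1 for s in scores if s > 0)
    negative = sum(1 for s in scores if s < 0)
    return {
        "n": n,
        "mean": statistics.mean(scores),
        "fraction_positive": positive / n,
        "fraction_negative": negative / n,
    }


print(skew_summary([1.0, -1.0, 2.0, 0.0]))
```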
General notes:
“population” here really refers to a sample from one of the three populations.