Analysis of HapMap Phase II

Raw Data

The data for our analysis comes from two sources:

- The phase II (release 21) phased HapMap "all" and "chimp" datasets (http://www.hapmap.org/downloads/phasing/2006-07_phaseII/; the chimp dataset is not public yet, so we are using the files at /fg/wgas2/hapmap/rel21/phased/).
- The preliminary cM distance maps derived from this data, given to us by Simon Myers (located at /cvar/research/smyers/HapMapPhase2/HapMapphaseII/genetic_map/rate_all.tar.gz).

The "all" dataset has phased data for all three populations, including SNPs that were not genotyped in all populations. The SNP positions are given with respect to NCBI build 35.1 (hg17, May 2004), and the phased haplotypes are all relative to that build's forward strand. (The unphased data is available with respect to both the hg17 forward strand and the rs record of dbSNP, but we don't use the unphased data.) The SNPs are coded as 0 and 1; the associated "legend" file can be used to translate a 0/1 into an A/C/G/T base.

The "chimp" dataset contains only SNPs that were genotyped in all populations, encoded as 0 = ancestral and 1 = derived. [To check: how the ancestral state was determined; something along the lines of "alignment with chimp genome".]

The cM distance maps specify the recombination rate between successive genotyped SNPs (as far as I can tell, only SNPs that are polymorphic in at least one population are included in this list). The columns are:

- Column 1: the SNP used in the analysis.
- Columns 2-4: the recombination rate for each population between the SNP listed on that line and the one on the next line.
- Column 5: the mean of the three population-specific recombination rates.
- Column 6: the genetic distance from the first SNP in the file to the current one.
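To make the column layout above concrete, here is a minimal Python sketch that parses one row of such a map and checks the mean column against the three population rates. The whitespace-delimited layout, the absence of a header line, and the names MapRow, parse_map_row and mean_is_consistent are assumptions made for illustration; the actual files in rate_all.tar.gz may differ.

```python
from typing import NamedTuple, Tuple


class MapRow(NamedTuple):
    snp: str                           # column 1: SNP used in the analysis
    rates: Tuple[float, float, float]  # columns 2-4: per-population rates to the next SNP
    mean_rate: float                   # column 5: mean of the three rates
    genetic_distance: float            # column 6: distance from the first SNP in the file


def parse_map_row(line: str) -> MapRow:
    """Parse one six-column row (assumed to be whitespace-delimited)."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}")
    r1, r2, r3, mean_rate, dist = (float(x) for x in fields[1:])
    return MapRow(fields[0], (r1, r2, r3), mean_rate, dist)


def mean_is_consistent(row: MapRow, tol: float = 1e-6) -> bool:
    """Check that column 5 really is the mean of the three population rates."""
    return abs(sum(row.rates) / 3.0 - row.mean_rate) <= tol


# Made-up values, purely illustrative (not taken from the real map files).
row = parse_map_row("rs0000001 0.30 0.60 0.90 0.60 0.0")
```

A consistency check of this kind is a cheap way to confirm that the columns have been read in the intended order before the rates are used downstream.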
Three populations are included in the HapMap, with equal numbers of males and females:

- CEU (CEPH): Utah residents of European descent. 90 people genotyped in mother-father-child trios for phasing, yielding 60 independent samples (mothers and fathers) = 120 chromosomes.
- YRI: Yorubans from Western Africa. 90 people in trios = 120 chromosomes.
- JPT+CHB: 45 completely independent samples each from the Japanese (JPT) and Han Chinese (CHB) populations = 180 chromosomes. Phasing is a lot more error-prone in this population, because there are no trios to inform it.

There are 22 autosomes for which the data layout is straightforward. The data for the X and Y chromosomes is more complicated. Two portions of the Y chromosome are homologous with the X chromosome, and so are diploid in all the samples. The bulk of the X chromosome, however, is haploid in all the males. Thus, the X chromosome is divided into three segments:

- par1: bases 90,118 to 2,688,701 (~1500 SNPs).
- non-par: bases 2,693,518 to 154,493,116 (~110,000 SNPs). In the non-par files, the haplotype for the males is a series of dashes ('-'), not 0s and 1s.
- par2: bases 154,532,542 to 154,821,623 (~230 SNPs).

The non-homologous portion of the Y chromosome is not genotyped in the HapMap: since it doesn't recombine, it's not useful for analyses that depend on LD breakdown.

General framework

The script "do_analysis.sh" in the hapmap_phaseII/scripts folder guides the analysis of the HapMap data. It breaks the analysis down into 16 stages, many of which depend on previous stages having completed. When possible, jobs are submitted to LSF instead of being run at the command line. Call the script from within the sweep2 folder as follows:

    hapmap_phaseII/scripts/do_analysis.sh <stage_num> [--pops <pop list>] [--chroms <chrom list>] [--X] [--to-X | --to-genome]

The first 10-20 lines of the script specify the options for all the different tests to be performed, and are described in more detail below.
The --pops option specifies which populations are to be analyzed: by default, the CEU, YRI and JPT+CHB populations are analyzed. The --chroms option does the same for chromosomes: its default value is "1 2 3 ... 21 22 23". The --X option is a shortcut for "--pops 'CEU_X YRI_X JPT+CHB_X' --chroms 23".

It's unclear whether the X chromosome should be compared against the autosomal regions of the genome or against itself (a lot more selection has occurred on X, which argues that the distribution of scores on X is not neutral). The --to-X option compares the X chromosome against itself, whereas the --to-genome option compares X against the genome.

Stage 1: Importing the data [--pops option ignored]

    hapmap_phaseII/scripts/do_analysis.sh 1 [--chroms <chrom list>]

Options:

    phased_file_root: the folder where the HapMap phase II "all" files are located.
    chimp_file_root: the folder where the HapMap phase II "chimp" files are located.
    recomb_rates_root: the folder where the HapMap phase II "recombination rates" files are located.

Stage 1 creates a Sweep project for each chromosome in the genome inside hapmap_phaseII/data. For the 22 autosomes and the par1/par2 regions of X (chrom 23), the CEU data is imported into population "CEU", YRI into "YRI" and JPT+CHB into "JPT+CHB". For the non-par region of X, the CEU data is imported into population "CEU_X", YRI into "YRI_X" and JPT+CHB into "JPT+CHB_X". The distinction is important because Sweep requires all genotyped SNPs in the same population to have the same number of samples, and the males are haploid in the non-par region of X.
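The naming rule above can be sketched in Python as follows. The segment boundaries are the hg17 coordinates quoted earlier in this document; the function names, and the decision to return None for positions that fall in the gaps between segments, are assumptions made for illustration and are not part of do_analysis.sh or Sweep.

```python
from typing import Optional

# X chromosome segments (hg17), as listed in this document: (name, start, end).
X_SEGMENTS = [
    ("par1", 90_118, 2_688_701),
    ("non-par", 2_693_518, 154_493_116),
    ("par2", 154_532_542, 154_821_623),
]


def x_segment(position: int) -> Optional[str]:
    """Return the X segment containing position, or None if it is outside all three."""
    for name, start, end in X_SEGMENTS:
        if start <= position <= end:
            return name
    return None


def sweep_population(pop: str, chrom: int, position: int) -> str:
    """Population name under which a SNP would be imported in stage 1.

    Autosomes and par1/par2 keep the plain name; non-par X gets the "_X" suffix.
    """
    if chrom == 23 and x_segment(position) == "non-par":
        return pop + "_X"
    return pop
```

For example, sweep_population("CEU", 23, 5_000_000) gives "CEU_X", while the same call with chrom 1 gives "CEU".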
Because the non-par region of X is held in separate populations, most stages of the analysis have to be run twice, as follows:

Run stage N for the autosomal regions of HapMap:

    hapmap_phaseII/scripts/do_analysis.sh N

Run stage N for X:

    hapmap_phaseII/scripts/do_analysis.sh N --pops "CEU_X YRI_X JPT+CHB_X" --chroms 23

Stages 2-5: XPop Analysis

    hapmap_phaseII/scripts/do_analysis.sh <stage> --pops "<pop1> <pop2>" [--chroms <chrom list>]

Options:

    xpop_options: options for the CrossPopAllEHH command (see Sweep docs).
    compare_X_to_genome: "true" if the background to use for X is the XPop distribution for chromosomes 1-22 (and the par1/par2 segments of X), "false" if it's the non-par segment of X only. If "true", the final analysis file name will contain "vs1-22".

This test considers only SNPs common to <pop1> and <pop2>.

Stage 2 calculates the quantity (Integral AllEHH_<pop1> dx) / (Integral AllEHH_<pop2> dx) starting at every SNP, and extending either to the left or to the right up to the point where AllEHH drops to a certain level. The results are stored in the files hapmap_phaseII/analysis/chr<chr>/xpop_allehh_<pop1>_<pop2>.tsv. Among the table columns in the output, "genes within 50 kb…" comes from RefSeq data, which is automatically downloaded from the UCSC website and put into the data directory under the species (see GeneInfoTable.java).

Stage 3 calculates the background XPop distribution, grouping together all the chromosomes specified by the --chroms option (1-22 + par1/par2 of X by default), and stores the result in hapmap_phaseII/analysis/xpop_background_<pop1>_<pop2>.tsv.

Stage 4 annotates the analysis files from stage 2 with the number of standard deviations that each AllEHH logratio value is from the mean. The results are stored in hapmap_phaseII/chr<chr>/xpop_significance_<pop1>_<pop2>.tsv.

Stage 5 identifies the top significance scores in every chromosome, collects them into one file and sorts them, most significant score first.
The most significant scores in every chromosome (abs(logratio) > 3.5) are stored [unsorted] in the files hapmap_phaseII/chr<chr>/xpop_significance_<pop1>_<pop2>.tsv. The highest scores among all the chromosomes are written out (sorted) into hapmap_phaseII/xpop_sig_scores_<pop1>_<pop2>.tsv. A list of all the significance scores among all chromosomes is written out to hapmap_phaseII/xpop_sig_scores_<pop1>_<pop2>.tsv; this list is useful for looking at the genomewide distribution of scores, to see whether or not there's a skew towards one population.

Stages 6-9, 9.5, 9.75, 9.825 and 9.9375: LRH Analysis

    hapmap_phaseII/scripts/do_analysis.sh <stage>

Options:

    lrh_options: options for the AnalyzeCores command (see Sweep docs).
    num_freq_bins: number of bins to use when calculating the frequency-dependent ln(REHH) background distribution.
    lrh_window_size: size of the windows in which to group LRH scores (for stages 9.5 and 9.75).
    lrh_high_fraction: fraction of SNPs that must be significant in a window to declare selection in that window (for stages 9.5 and 9.75).
    lrh_windowed_threshold_lnrehh_dev: the minimum number of standard deviations above the background distribution mean that an ln(REHH) score must reach to be declared significant.
    lrh_tag: the prefix to attach to the filenames of windowed LRH analysis results.
    lrh_sliding_tag: the prefix to attach to the filenames of sliding window LRH analysis results.

Stage 6 runs the LRH analysis on every SNP of every chromosome of every population, and stores the results in hapmap_phaseII/analysis/chr<chr>/lrh_<pop>.tsv.

Stage 7 calculates the mean and standard deviation of the ln(EHH) and ln(REHH) scores in every frequency bin (using the populations and chromosomes specified by the --chroms and --pops options), and stores the results in hapmap_phaseII/analysis/lrh_background_<pop>.tsv.

Stage 8 annotates the analysis files from stage 6 with ln(EHH) and ln(REHH) significance scores and the associated p-value and -log_10(p-value).
The results are stored in the files hapmap_phaseII/analysis/chr<chr>/lrh_significance_<pop>.tsv.

Stage 9 identifies the top significance scores in every chromosome, collects them into one file and sorts them, most significant score first. The most significant scores in every chromosome (ln(REHH) deviation > 2.5) are stored [unsorted] in the files hapmap_phaseII/chr<chr>/lrh_significance_<pop>.tsv. The highest scores among all the chromosomes are written out (sorted) into hapmap_phaseII/lrh_all_significance_<pop>.tsv. A list of all the significance scores among all chromosomes is written out to hapmap_phaseII/lrh_sig_scores_<pop>.tsv; this list is useful for looking at the genomewide distribution of scores, to see whether or not there's a skew in one direction.

Stage 9.5 partitions each chromosome into windows of size lrh_window_size, and computes the fraction of significance scores in each window that are above lrh_windowed_threshold_lnrehh_dev. The results are stored in hapmap_phaseII/analysis/chr<chr>/<lrh_tag>_multi_regions_<pop>.tsv.

Stage 9.75 collects the per-chromosome results of stage 9.5 into a genomewide dataset, sorts it with the highest fraction of significant SNPs first, then outputs the results into hapmap_phaseII/analysis/<lrh_tag>_multi_regions_<pop>.tsv.

Stage 9.825 partitions each chromosome into windows of size 2*lrh_window_size, overlapping by lrh_window_size, and computes the fraction of significance scores in each window that are above lrh_windowed_threshold_lnrehh_dev. This stage depends on stage 9.5 having been run. The results are stored in hapmap_phaseII/analysis/chr<chr>/<lrh_sliding_tag>_multi_regions_<pop>.tsv.
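The relationship between the fixed windows of stage 9.5 and the sliding windows of stage 9.825 can be sketched as follows. This is a minimal Python illustration, not the code used by do_analysis.sh: the function names, the (position, deviation) input format, and the alignment of windows at multiples of the window size are all assumptions. It shows one way the dependence on stage 9.5 could work, since each double-width window is simply the union of two adjacent fixed windows.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Counts = Dict[int, Tuple[int, int]]  # window index -> (n_significant, n_total)


def window_counts(scores: Iterable[Tuple[int, float]], window_size: int,
                  threshold: float) -> Counts:
    """Stage 9.5 style: tally scores in non-overlapping windows."""
    sig = defaultdict(int)
    total = defaultdict(int)
    for position, deviation in scores:
        w = position // window_size
        total[w] += 1
        if deviation > threshold:
            sig[w] += 1
    return {w: (sig[w], total[w]) for w in total}


def sliding_counts(counts: Counts) -> Counts:
    """Stage 9.825 style: windows of twice the size, overlapping by one window.

    Sliding window w covers fixed windows w and w + 1, so it can be built
    from the stage 9.5 tallies without rereading the scores.
    """
    if not counts:
        return {}
    merged = {}
    for w in range(min(counts), max(counts)):
        s0, t0 = counts.get(w, (0, 0))
        s1, t1 = counts.get(w + 1, (0, 0))
        if t0 + t1 > 0:  # skip windows that contain no scores at all
            merged[w] = (s0 + s1, t0 + t1)
    return merged


def fractions(counts: Counts) -> Dict[int, float]:
    """Fraction of significant scores per window."""
    return {w: s / t for w, (s, t) in counts.items()}


# Made-up scores: (position, ln(REHH) deviation).
scores = [(100, 3.0), (900, 1.0), (1500, 3.2), (2500, 0.5)]
fixed = window_counts(scores, window_size=1000, threshold=2.5)
sliding = sliding_counts(fixed)
```

A window would then be flagged when its fraction reaches lrh_high_fraction. The overlapping windows help catch a cluster of significant SNPs that straddles a fixed window boundary.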
Stage 9.9375 collects the per-chromosome results of stage 9.825 into a genomewide dataset, sorts it with the highest fraction of significant SNPs first, then outputs the results into hapmap_phaseII/analysis/<lrh_sliding_tag>_multi_regions_<pop>.tsv.

Stages 10-15: iHS Analysis

    hapmap_phaseII/scripts/do_analysis.sh <stage>

Options:

    ihs_options: options for the iHS command (see Sweep docs).
    num_freq_bins: number of bins to use when calculating the frequency-dependent iHH logratio background distribution.
    ihs_window_size: size of the windows in which to group iHS scores (for stages 13 and 14).
    ihs_high_fraction: fraction of SNPs that must be significant in a window to declare selection in that window (for stages 13 and 14).
    ihs_windowed_abs_iHS_threshold: the minimum absolute value that an iHS score must have to be declared significant.
    ihs_tag: the prefix to attach to the filenames of windowed iHS analysis results.
    ihs_sliding_tag: the prefix to attach to the filenames of sliding window iHS analysis results.

Stage 10 runs the iHS analysis on every SNP that has ancestral allele information, on every chromosome and in every population, and stores the results in hapmap_phaseII/analysis/chr<chr>/ihs_<pop>.tsv.

Stage 11 calculates the mean and standard deviation of the unstandardised iHS scores in every frequency bin (using the populations and chromosomes specified by the --chroms and --pops options), and stores the results in hapmap_phaseII/analysis/ihs_background_<pop>.tsv.

Stage 12 annotates the analysis files from stage 10 with iHS scores and the associated p-value and -log_10(p-value). The results are stored in the files hapmap_phaseII/analysis/chr<chr>/ihs_significance_<pop>.tsv.

Stage 13 partitions each chromosome into windows of size ihs_window_size, and computes the fraction of iHS scores in each window whose absolute value is above ihs_windowed_abs_iHS_threshold. The results are stored in hapmap_phaseII/analysis/chr<chr>/<ihs_tag>_multi_regions_<pop>.tsv.
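The frequency-binned standardisation performed by stages 11 and 12 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not Sweep's implementation: equal-width frequency bins, the population standard deviation, a two-sided p-value under a standard normal, and all function names are hypothetical choices.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, Iterable, Tuple


def freq_bin(freq: float, num_bins: int) -> int:
    """Map an allele frequency in [0, 1] to one of num_bins equal-width bins."""
    return min(int(freq * num_bins), num_bins - 1)


def background(records: Iterable[Tuple[float, float]],
               num_bins: int) -> Dict[int, Tuple[float, float]]:
    """Stage 11 style: mean and standard deviation of raw scores per bin."""
    by_bin = defaultdict(list)
    for freq, raw in records:
        by_bin[freq_bin(freq, num_bins)].append(raw)
    return {b: (mean(v), pstdev(v)) for b, v in by_bin.items()}


def standardise(freq: float, raw: float,
                bg: Dict[int, Tuple[float, float]], num_bins: int) -> float:
    """Stage 12 style: distance from the bin mean in units of the bin SD."""
    mu, sd = bg[freq_bin(freq, num_bins)]
    if sd == 0:
        return math.nan  # bin too small or too uniform to standardise
    return (raw - mu) / sd


def two_sided_p(z: float) -> float:
    """Two-sided p-value, assuming standardised scores are standard normal."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Made-up (frequency, unstandardised score) pairs.
records = [(0.12, -1.0), (0.15, 1.0), (0.18, 3.0), (0.55, 0.0), (0.58, 2.0)]
bg = background(records, num_bins=10)
z = standardise(0.55, 2.0, bg, num_bins=10)
```

Binning by frequency matters because the spread of unstandardised scores depends on allele frequency, so a single genomewide mean and standard deviation would over- or under-call significance at the extremes.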
Stage 14 collects the per-chromosome results of stage 13 into a genomewide dataset, sorts it with the highest fraction of significant SNPs first, then outputs the results into hapmap_phaseII/analysis/<ihs_tag>_multi_regions_<pop>.tsv.

Stage 14.5 partitions each chromosome into windows of size 2*ihs_window_size, overlapping by ihs_window_size, and computes the fraction of iHS scores in each window whose absolute value is above ihs_windowed_abs_iHS_threshold. This stage depends on stage 13 having been run. The results are stored in hapmap_phaseII/analysis/chr<chr>/<ihs_sliding_tag>_multi_regions_<pop>.tsv.

Stage 14.75 collects the per-chromosome results of stage 14.5 into a genomewide dataset, sorts it with the highest fraction of significant SNPs first, then outputs the results into hapmap_phaseII/analysis/<ihs_sliding_tag>_multi_regions_<pop>.tsv.

Stage 15 identifies the top significance scores in every chromosome, collects them into one file and sorts them, most significant score first. The most significant scores in every chromosome (|iHS| > 2.5) are stored [unsorted] in the files hapmap_phaseII/chr<chr>/ihs_significance_<pop>.tsv. The highest scores among all the chromosomes are written out (sorted) into hapmap_phaseII/ihs_sig_scores_<pop>.tsv. A list of all the significance scores among all chromosomes is written out to hapmap_phaseII/ihs_all_significance_<pop>.tsv; this list is useful for looking at the genomewide distribution of scores, to see whether or not there's a skew in one direction.

General notes: "population" here really refers to a sample from one of the three populations.
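As a closing illustration, the collect-and-sort step of stage 15 can be sketched in a few lines of Python. The in-memory input format and the name top_scores are assumptions; the real stage reads and writes the .tsv files named above.

```python
from typing import Dict, List, Tuple

Score = Tuple[str, int, float]  # (chromosome, position, iHS)


def top_scores(per_chrom: Dict[str, List[Tuple[int, float]]],
               threshold: float = 2.5) -> List[Score]:
    """Keep scores with |iHS| > threshold and sort them, most extreme first."""
    kept = [
        (chrom, pos, ihs)
        for chrom, rows in per_chrom.items()
        for pos, ihs in rows
        if abs(ihs) > threshold
    ]
    return sorted(kept, key=lambda s: abs(s[2]), reverse=True)


# Made-up per-chromosome (position, iHS) rows.
per_chrom = {
    "1": [(1000, 2.7), (2000, -0.4)],
    "2": [(500, -3.1), (900, 2.5)],
}
ranked = top_scores(per_chrom)
```

Sorting on the absolute value keeps strongly negative and strongly positive iHS scores together at the top of the list, which matches the |iHS| > 2.5 criterion used by stage 15.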