Supplementary Information (doc 404K)

advertisement

Supplementary Material

Estimating time to the most recent common ancestor (TMRCA): comparison and application of eight methods

Contents

I Supplementary Methods ..................................................................................................... 2

1.1

Analysis of TMRCA with T-LD and T-FST ............................................................... 2

1.2

1.3

1.4

1.5

Analysis of TMRCA with DADI ................................................................................ 2

Analysis of TMRCA with GPho-CS ........................................................................... 2

Analysis of TMRCA with MIMAR ............................................................................ 3

Analysis of TMRCA with CoalHMM, PSMC and MSMC ........................................ 4

II Simulation Analysis ............................................................................................................ 4

2.1

Command line input to simulate data .......................................................................... 4

2.2

Command line to apply eight methods in simulation .................................................. 5

III TMRCA of Southeast Asian Malays and South Asian Indians ...................................... 7

3.1

Data ............................................................................................................................. 7

3.2

Methods of Estimating Divergence Time of Malay and Indian Population ............... 7

IV Supplementary Tables ................................................................................................... 10

I Supplementary Methods

1.1

Analysis of TMRCA with T-LD and T-FST

In the estimation of the TMRCA between two populations, T-LD and T-FST consider genomic sites that are polymorphic in at least one of the two populations. To minimise the impact of ascertainment bias, the analyses are restricted to SNPs with MAFs ≥ 5% in the combined set of chromosomes from both populations as suggested by McEvoy and colleagues 22 . In the four scenarios considered in the simulation, there were on average 29,038 common SNPs in the simple-isolation model; 29,166 common SNPs in the isolation-migration model; 6,772 common SNPs in the bottleneck-nonbottleneck model; and 7,980 common SNPs in the bottleneck-bottleneck model. For the TMRCA of Malays and Indians, there were 295,317 segregating sites shared by the Malays and Indians.

1.2

Analysis of TMRCA with DADI

DADI estimates TMRCA from the allele frequency spectrum of the variants present in the genomic region. For the simulation study, 500 sequences from each of the two populations were used to derive the allele frequency spectrum, where the outgroup sequence was used to determine the original and derived alleles. We specified three grid sizes (100, 200, 300) to extrapolate to an infinitely fine grid, and we assumed the default setting with an isolation model in all the DADI analyses of the simulation data. In addition, we also applied DADI to the simulation data assuming the specific model setting for the different demographic scenarios, to evaluate how DADI will perform with prior knowledge of the underlying demographic history between the two populations. Specifically, for the bottlenecknonbottleneck scenario in which two populations split at T, where the two populations subsequently have effective population sizes of Ne

1

and Ne s

, where Ne s

decreases exponentially to Ne b

at T b

before increasing exponentially to Ne f

at present, the parameters

(Ne

1

, Ne s

, Ne b

, Ne f

, T b

, T) are estimated simultaneously. Similarly, for the bottleneckbottleneck scenario, two populations split at T, and the i th population has an effective population size of Ne ib

immediately after the split, which increases exponentially to Ne if

at present, for i = 1, 2, the parameters (Ne

1b

, Ne

1f

, Ne

2b

, Ne

2f

, T) are simultaneously estimated.

For the analyses of the TMRCA between Malays and Indians, all 132 samples (96 Malays, 36

Indians) were used to compute the allele frequency spectrum across the 16,681,861 SNPs, where we ran two separate analyses assuming the isolation model and the bottleneckbottleneck model. The bottleneck-bottleneck model adopted the same design as in the analysis of the simulation data, except the project sample size was bounded between 70 and 120 and three grid sizes (140, 180, 200) were used for extrapolation.

1.3

Analysis of TMRCA with GPho-CS

GPho-CS considers neutral loci defined across multiple samples. For the simulations, we consider two haploid sequences from each population. The selected haploid sequences are divided into 10,000 segments each of length 1000 bases, and a constant population size was

assumed. We assumed the absence of migration in three scenarios except that of the isolation-migration model where we ran GPho-CS with and without migration bands. For the analyses of the simulated data, a burn in of 100,000 steps and 200,000 samplings were chosen. For estimating the TMRCA of Malays and Indians, 37,563 neutral one kilobase loci were identified, which removed sites under selection, with low sequencing quality and poor alignment 16 . The filtering criteria included removing simple repeats, recent transposable elements, indels, sites with effective coverage < 5, regions now showing conserved synteny in human/chimpanzee alignments, recent segmental duplications, CpGs and sites likely to be under selection such as exons of protein-coding genes, noncoding RNAs, and conserved noncoding elements. GPho-CS was applied to five haploid sequences at multiple loci, which included two haplotype sequences from SS6002734, two haplotype sequences from

SS6003427, and one chimpanzee reference sequence. The haploid sequences for the two human samples were phased using SHAPEIT 28 against the reference data from Phase 3 of the 1000 Genomes Project 29 . The chimpanzee reference haploid sequence was used to calibrate the mutation rate against the divergence time of 6.5 million years ago for human and chimpanzee, inferring an average mutation rate per site per year of 6.96 × 10 -10 , which is consistent with the literature applying GPho-CS to estimate population divergence time 16 .

1.4

Analysis of TMRCA with MIMAR

In the analysis of the simulation data with MIMAR, we considered one hundred haploid sequences from each of the two populations, where the original and derived alleles were determined from the outgroup population. The selected sequences are segmented into regions each of length 1000 bases, where we selected 900 non-adjoining loci (1-1000 bp,

2001-3000 bp, 4001-5000bp, to 1,798,001-1,799,000 bp) for analysis, and further divided them into 30 subsets in order to control the acceptance rate of the MCMC process to be at least 5% as recommended. The MCMC was run with a burn-in of 100,000 runs, and where we recorded 300,000 samplings afterwards. The default demographic model assumed an isolation model which was applied to all four scenarios in the simulation study, where we further assumed the scaled population mutation rates for the ancestral and two offshoot populations to be sampled from a Uniform[0.0001, 0.002] distribution. The population divergence time in generations was sampled from a Uniform[500, 3000] distribution for three of the four scenarios, except for the bottleneck-nonbottleneck scenario where the population divergence time in generations was assumed to be sampled from a

Uniform[1000, 5000] distribution. Separately, we also applied MIMAR under the same demographic model used to simulate the data. Specifically, for the isolation-migration model, we added a prior for the logarithm of scaled migration record ln(4 N e

m) as a

Uniform[-5, 3] distribution; for the bottleneck-nonbottleneck model, the population size was allowed to decrease exponentially between [T, 0.38 T] years ago at rate 4.5 × 10 -5 , and increasing exponentially between [0.38 T, 0] years ago at rate 1 × 10 -4 ; for the bottleneckbottleneck model, the population size increased exponentially at a rate of 5.8 × 10 -5 immediately after the split.

As MIMAR considers only neutral loci, for the estimation of the TMRCA for Malays and

Indians we extracted 37,563 one kilobase loci following the filtering procedure as suggested by the analysis with GPho-CS 16 . As MIMAR is computationally expensive and cannot handle thousands of loci simultaneously, each chromosome is divided into subsets each containing

30 one-kilobase loci. Similarly we assumed a burn-in of 100,000 runs, recorded 300,000 samplings, with a Uniform[0.0001, 0.002] prior for the population mutation rates of the ancestral population and the two populations (Malay, Indian), and a divergence time in generations distributed as Uniform[500, 3000]. A point estimate is derived for each chromosome as the average across the subsets, and the mean divergence time and corresponding 95% confidence interval were obtained from the point estimates of the 22 autosomal chromosomes.

1.5

Analysis of TMRCA with CoalHMM, PSMC and MSMC

CoalHMM and PSMC consider only two haploid sequences from the two populations. PSMC differs from all the other methods as it does not provide a point estimate for the TMRCA, instead it estimates the effective population size as a step function across time, and the

TMRCA is qualitatively determined as the time point when the effective population size increases to infinity. We adopted an effective population size threshold of 100,000 to determine the TMRCA. MSMC is highly similar to PSMC, except that it allows multiple haploid sequences from a population to be considered, where we apply MSMC to two haploid sequences from each population. While MSMC does not provide a point estimate for TMRCA, it provides a “cross-coalescence rate” which measures the relative gene flow between two populations. This is similarly a step function across time, and takes values between 0 and 1. Cross-coalescence rate decreases from 1 to 0, which translates to a decline in gene flow between two populations. As with PSMC, the estimation of TMRCA from MSMC is qualitatively determined, and we adopted a cross-coalescence rate threshold of 0.5 to identify the TMRCA.

For the estimation of the TMRCA between Malays and Indians, the haploid sequences from the same two individuals (SS6002734, SS6003427) were phased in the manner as described for the analysis with GPho-CS, and were analysed with CoalHMM and PSMC. The same effective population size threshold of 100,000 was used to determine a point estimate for the TMRCA. For the analysis by MSMC, all four phased sequences for the two individuals were used, and a cross-coalescence rate threshold of 0.5 was used to determine the point estimate for the TMRCA.

II Simulation Analysis

2.1

Command line input to simulate data

We use ms program and seq-gen program to generate sequence data with population parameters mu=1e-9, g=25, L=1e7, Ne_ref=1e4, r=5e-9 (rho=4Ne_ref*r*L). The simulation data we used can be downloaded from http://www.statgen.nus.edu.sg/~TMRCA/ . The command line we used are:

Simple Isolation Model

for i in {1..10}; do ms 1001 1 -t 10000 -r 2000 10000000 -I 3 500 500 1 -n 1 1 -n 2 1 -n 3 1 -ej

0.02 2 1 -en 0.02 1 1 -ej 4.1 3 1 -en 4.1 1 1.0 -p 7 > ms_Simple_Isolation_$i.txt ; done &

Isolation Migration Model

for i in {1..10}; do ms 1001 1 -t 10000 -r 2000 10000000 -I 3 500 500 1 -n 1 1 -n 2 1 -n 3 1 -em

0.0 1 2 4 -em 0.0 2 1 4 -ej 0.02 2 1 -eM 0.02 0.0 -en 0.02 1 1.0 -ej 4.1 3 1 -en 4.1 1 1.0 -p 7 > ms_Isolation_Migration_$i.txt ; done &

Bottleneck - Non-bottleneck Model

for i in {1..10}; do ms 1001 1 -t 10000 -r 2000 10000000 -I 3 500 500 1 -n 1 1 -n 2 0.5 -n 3 0.5 -eg

0.0 1 100.0 -eg 0.023 1 -45.0 -ej 0.06 2 1 -eN 0.06 0.5 -ej 4.1 3 1 -en 4.1 1 0.5 -p 7 > ms_Bottleneck_NonBottleneck_$i.txt ; done &

Bottleneck Bottleneck Model

for i in {1..10}; do ms 1001 1 -t 10000 -r 2000 10000000 -I 3 500 500 1 -n 1 1 -n 2 1 -n 3 0.5 -eg

0.0 1 58.0 -eg 0.0 2 58.0 -ej 0.04 2 1 -eG 0.04 0.0 -ej 0.04 2 1 -eN 0.04 0.5 -ej 4.1 3 1 -en 4.1 1 0.5 -p

7 > ms_Bottleneck_Bottleneck_$i.txt ; done &

2.2

Command line to apply eight methods in simulation

2.2.1

Hayes’ method (T-LD)& McEvoy’s method (T-FST)

We wrote some scripts to calculate TMRCA according to Hayes’ method and McEvoy’s method, T-LD-

FST.bash, which can be downloaded from http://www.statgen.nus.edu.sg/~TMRCA/ .

2.2.2

DADI

We wrote python scripts for analyses of DADI:

DADI-Simulation.py and DADI_Simulation_demographic_models.py.

The scripts can be downloaded from http://www.statgen.nus.edu.sg/~TMRCA/ .

The command line we used for analyses are:

Simple Isolation Model (default setting) python DADI-Simulation.py input Isolation-Migration True False > output

Isolation Migration Model (prior information) python DADI-Simulation.py input Isolation-Migration True True > migration.output

Bottleneck - Non-bottleneck Model (prior information) python DADI-Simulation.py input Bottleneck-NonBottleneck True False > Bottleneck-

NonBottleneck.output

Bottleneck Bottleneck Model (prior information) python DADI-Simulation.py input Twopop-Bottleneck-Model True False > Bottleneck-

Bottleneck.output

2.2.3

MIMAR

The command line we used for analyses are:

Simple Isolation Model mimar 300000 100000 30 -lf input -u 2.5e-8 -t u 0.0001 0.002 -n u 0.0001 0.002 -N u 0.0001 0.002 -ej u 500 3000 -v 3e-4 3e-4 300 3e-4 0.25 0.25 -i 10 -r 2e-4 -o exsoutput > output

Isolation Migration Model mimar 300000 100000 30 -lf input -u 2.5e-8 -t u 0.0001 0.002 -n u 0.0001 0.002 -N u 0.0001 0.002 -ej u 500 3000 -v 3e-4 3e-4 300 3e-4 0.25 0.25 -i 10 -r 2e-4 –o exsoutput > output mimar 300000 100000 30 -lf input -u 2.5e-8 -t u 0.0001 0.002 -n u 0.0001 0.002 -N u 0.0001 0.002 -ej u 500 3000 -M l -5 3 -v 3e-4 3e-4 300 3e-4 0.25 0.25 -i 10 -r 2e-4 -o migration.exsoutput > migration.output

Bottleneck - Non-bottleneck Model mimar 300000 100000 30 -lf input -u 2.5e-8 -t u 0.0001 0.002 -n u 0.0001 0.002 -N u 0.0001 0.002 -ej u 1000 5000 -v 3e-4 3e-4 300 3e-4 0.25 0.25 -i 10 -r 2e-4 -o exsoutput > output mimar 300000 100000 30 -lf input -u 2.5e-8 -t u 0.0001 0.002 -n u 0.0001 0.002 -N u 0.0001 0.002 eg 0.0 1 100.0 -ej u 1000 5000 -eg 0.0 1 100.0 -eg 0.38 1 -45.0 -eg 1.0 1 0.0 -v 3e-4 3e-4 300 3e-4

0.25 0.25 -i 10 -r 2e-4 -o BN.exsoutput > BN.output

Bottleneck_Bottleneck mimar 300000 100000 30 -lf input -u 2.5e-8 -t u 0.0001 0.002 -n u 0.0001 0.002 -N u 0.0001 0.002 -ej u 500 3000 -v 3e-4 3e-4 300 3e-4 0.25 0.25 -i 10 -r 2e-4 -o exsoutput > output mimar 300000 100000 30 -lf input -u 2.5e-8 -t u 0.0001 0.002 -n u 0.0001 0.002 -N u 0.0001 0.002 -ej u 500 3000 -en 0.0 1 1.0 -en 0.0 2 1.0 -eg 0.0 1 58.0 -eg 0.0 2 58.0 -eg 1.0 1 0.0 -eg 1.0 2 0.0 -en 1.0 1

0.5 -en 1.0 2 0.5 -v 3e-4 3e-4 300 3e-4 0.25 0.25 -i 10 -r 2e-4 -o BB.exsoutput > BB.output

2.2.4

GPho-CS

The parameters are set as follows:

Parameter

Burn-in iterations

Sample iterations

Sample-skip

Simulation Study

100 000

200 000

10

Real Data Analysis

100 000

300 000

10 prior for all parameters prior for all m parameters

Prior for divergence time

Prior for divergence time between human and outgroup

Search finetune automatically True True

2.2.5

CoalHMM

All the analyses are applied with the command below and the mean of 1000 MCMC samples are used as the parameter estimation. python isolation-model-mcmc.py -o output --logfile mcmc.log --mc3 -n 1000 --theta 0.001 --rho 0.4 input

2.2.6

PSMC

All the analyses are applied with the command below and the parameters obtained in the 20th iteration is used as the estimation. psmc -N25 -t15 -r5 -p "4+25*2+4+6" -o output.psmc input.psmcfa

2.2.7

MSMC

All the analyses are applied with the command below and the parameters obtained in the 20th iteration is used as the estimation. msmc_linux_64bit --fixedRecombination --skipAmbiguous -P 0,0,1,1 –o output input

III TMRCA of Southeast Asian Malays and South Asian Indians

3.1

Data

3.1.1

Malay Sample From Singapore Genome Variation Project (SGVP):

SSM.chr1.2012_05.bgl.phased.vcf.gz ~ SSM.chr22.2012_05.bgl.phased.vcf.gz and SS6003427.bam

3.1.2

Indian Sample Singapore Genome Variation Project (SGVP): chr1.phasing.vcf.gz ~ chr22.phasing.vcf.gz and SS6003427.bam

3.1.3

Human Genome Reference

Human reference file is downloaded from 1000 Genomes Phase III FTP: human_g1k_v37.fasta.gz

3.1.4

panTro2 reference chr1.hg19.panTro2.synNet.axt.gz ~ chr22.hg19.panTro2.synNet.axt.gz are download from UCSC: http://hgdownload.cse.ucsc.edu/goldenpath/hg19/vsPanTro2/syntenicNet/ .

3.2

Methods of Estimating Divergence Time of Malay and Indian Population

3.2.1

Hayes’ method, McEvoy’s method

We used phased variants data from SGVP for Hayes’ method, McEvoy’s method. The scripts used to calculate T-LD and T-FST are the same as the scripts used in simulation.

3.2.2

DADI

We used phased variants data from SGVP for DADI. We wrote python scripts for analyses of DADI:

DADI-Malay-Indian.py and demographic_models.py.

The scripts can be downloaded from http://www.statgen.nus.edu.sg/~TMRCA/ .

The command line we used for analyses are: python DADI-Malay-Indian.py chr{$CHR}.input Isolation-Migration True False > DADI-Malay-Indianchr{$CHR}-Isolation.output python DADI-Malay-Indian.py chr{$CHR}.input Twopop-Bottleneck-Model True False > DADI-Malay-

Indian-chr{$CHR}-Twopop-Bottleneck-Model_nomigration.output

3.2.3

MIMAR

We used phased variants data from SGVP for MIMAR. The command line we used to run MIMAR is: mimar 300000 100000 30 -lf chr{$CHR}.input -u 2.5e-8 -t u 0.0001 0.002 -n u 0.0001 0.002 -N u

0.0001 0.002 -ej u 500 3000 -v 3e-4 3e-4 300 3e-4 0.25 0.25 -i 10 -o MIMAR-Malay-Indianchr{$CHR}.exsoutput > MIMAR-Malay-Indian-chr{$CHR}.output

3.2.4

CoalHMM

We used phased variants data from SGVP for CoalHMM and we use human reference panel to reconstruct the haploid sequence. The command line we used to run CoalHMM is: isolation-model-mcmc.py -o CoalHMM-Malay-Indian-chr{$CHR}.mcmc --logfile CoalHMM-Malay-

Indian-chr{$CHR}.mcmc.log --mc3 --split 0.00001 --theta 0.001 chr{$CHR}.input;

3.2.5

GPho-CS

The neutral sites list as well as filter bed files are downloaded from http://compgen.bscb.cornell.edu/GPhoCS/data.php

.

We used samtools 1.0 to pileup the raw reads and call genotypes by bcftools. The command is: samtools mpileup -C 50 -u -r <chr> -f <ref.fa> <bam> | bcftools –cgI

We used shapeit program to phase the variants with command shown below. For those positions that are not present in the 1000GP reference panel, we randomly assign the haplotypes to two haploid sequences. shapeit -V tmp.vcf -M genetic_map_chr${CHR}_combined_b37.txt --input-ref

1000GP_Phase3_chr$CHR.hap.gz 1000GP_Phase3_chr$CHR.legend.gz 1000GP_Phase3.sample -O shapeit.out --exclude-snp alignments.snp.strand.exclude --no-mcmc shapeit -convert --input-haps shapeit.out --output-vcf phased.vcf

As suggested by GPhoCS developer, we filtered out simple repeats, recent transposable elements, indels, sites with effective coverage < 5, regions now showing conserved synteny in

human/chimpanzee alignments, recent segmental duplications, CpGs and sites likely to be under selection such as exons of protein-coding genes, noncoding RNAs, and conserved noncoding elements.

The parameters used for GPho-CS is given in Section 2.2.4.

3.2.6

PSMC

We used samtools 1.0 to pileup the raw reads and call genotypes by bcftools. samtools mpileup -C50 -uf ref.fa aln.bam | bcftools view -c - | vcfutils.pl vcf2fq -d 10 -D 60 | gzip > diploid.fq.gz

We phased the heterozygotes with the same fashion as GPho-CS and obtain one haploid sequence.

The pseudo-diploid sequence is constructed from two haploid sequences which each from one population.

3.2.7

MSMC

We follow the pipeline of MSMC to call variants and phase the data according to 1000 Genome Project Phase

III as reference.

IV Supplementary Tables

Supplementary Table 1. Mean error rates (MERs) and corresponding 95% confidence intervals.

Simulation T_LD mean_se_SI

CI.dn_SI

CI.up_SI mean_se_IM

CI.dn_IM

CI.up_IM mean_se_BN

CI.dn_BN

CI.up_BN mean_se_BB

-8.9%

-13.3%

T_FST

-0.5%

-5.5%

-4.6%

-24.0%

-28.2%

-19.8%

4.4%

-9.7%

-13.4%

-5.9%

-13.0%

-36.5%

-46.3%

-48.9%

10.5% -43.6%

-12.0% -14.0%

MIMAR

MIMARprior

15.7% NA

9.3% NA

DADI

DADIprior

-1.2% NA

-4.2% NA

22.1% NA

0.2% 44.3%

-3.5%

3.9%

41.7%

47.0%

-15.7%

-25.1%

-6.4%

-7.9%

-19.6%

-22.7%

1.8% NA

-17.0%

-18.5%

-15.5%

23.8%

-15.9%

-16.5% 63.5%

1.1% 110.0%

-0.1%

-5.7%

5.6%

14.0%

5.5%

22.5%

50.0%

CoalHMM Gpho-CS

-33.8%

-65.3%

-5.9%

-12.7%

0.9%

2.5%

-15.5%

-55.1%

-2.4% 24.0%

13.6% -52.1%

-19.2% -89.9%

46.4% -14.3%

4.3%

PSMC

1e6

0.6%

-16.0%

17.2%

37.5%

5.8%

69.2%

-35.3%

-5.8% -46.2%

14.5% -24.4%

23.3% -25.6%

PSMC

1e5

29.4%

-7.7%

PSMC

5e4

40.1%

4.7%

MSMC

-9.5%

-36.7%

66.5%

57.9%

75.5% 17.6%

64.5% -16.4%

17.8% 26.6% -24.6%

98.0% 102.5% -8.1%

-23.2%

-35.7%

-10.8%

-10.3%

-16.9%

-28.4%

-5.4%

-7.9%

-10.9%

-25.7%

4.0%

-2.4%

CI.dn_BB -24.7% -18.6% -11.6% -3.1% 98.6% 44.0% -12.2% 2.5% -37.8% -25.6% -20.7% -11.1%

CI.up_BB 0.7% -9.3% -4.3% 5.2% 121.4% 56.1% 17.3% 44.0% -13.4% 5.0% 4.8% 6.4%

Note: SI, IM, BN, BB represents (i) simple isolation model, (ii) isolation migration model, (iii) bottleneck-nonbottleneck model and (iv) bottleneck-bottleneck model, respectively. CI.dn and CI.up represents the upper and lower boundary of 95% confidence interval. –prior represents the results when proper demographic model is used in scenario (ii)-(iv). We includes the results of PSMC with three different thresholds, 1e6, 1e5 and 5e4.

Supplementary Table 2. TMRCA of Southeast Asian Malays and South Asian Indians and corresponding 95% confidence interval estimated by eight methods.

TMRCA of Southeast Asian

Malays and South Asian Indians T_LD T_FST MIMAR DADI.SI DADI.BB CoalHMM Gpho-CS

PSMC

1e6

PSMC

1e5

PSMC

5e4 MSMC mean 24173 59429 32535 54358 44512 17546 6594 20591 20715 36824 27508

CI.dn 23136 56242 31304 50205 20767 15156 5652 19963 20011 31726 25220

CI.up 25211 62615 33766 58511 68256 19936 7537 21218 21419 41922 29796

Note: CI.dn and CI.up represents the upper and lower boundary of 95% confidence interval. We includes the results of DADI with two demographic model, simple isolation model and bottleneck-bottleneck model in column DADI.SI and DADI.BB respectively. We includes the results of PSMC with three different thresholds, 1e6, 1e5 and 5e4.

Download