Evidence for Malaria Selection of a CR1 Haplotype in Sardinia
Supplemental Information
Table S1. Source of Genotypes
Genotype Source WTCCC (1958 British Birth Cohort) HYPERGENES U Michigan North Shore i-ControlDB (CHOP) i-ControlDB (HGDP) Totals Platform^a Irish English Ashkenazi 1.2M 1.2M 550K 550K 550K 550K 39 367 2 Sardinian 137 58 60 367 120 Arab Norwegian 1 123 98 North South Italian Italian 1 24 148 337 142 23 35 56 22 395 220 58 4 104 63 104
a. Illumina platform used for genotyping.
Table S2. European Haplotypes for CR1 Region Showing Sardinian Selection Signal^a
SNP Allele^b SNP^c BP^d rs4844599 rs4844600 rs12567973 rs12567990 rs11117949 rs12039306 rs11117956 rs11117959 rs6540435 rs4562624 rs7512361 rs12130494 rs4844601 rs10863358 rs6656401 1-205760525 rs11117991 rs6661489 205745852 205745930 205748124 205748308 205748879 205750032 205750982 205751142 205752387 205752588 205752813 205753997 205755850 205757494 205758672 205760525 205760980 205764667 205782363 205804492 205804493 205809879 205814607 205815416 rs35245495 rs7521382 1-205804493 rs3738467 rs17046851 rs650877 Alleles^e Imputation Haplotype Haplotype Haplotype Ancestral P^f A B C Allele^g G A C C C A G A C A C A G C A G C C T G G T T G T G T C T T T G G T T T 0.96 0.95 0.96 0.98 0.97 0.96 0.97 0.98 0.96 0.93 0.93 0.96 0.93 0.93 0.92 0.93 0.96 0.94 C T NA A A A A A G C C G G 0.90 0.94 NA 0.96 0.96 G G G T C A G G C C C T T C G T C C (C) A A (T) G G T G C C T G T A T C T A G G G T T C (T) G C (G) A A T A C C T G T A T A T A G G A T T T (T) G C (G) A A Amino Acid Variants^h G G C C T A G A T C C T T C G T T C T Ile643Thr G C G A G Gln981His rs12757487 rs599948 rs601356 rs614709 rs2274566 rs2274567 rs9659222 rs6687175 rs12034598 rs646817 rs1752688 rs1746659 rs11118131 rs3738468 rs10779330 rs11118133 rs608282 rs11118135 rs4844608 rs4844382 rs594955 rs12141045 rs11118136 rs7539922 rs677066 rs11118147 rs11118157 rs1408079 rs11118166 rs11118167 rs6691117 205815811 205818862 205819173 205819898 205819968 205820244 
205821322 205823858 205824138 205824559 205824855 205825167 205827819 205828981 205829297 205829867 205832472 205837051 205837199 205837312 205837315 205838777 205839031 205839962 205840614 205841505 205844942 205846671 205848618 205848777 205849554 A C G C C A A A A A A A C A A A A A A A C C A A C A A A A C A G T T T T G G G G G G T T G G T T G G G T T G G T G C T G T G 0.96 0.99 0.97 0.99 0.98 0.99 0.99 0.98 0.99 0.99 0.99 0.99 1.00 1.00 1.00 NA 1.00 0.99 0.99 0.99 0.99 1.00 0.99 0.99 1.00 0.99 1.00 0.99 0.99 0.99 1.00 A C G C C G G G G G G A T A A (T) A G G A T C G A C G A T G C G G T T T T A A A A A A T C G G (A) T A A G C T A G T A C A A T A G T T T C A A A A A A T C G G (A) T A A G C T A G T A C A A T A A C G C C A A A A G G A C G G A T A A G T T G G T A A A G T G His1167Arg Ile1574Val rs12032275 rs3818361 rs7519119 rs12403552 rs7542544 rs6701713 rs2093761 rs2093760 rs10429953 rs11576522 rs10429943 rs3811381 rs12036785 205850130 205851591 205852775 205852793 205852846 205852912 205853165 205853451 205855201 205855892 205856094 205856711 205859532 C A A A A A A A A A C C A T G G G C G G G G G T G C 0.97 1.00 0.99 0.99 0.99 1.00 1.00 0.98 0.99 0.99 0.99 NA 0.99 T G G A C G G G G A T G C C G A G A G G G A G C (C) A C A G G C A A A A A C (C) C C G G A C G G G A A C C C Pro1786Arg
a. The haplotypes were derived from both genotyped and imputed SNPs. Genotyped SNPs are indicated in the probability column by a probability of 1.0; imputed SNPs are indicated by posterior probabilities of <1.0. Imputation was performed using Impute V2.0 [1], and the posterior probabilities were used to determine the haplotypes. The haplotypes in this table were estimated with Beagle software [2] from the posterior probability output for each possible genotype from Impute V2.0. In Sardinians the frequencies of these exact haplotypes were: A, 0.581; B, 0.230; and C, 0.125. These frequencies are similar to those derived using Haploview (Figure 2 in manuscript).
b. 
The SNP allele for each haplotype is shown. Alleles for SNPs that were used in previous studies but were neither genotyped nor imputed in the current study are shown in parentheses. These SNPs were not included in the quality-filtered 1000 Genomes preliminary results (June 2010 release). The most likely alleles for these SNPs are based on haplotypes inferred in previous studies [3].
c. The SNP rs number is shown; for SNPs without rs numbers, the chromosome and bp position (HG18) are shown.
d. The bp position is shown for NCBI build 36.3 (HG18).
e. Forward alleles.
f. Posterior probabilities for imputation assessment from Impute V2.0 analyses or genotypes. NA (not available) is shown for alleles based on earlier studies (see footnote b). Where P = 1.0, the SNP was genotyped and all individuals had complete genotypes.
g. The ancestral allele, based on the chimpanzee genotype, is listed. The ancestral status was obtained from the UCSC Browser (http://genome.ucsc.edu/) and NCBI dbSNP (http://www.ncbi.nlm.nih.gov/snp) compilations.
h. Amino acid variants are shown for SNPs with nonsynonymous changes. In each case the variant amino acid corresponds to the SNP allele in haplotype A and is listed after the amino acid position. Positions are given without the 41-amino-acid leader sequence; many older references give positions that include the leader sequence. The positions without the leader sequence and the corresponding positions with the leader sequence are as follows: Ile643Thr, Ile684Thr; Gln981His, Gln1022His; His1167Arg, His1208Arg; Ile1574Val, Ile1615Val; and Pro1786Arg, Pro1827Arg.
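The correspondence in footnote h between positions without and with the leader sequence is a fixed offset of 41 residues. The short Python sketch below illustrates that arithmetic; the function name `with_leader` and the three-letter label format it parses are assumptions made for illustration only.

```python
import re

LEADER_LENGTH = 41  # residues in the leader sequence, as stated in footnote h

# Labels such as "Ile643Thr": three-letter residue, position, three-letter residue.
_VARIANT = re.compile(r"^([A-Za-z]{3})(\d+)([A-Za-z]{3})$")


def with_leader(variant: str) -> str:
    """Convert a variant label numbered without the leader sequence
    to the numbering that includes the leader sequence."""
    match = _VARIANT.match(variant)
    if match is None:
        raise ValueError(f"unrecognized variant label: {variant!r}")
    ref, position, alt = match.groups()
    return f"{ref}{int(position) + LEADER_LENGTH}{alt}"


if __name__ == "__main__":
    for label in ("Ile643Thr", "Gln981His", "His1167Arg", "Ile1574Val", "Pro1786Arg"):
        print(label, "->", with_leader(label))
```

Applied to the five variants in Table S2, this offset gives the leader-inclusive labels listed in footnote h.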
Supplemental Note
For SNP rs2274567, assuming a hard sweep model in which the selected allele starts as a new mutation or at very low frequency through a founder effect, and assuming 25 years per generation, we used haplotype sharing in the Sardinian population to estimate:
1) Selection intensity s = 0.0054 (95% confidence interval: 0.0052 - 0.0056)
2) Allele age t = 45101 years (95% confidence interval: 43949 - 46253)
Method for estimating allele age and selection intensity:
We estimated the allele age and selection intensity of a selective sweep using the following procedure:
1. Allele age is estimated from the extended haplotype homozygosity following the method of Voight et al. [4]. This model assumes that the decay of haplotype homozygosity follows a Poisson process:
$$\Pr(\text{homozygosity}) = e^{-2rg}$$
where Pr(homozygosity) is the probability that two haplotypes are homozygous at a distance r from the selected mutant, and g is the number of generations. As in Voight et al. [4], we choose Pr = 0.25, and the time in years is t = 25g.
2. After obtaining the point estimate of the allele age, we estimate the selection intensity by assuming a deterministic logistic sweep model, in which the frequency of the selected allele follows the logistic differential equation, and thus
$$f = \frac{f_0}{f_0 + (1 - f_0)\,e^{-st}}$$
where f is the frequency of the selected allele in the current generation, f_0 is the initial allele frequency before selection, and t is the allele age. Because we assume a hard sweep model in which selection starts with a new mutant, the initial allele frequency is set to 0.0001.
3. Confidence intervals for allele age and selection intensity are estimated by bootstrap.
Supplemental Information References
1. Howie BN, Donnelly P, Marchini J. A flexible and accurate genotype imputation method for the next generation of genomewide association studies. PLoS Genet 2009; 5(6): e1000529.
2. Browning BL, Browning SR. 
A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am J Hum Genet 2009; 84(2): 210-23.
3. Xiang L, Rundles JR, Hamilton DR, Wilson JG. Quantitative alleles of CR1: coding sequence analysis and comparison of haplotypes in two ethnic groups. J Immunol 1999; 163(9): 4939-45.
4. Voight BF, Kudaravalli S, Wen X, Pritchard JK. A map of recent positive selection in the human genome. PLoS Biol 2006; 4(3): e72.
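The two point estimates described in the Supplemental Note can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the closed-form solution of the logistic equation is assumed, time in the logistic model is assumed to be measured in generations, and the bootstrap step for confidence intervals is omitted. The genetic distance and allele frequency in the usage lines are placeholders, not values from the study.

```python
import math

YEARS_PER_GENERATION = 25  # generation time assumed in the Supplemental Note


def allele_age_generations(r, homozygosity=0.25):
    """Solve Pr(homozygosity) = exp(-2 * r * g) for g.

    r is the distance (assumed here to be in Morgans) at which the
    haplotype homozygosity has decayed to the chosen threshold.
    """
    if r <= 0 or not 0.0 < homozygosity < 1.0:
        raise ValueError("r must be positive and homozygosity must lie in (0, 1)")
    return -math.log(homozygosity) / (2.0 * r)


def logistic_frequency(f0, s, g):
    """Allele frequency after g generations of a deterministic logistic sweep."""
    return f0 / (f0 + (1.0 - f0) * math.exp(-s * g))


def selection_intensity(f, g, f0=1e-4):
    """Invert the logistic solution for s, given current frequency f and age g."""
    if not (0.0 < f0 < f < 1.0) or g <= 0:
        raise ValueError("require 0 < f0 < f < 1 and g > 0")
    return math.log(f * (1.0 - f0) / (f0 * (1.0 - f))) / g


if __name__ == "__main__":
    r_placeholder = 4e-4  # hypothetical distance, not taken from the study
    f_placeholder = 0.6   # hypothetical current allele frequency
    g = allele_age_generations(r_placeholder)
    print("allele age (generations):", g)
    print("allele age (years):", YEARS_PER_GENERATION * g)
    print("selection intensity:", selection_intensity(f_placeholder, g))
```

Inverting the logistic solution gives s = ln[f(1 - f_0) / (f_0(1 - f))] / g, which is why `selection_intensity` needs only the current frequency, the assumed starting frequency, and the allele age in generations.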