Supplemental Information Methods: Generating the sequence data: Our experiment started with ~30,000 human genes annotated on Celera’s human genome version R26k. Of these thirty thousand genes, we de-selected 2,925 pseudogenes and 905 genes in the MCH region on chromosome 6 or on the Y chromosome; we selected 25,605 genes to undergo primer design. Primers were designed using Primer 3 (Steve Rosen and Helen J. Skaletsky (2000) Primer3 on the WWW for general users and for biologist programmers. In: Krawetz S, Misener S (eds) Bioinformatics Methods and Protocols: Methods in Molecular Biology. Humana Press, Totowa, NJ, pp 365-386). We required an amplicon size between 200550bp and that the target coding sequence must be at least 30bp from the ends of the amplicon. Of the approximately twenty-five thousand genes selected for primer design, we were able to cover an average of 92+/-2% of the coding sequence across 30 different molecular functions. We were not able to design amplicons targeting the coding sequence of 2,242 genes (8.8%). Genes failing primer design were not clustered among the different molecular functions. This left us with 202,078 primer pairs covering some portion of 23,363 human coding regions. The breakdown of primer pairs per gene is given in Table I. The identity and source of the DNA samples amplified is given in Table II (end of supplementary section). All Coriell samples were from the “Apparently healthy non-fetal tissue” NIGMS collection. The DNAs sequenced are unique and distinct from DNA panels 1 and 2 used by the Seattle SNPs group (http://pga.gs.washington.edu/platemaps.html#table1). Table I. Number of primer pairs per gene Percent of Number of primer pairs 23,363 genes per gene 45% Between 1-5 25% Between 6-10 11% Between 11-15 17% Between 16-70 2% More than 70 Sequencing coverage: Forward and reverse sequencing traces for each human amplicon were base called and assembled using PhredPhrapConsed package (Ewing B, Green P: Basecalling of automated sequencer traces using phred. II. Error probabilities. Genome Research 8:186-194 (1998)). If at least 10 traces gave high quality sequencing data (>200 contiguous bases of phred>20 scores) and assembled with the reference amplicon sequence, the amplicon set entered the SNP detection pipeline. Table III shows the total number of coding base pairs that were screened in different numbers of individuals. Two-thirds of the coding bases in 23,363 were screened in at least 35 DNAs, 82% of the coding bases were screened in at least 10 DNAs. Approximately 3,000 genes did not yield any good quality sequence data. On average 29 individuals were screened for each coding base; an average of 35 individuals were screened for each base that was sequenced successfully in at least 5 DNAs. One third of the genes had >90% of their coding sequence 1 covered in at least 35 individuals (Figure 1). 80% of the genes had >70% of their coding sequence covered in at least 10 individuals (Figure 1). Based on these numbers we could calculate the probability of detecting a SNP of a particular allele frequency in the population. For bases screened in at least 35 individuals we had 75% power to detect minor allele frequencies >2% and >99% power to detect alleles greater than 5%. For bases screened in at least 10 individuals we had 30% power to detect minor allele frequencies >2%, 60% power to detect alleles greater than 5%, and >99% power to detect minor allele frequencies >25%. Table III. Number of human individuals screened for SNPs coding bases coding bases with primers base pairs 29,975,619 27,583,680 Coding bases screened in at least 10 DNAs Coding bases screened in at least 20 DNAs Coding bases screened in at least 30 DNAs Coding bases screened in at least 35 DNAs 24,609,384 23,572,449 21,719,876 19,732,672 % of coding bases 92% 82% 79% 72% 66% Chimpanzee traces were analyzed independently of the human traces. 80% of amplicons that gave product in human were successfully sequenced in the chimpanzee. 60% %of genes min5 % of genes screened in x samples %of genes min10 %of genes min20 50% %of genes min30 %of genes min35 40% %of genes min39 30% 20% 10% 0% 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% % of coding sequenced analyzed for SNPs Figure 1. Gene distribution of sequence coverage. The 10% bin refers to values >0 and <10% of coding sequence analyzed for SNPs. 2 Detecting the SNPs: Human polymorphisms were detected automatically from assembled sequencing traces using Polyphred 4.0 (Nickerson DA, Tobe VO, Taylor SL. (1997) PolyPhred: automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing. Nucleic Acids Res. 25: 2745-51) and RuleGen, a decision-tree based method (Stephen Glanowski, unpublished) using several sequencing trace/base parameters (signal strength, %GC, amplicon size etc.) Manual calls were employed if potential SNPs were not flagged by both programs. Validation of the automated pipeline using a set of 250 manually called SNPs in the same set of traces (96 amplicons) showed a sensitivity of 85% for all SNPs and up to 100% for SNPs with more than 3 minor alleles observed. Independent verification of several hundred SNPs using TaqMan assays indicated that a validation rate of 95% for common SNPs and 90% for SNPs with only one minor allele observed. Table IV. Comparison of Data Set 1 (this study) to Data Set 2 (described in text). No of snps with No of snps with % of coding bases sequenced in at >2 minor alleles <3 minor alleles least X number of DNAs in Data Set 2 in Data Set 2 avg no of DNA sequenced across the gene coding name region TFPI 24 F3 28 CYP4F3 30 C3 31 IL4 38 IL1F6 39 MC1R 39 TNF 39 5 10 20 30 35 64% 64% 64% 64% 64% 77% 77% 77% 66% 66% 82% 82% 82% 65% 65% 85% 85% 85% 77% 65% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 39 20% 54% 44% 34% 78% 75% 86% 92% present Not present Not in both present in both present Data in Data Data in Data Sets Set 1 Sets Set 1 0 0 0 0 0 1 1 1 3 1 1 3 11 1 5 2 1 0 0 0 2 0 1 0 5 1 4 1 0 0 0 0 False Negative Rate: We compared the coding SNPs from 8 genes that had been sequenced in the SeattleSNPs PGA (http://pga.gs.washington.edu/finished_genes.html) as an independent assessment of our false negative rate. Although the DNAs sequenced are different, the number of individuals sequenced is comparable (39 humans in this paper (referred to as data set 1) compared to 48 by SeattleSNPs (data set 2). Furthermore, both data sets contain roughly equal numbers of African American and European American individuals. In data set 1, four genes (Il4, IL1F6, MC1R, and TNF) were screened in ~38 DNAs for each coding base; the remaining 4 genes (TFPI, F3, CYP4F3, and C3) were screened in ~29 DNAs (data set 1 average). In summary, 91% (41/45) of SNPs with >2 minor alleles observed in data set 2 were also identified in data set 1. 3 All genes analyzed in this study have a corresponding entry in the Refseq 9.0 database with at least 95% nucleotide sequence identity and co-localization to build 34 of the Human Genome Project. SNPs occurring in regions of the Celera gene for which there was no Refseq support were omitted from mkprf analysis as were fixed nucleotide differences between all humans and the chimpanzee in these regions. Furthermore, 12.9% of genes have at least one synonymous or nonsynonymous SNP (total of 2,359 SNPs) in the Celera gene database in regions with Refseq support that were excluded from our analysis. (We include details as to which genes have more SNPs than we used in our analysis in the supplementary on-line information.) The mkprf algorithm was run with and without the omitted SNPs and genes in which omitting non-synonymous SNPs would lead to erroneous inference of positive selection were excluded from the final set of genes reported and used in the molecular function and biological processes section. Omitting non-synonymous SNPs from genes that show a signal of weak negative selection is conservative, since it reduces the posterior probability that the selection coefficient is below 0. Such genes were not removed the analysis and are noted in the supplementary table with McDonald-Kreitman cell entries and estimated selection coefficients. The omitted SNPs do not alter the main conclusions of the paper nor lead to gross differences in overall percentages. 4 Table II. DNAs sequenced for polymorphisms. DNA name ethnicity 4X0033 NA00131 NA14548 NA14448 NA12593 NA10959 NA09947 NA08587 NA05920 NA02254 NA01990 NA01954 NA01953 NA01814 NA01805 NA00946 NA00893 NA00607 NA00546 NA00333 NA10924 NA14672 NA14665 NA14663 NA14661 NA14649 NA14632 NA14535 NA14532 NA14529 NA14511 NA14508 NA14503 NA14501 NA14480 NA14476 NA14474 NA14464 NA14454 NA14439 not known European American European American European American European American European American European American European American European American European American European American European American European American European American European American European American European American European American European American European American European American African American African American African American African American African American African American African American African American African American African American African American African American African American African American African American African American African American African American African American taxon source Southwest National Primate Research Pan trodoglytes Center Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Coriell Cell Repositories Homo sapiens Homo sapiens Coriell Cell Repositories gender male female female female female female female female female female female female female female female female female female female female female female female female female female female female female female female female female female female female female female female female 5