Supplementary Figure Legends

Supplementary Figure S1. Comparison of % sequence identity for putative chimpanzee and human segmental duplications based on an analysis of each whole-genome assembly. An independent whole-genome assembly comparison (WGAC) was performed on a) the working draft version of the chimpanzee genome assembly (Arachne 2001) and b) the near-finished human genome assembly (build 34) to identify self-self pairwise alignments with >90% sequence identity and >1 kb in length. For all pairwise alignments, the total number of aligned bases was calculated and binned by percent sequence identity. Sequence identity distributions for interchromosomal (red) and intrachromosomal (blue) duplications are shown. In total, 136.7 Mb and 152.3 Mb were identified in the chimpanzee and human genome assemblies, respectively. Many more fragmented duplications were observed within the chimpanzee assembly than within the human assembly. This fragmentation was shown to be an artifact of the genome assembly process, based on a comparison with finished chimpanzee genome sequence (ref. 1). In addition, an excess of chimpanzee alignments with high (>98%) sequence identity was observed. Such artifactual duplications are not uncommon in working draft alignments and result from a failure to merge allelic overlaps, especially in regions of low sequence diversity (ref. 2). Despite these limitations, chimpanzee WGAC duplications cluster within the genome (Supplementary Fig. S7) and overlap considerably with validated human genome duplications. Thus, chimpanzee WGAC duplications may be used to identify potential duplicated regions, but the descriptive statistics (% identity, length) are not reliable as the sole determinant of duplication status.

Supplementary Figure S2. In silico estimate of cWSSD false positive rate vs. length of duplication interval.
We conservatively defined “false positive” WSSD regions as duplication intervals where an excess of whole-genome shotgun sequence reads was detected (WSSD) but where there was no evidence of overlap by any other measure (hWGAC, hWSSD or cWGAC). The fraction of such solo WSSD regions was computed for each length interval relative to the total number of cWSSD regions of that length. A variety of WSSD trimming parameters, repeat content and quality thresholds were compared (Filters 1-4, see below). Irrespective of the applied filter, a maximum false positive rate of ~1.4% was calculated for cWSSD regions greater than 20 kb in length. Filter thresholds were as follows:
Filter 1: alignment length/read length = 0.4, unique sequence read length >= 300 bp, >200 high-quality bases (Phred Q>27), trim = 1/5 of 5 kb;
Filter 2: alignment length/read length = 0.4, unique sequence read length >= 300 bp, >200 high-quality bases (Phred Q>27), trim = 99.5% of 1 kb threshold;
Filter 3: alignment length/read length = 0.8, unique sequence read length >= 50 bp, >400 high-quality bases (Phred Q>27), trim = 99.5% of 1 kb threshold;
Filter 4: alignment length/read length = 0.8, unique sequence read length >= 200 bp, >100 high-quality bases (Phred Q>27), trim = 99.5% of 1 kb threshold.

Supplementary Figure S3. In silico estimate of cWSSD false negative rate vs. length of the duplication interval. We defined chimpanzee WSSD false negatives as those regions that were detected by the intersection of cWGAC, hWGAC and hWSSD but which showed no evidence of duplication by cWSSD. Various length intervals (hWGAC) were considered, and the ratio of cWSSD-negative regions to the total number of intersections of cWGAC, hWGAC and hWSSD was computed. Only WGAC alignments that showed >94% sequence identity were considered as part of this analysis. Filter parameters were as described in Supplementary Fig. S2.
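The two in silico definitions (Supplementary Figs. S2 and S3) can be illustrated with a minimal sketch. It assumes that intervals are (chromosome, start, end) tuples on a common coordinate system, that any overlap of at least 1 bp counts as supporting evidence, and that the intersection of cWGAC, hWGAC and hWSSD has been computed beforehand. The function names and toy intervals are hypothetical; this is not the pipeline used in the study.

```python
from typing import Sequence, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open coordinates


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two intervals share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def has_support(region: Interval, track: Sequence[Interval]) -> bool:
    """True if the region overlaps any interval in the track."""
    return any(overlaps(region, other) for other in track)


def false_positive_rate(cwssd, other_tracks, min_len: int) -> float:
    """Fraction of cWSSD regions >= min_len bp with no overlap in any other track."""
    regions = [r for r in cwssd if r[2] - r[1] >= min_len]
    if not regions:
        return 0.0
    solo = [r for r in regions if not any(has_support(r, t) for t in other_tracks)]
    return len(solo) / len(regions)


def false_negative_rate(expected, cwssd, min_len: int) -> float:
    """Fraction of expected regions >= min_len bp that cWSSD fails to overlap.

    `expected` stands in for the precomputed intersection of cWGAC, hWGAC and hWSSD.
    """
    regions = [r for r in expected if r[2] - r[1] >= min_len]
    if not regions:
        return 0.0
    missed = [r for r in regions if not has_support(r, cwssd)]
    return len(missed) / len(regions)


# Toy intervals (hypothetical, for illustration only).
cwssd = [("chr1", 0, 30000), ("chr1", 100000, 125000), ("chr2", 0, 5000)]
hwgac = [("chr1", 10000, 20000)]
hwssd = []
cwgac = []
expected = [("chr1", 5000, 28000), ("chr3", 0, 40000)]

fp = false_positive_rate(cwssd, [hwgac, hwssd, cwgac], min_len=20000)
fn = false_negative_rate(expected, cwssd, min_len=20000)
print(fp, fn)
```

In this toy case, one of the two cWSSD regions of at least 20 kb lacks support from any other track, and one of the two expected regions is missed by cWSSD, so both rates are 0.5.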
Filters 3 and 4 showed the lowest false negative rates owing to the recruitment of more chimpanzee whole-genome shotgun sequence data. With these parameters, we estimate a false negative rate of ~6.5% at a length threshold of 20 kb. The false negative rate does not reach 0 at any length because of the treatment of repeat sequences. In our analysis, we penalized heavily for the presence of recent common repeat elements (primate-specific L1P elements). If a region consisted of 50% of such repeats, it was eliminated. This was necessary to reduce false positives due to the lineage-specific expansion of different retroposon subfamilies. Therefore, segmental duplications that consist largely of L1P or HERV sequences would be eliminated from this analysis.

Supplementary Figure S4. Validated chimpanzee-specific duplications by FISH. a, b, c) Three examples of validated chimpanzee duplications where multiple hybridization signals are observed within chimpanzee metaphase chromosomes but not human metaphase chromosomes (see Table 2). Human probe names (RPCI-11 BAC designations) and phylogenetic chromosome nomenclature are indicated.

Supplementary Figure S5. A global view of chimpanzee and human segmental duplication intervals. The schematic shows the distribution of chimpanzee-only (red), human-only (blue) and shared (black) duplicated sequence between human and chimpanzee. Three different perspectives are shown: a global view of the largest duplication intervals mapped against human chromosome ideograms (X and Y chromosomes are not shown due to insufficient coverage), a chromosomal view of human chromosomes 5 and 6, and two regional views showing a gene-poor region duplicated in chimpanzee in contrast to the gene-rich spinal muscular atrophy (SMA) region of human chromosome 5. We determined that there was a 10.3-fold enrichment for chimpanzee-only (CO) and human-only (HO) duplications to map in proximity to sequence that was duplicated in both species.

Supplementary Figure S6.
Array comparative genomic hybridization between human and chimpanzee (Clint) genomes. A full-tiling path microarray of 32,855 human BAC clones (ref. 3) was analyzed by array comparative genomic hybridization using DNA from an anonymous human blood donor as reference and Clint chimpanzee DNA as test DNA. A reverse-label replicate experiment was performed, and the average log2 ratios (T/R) were plotted at the corresponding coordinate for each BAC. Regions with three or more consecutive BACs with log2 ratio >0.5 (green) or three or more consecutive BACs with log2 ratio <-0.5 (pink) are highlighted. Note the global decrease in chimpanzee signal intensity for pericentromeric regions when compared to human. The experiment was repeated with three unrelated chimpanzees with essentially the same result (data not shown).

Supplementary Figure S7. Chromosome views of chimpanzee and human segmental duplications. Four independent duplication analyses were performed and the duplication coordinates mapped back against the human genome reference sequence for each chromosome. These include: human whole-genome assembly analysis of duplications (hWGAC, blue track), human whole-genome shotgun sequence detection analysis of duplications (hWSSD, black track), chimpanzee whole-genome assembly analysis of duplications (cWGAC, purple track) and chimpanzee whole-genome shotgun sequence detection analysis (cWSSD, light blue track) (refs 2, 4, 5). Chimpanzee tracks were mapped directly to the human genome (cWSSD) or mapped back by combined liftover and BLAST sequence similarity searches (cWGAC) (see Methods). In the case of the latter, if the chimpanzee sequence mapped to two or more regions with an equivalent score, both were flagged as duplicated. Whole-genome assembly comparisons map all segmental duplications (>90% identity and >1 kb in length), while WSSD analyses target all regions (>94% identity and >10 kb in length). Previous studies have shown that duplications with <96% sequence identity are reliably predicted by WGAC within draft genome assemblies.
Chromosomes are divided into 1 Mb sections with 100 Mb increments corresponding to genome coordinates (July 2003 Assembly). Above and below the tracks, the average % sequence identity of cWSSD quality-corrected alignments and the depth of coverage (5 kb windows) are depicted, respectively. The latter provides an approximation of copy number for each region. Twenty-four images are depicted, one for each human chromosome. In addition, we considered separately those regions of the chimpanzee genome that could not be mapped back to the human genome assembly and the finished chimpanzee chromosome 22 (corresponding to human chromosome 21). In these latter cases, coordinates are based on the chimpanzee genome assemblies.

Supplementary Figure S8. Rationale for selection of binary duplicates. a) Numerous studies have shown (ref. 6) that most segmental duplications emerge as a result of a complex multistep duplication process. A simple two-step model is shown in which an initial duplication event is followed, at a later date, by three secondary (larger) duplications, which generate three additional paralogues. This creates a total of 10 pairwise alignments corresponding to only 4 duplication events. The disparity between duplication events and pairwise alignments becomes greater as more duplication steps are invoked. Consequently, the number of pairwise alignments is not a surrogate for the number of duplication events, which complicates frequency estimates of gene conversion and de novo duplication. b) Binary segmental duplications are defined as duplicates where only one predominant pairwise alignment exists. In such a scenario, a pairwise alignment is equivalent to an event. This provides a reasonable estimate of the frequency of gene conversion and de novo duplication. Within the human genome, multistep and binary duplicates may be easily distinguished by considering the total number of underlying hWGAC alignments within a region of segmental duplication.
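The counts quoted for panel a of Supplementary Figure S8 can be checked with a short sketch. It assumes that every pair of paralogues yields exactly one pairwise alignment, so n copies give n(n-1)/2 alignments, and that each duplication event adds exactly one copy. The function names are illustrative only and are not part of the original analysis.

```python
from math import comb


def pairwise_alignments(copies: int) -> int:
    """Number of pairwise alignments among `copies` paralogues (n choose 2)."""
    return comb(copies, 2)


def duplication_events(copies: int) -> int:
    """Events needed to reach `copies` paralogues, assuming one new copy per event."""
    return max(copies - 1, 0)


# Two-step model (panel a): ancestral copy + 1 initial + 3 secondary duplications.
copies = 1 + 1 + 3
print(pairwise_alignments(copies), duplication_events(copies))  # 10 4

# Binary duplicate (panel b): two copies, so one alignment equals one event.
print(pairwise_alignments(2), duplication_events(2))  # 1 1
```

Under these assumptions the number of alignments grows quadratically with copy number while the number of events grows linearly, which is why the two diverge as more duplication steps are invoked.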
Supplementary Figure S9. FISH analysis of a hyperexpanded chimpanzee segmental duplication. A human fosmid DNA clone (WIBR2-1785A6) corresponding to the duplicated region was hybridized to metaphase chromosomes and interphase nuclei from a series of primates: a) human, b) common chimpanzee, c) bonobo, d) gorilla, e) orangutan and f) macaque.

References
1. Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature (submitted) (2005).
2. Bailey, J. A., Yavor, A. M., Massa, H. F., Trask, B. J. & Eichler, E. E. Segmental duplications: organization and impact within the current human genome project assembly. Genome Res 11, 1005-17 (2001).
3. Krzywinski, M. et al. A set of BAC clones spanning the human genome. Nucleic Acids Res 32, 3651-60 (2004).
4. She, X. et al. Shotgun sequence assembly and recent segmental duplications within the human genome. Nature 431, 927-30 (2004).
5. Bailey, J. A. et al. Recent segmental duplications in the human genome. Science 297, 1003-7 (2002).
6. Samonte, R. V. & Eichler, E. E. Segmental duplications and the evolution of the primate genome. Nat Rev Genet 3, 65-72 (2002).