Supplementary Figure Legends - Word file

Supplementary Figure Legends
Supplementary Figure S1. Comparison of % sequence identity for putative
chimpanzee and human segmental duplications based on an analysis of
each whole-genome assembly.
An independent whole-genome assembly
comparison (WGAC) was performed on a) the working draft version of the
chimpanzee genome assembly (Arachne 2001) and b) the near-finished human
genome assembly (build 34) to identify self-self pairwise alignments with >90%
sequence identity and >1 kb in length. For all pairwise alignments, the total number of aligned
bases was calculated and binned based on percent sequence identity. Sequence
identity distributions for interchromosomal (red) and intrachromosomal (blue)
duplications are shown. In total, 136.7 Mb and 152.3 Mb of duplicated sequence
were identified in the chimpanzee and human genome assemblies, respectively. Many more
fragmented duplications were observed within the chimpanzee assembly than within
the human assembly. This fragmentation was shown to be an artifact of the genome assembly
process based on a comparison with finished chimpanzee genome sequence 1. In
addition, an excess of high (>98%) sequence identity chimpanzee alignments
was observed. Such artifactual duplications are not uncommon in working draft
alignments and result from a failure to merge allelic overlaps, especially in regions
of low sequence diversity 2. Despite these limitations, chimpanzee WGAC
duplications cluster within the genome (Supplementary Fig. S7) and overlap
considerably with validated human genome duplications. Thus, chimpanzee
WGAC duplications may be used to identify potentially duplicated regions, but their
descriptive statistics (% identity, length) are not reliable as the sole determinant of
duplication status.
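The binning described in this legend can be sketched as follows. This is a minimal illustration, not the pipeline used for the figure: the tuple layout, the 1% bin width and the function name `bin_aligned_bases` are assumptions made for the example; only the >90% identity and >1 kb cutoffs come from the legend.

```python
from collections import defaultdict


def bin_aligned_bases(alignments, bin_width=1.0, min_identity=90.0, min_length=1000):
    """Sum aligned bases per percent-identity bin, split by duplication type.

    Each alignment is a tuple (chrom_a, chrom_b, aligned_bases, percent_identity).
    This tuple layout is hypothetical and chosen only for the illustration.
    """
    totals = {"inter": defaultdict(int), "intra": defaultdict(int)}
    for chrom_a, chrom_b, aligned_bases, identity in alignments:
        # Keep only alignments with >90% identity and >1 kb, as in the legend.
        if identity <= min_identity or aligned_bases <= min_length:
            continue
        # Same chromosome on both sides means an intrachromosomal duplication.
        kind = "intra" if chrom_a == chrom_b else "inter"
        # Lower edge of the identity bin that contains this alignment.
        bin_start = min_identity + bin_width * int((identity - min_identity) // bin_width)
        totals[kind][bin_start] += aligned_bases
    return {kind: dict(bins) for kind, bins in totals.items()}


example = [
    ("chr1", "chr1", 5000, 98.7),
    ("chr1", "chr5", 2000, 98.2),
    ("chr2", "chr2", 3000, 98.1),
    ("chr3", "chr7", 800, 95.0),   # dropped: not longer than 1 kb
    ("chr4", "chr4", 4000, 89.5),  # dropped: identity not above 90%
]
binned = bin_aligned_bases(example)
```

Plotting the "inter" and "intra" totals per bin would give the two distributions that the legend describes as red and blue.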
Supplementary Figure S2. In silico estimate of cWSSD false positive rate
vs. length of duplication interval. We conservatively defined “false positive”
WSSD regions as duplication intervals where an excess of whole-genome
shotgun sequence reads was detected (WSSD) but where there was no
evidence of overlap by any other measure (hWGAC, hWSSD or cWGAC). The
proportion of such solo WSSD regions was computed for each length interval relative to
the total number of cWSSD regions of that length. A variety of WSSD trimming
parameters, repeat content and quality thresholds were compared (Filters 1-4,
see below). Irrespective of the applied filter, a maximum false positive rate of
~1.4% was calculated for cWSSD regions greater than 20 kb in length. Filter
thresholds were as follows. Filter 1: alignment length/read length = 0.4, unique
sequence read length >= 300 bp, >200 high-quality bases (Phred Q>27), trim
= 1/5 of 5 kb. Filter 2: alignment length/read length = 0.4, unique sequence read
length >= 300 bp, >200 high-quality bases (Phred Q>27), trim = 99.5% of 1 kb
threshold. Filter 3: alignment length/read length = 0.8, unique sequence read
length >= 50 bp, >400 high-quality bases (Phred Q>27), trim = 99.5% of 1 kb
threshold. Filter 4: alignment length/read length = 0.8, unique sequence read
length >= 200 bp, >100 high-quality bases (Phred Q>27), trim = 99.5% of 1 kb
threshold.
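The false positive estimate described above can be sketched in a few lines. This is an illustrative sketch only: regions are assumed to be half-open (chrom, start, end) tuples, and the function names and length bins are hypothetical rather than taken from the original analysis.

```python
def overlaps(region, others):
    """True if region shares at least one base with any interval in others."""
    chrom, start, end = region
    return any(c == chrom and s < end and start < e for c, s, e in others)


def solo_fraction_by_length(cwssd, support_sets, length_bins):
    """Fraction of cWSSD regions per length bin with no supporting overlap.

    support_sets holds the other call sets (hWGAC, hWSSD, cWGAC).
    A bin with no cWSSD regions maps to None.
    """
    result = {}
    for lo, hi in length_bins:
        in_bin = [r for r in cwssd if lo <= r[2] - r[1] < hi]
        if not in_bin:
            result[(lo, hi)] = None
            continue
        # "Solo" regions are supported by none of the other measures.
        solo = [r for r in in_bin if not any(overlaps(r, s) for s in support_sets)]
        result[(lo, hi)] = len(solo) / len(in_bin)
    return result


cwssd = [("chr1", 0, 25000), ("chr1", 100000, 130000),
         ("chr2", 0, 12000), ("chr2", 50000, 61000)]
hwgac = [("chr1", 10000, 20000)]
hwssd = [("chr2", 5000, 8000)]
cwgac = []
bins = [(10000, 20000), (20000, float("inf"))]
rates = solo_fraction_by_length(cwssd, [hwgac, hwssd, cwgac], bins)
```

In this toy input, one of the two regions in each length bin lacks support, so both bins give a solo fraction of 0.5. Running the same calculation once per filter setting would give one curve per filter.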
Supplementary Figure S3. In silico estimate of cWSSD false negative rate
vs. length of the duplication interval. We defined chimpanzee WSSD false
negatives as those regions that were detected by the intersection of cWGAC,
hWGAC and hWSSD but which showed no evidence of duplication by cWSSD.
Various length intervals (hWGAC) were considered and the ratio of cWSSD
negative regions to the total number of intersections of cWGAC, hWGAC and
hWSSD was computed. Only WGAC alignments which showed >94% sequence
identity were considered as part of this analysis. Filter parameters were as
described in Supplementary Fig. S2. Filters 3 and 4 showed the lowest false
negative rates owing to the recruitment of more chimpanzee whole-genome shotgun
sequence data. With these parameters, we estimate a false negative rate of
~6.5% at a length threshold of 20 kb. The false negative rate does not reach zero
at any size because of the treatment of repeat sequences. In our analysis, we
penalized heavily for the presence of recent common repeat elements
(primate-specific L1P elements). If a region consisted of 50% of such repeats, it was
eliminated. This was necessary to reduce false positives due to the lineage-specific
expansion of different retroposon subfamilies. Therefore, segmental
duplications that consist largely of L1P or HERV sequences would be eliminated
from this analysis.
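A small sketch can show why the repeat penalty keeps the false negative rate above zero. Everything here is an assumption made for illustration: the record fields (`length`, `identity`, `excess_depth`, `repeat_fraction`), the function names, and the reading of the length threshold as a minimum length. Only the >94% identity cutoff and the 50% repeat cutoff come from the legend.

```python
def cwssd_detects(region, max_repeat_fraction=0.5):
    """Model a cWSSD call: excess read depth AND not dominated by recent repeats."""
    return region["excess_depth"] and region["repeat_fraction"] < max_repeat_fraction


def false_negative_rate(regions, min_length, min_identity=94.0):
    """Fraction of well-supported regions (cWGAC, hWGAC and hWSSD) missed by cWSSD.

    Returns None when no region passes the length and identity cutoffs.
    """
    eligible = [r for r in regions
                if r["length"] >= min_length and r["identity"] > min_identity]
    if not eligible:
        return None
    missed = sum(1 for r in eligible if not cwssd_detects(r))
    return missed / len(eligible)


regions = [
    {"length": 30000, "identity": 97.0, "excess_depth": True, "repeat_fraction": 0.1},
    # Excess depth is present, but the region is removed by the repeat filter.
    {"length": 25000, "identity": 96.0, "excess_depth": True, "repeat_fraction": 0.7},
    {"length": 22000, "identity": 95.0, "excess_depth": False, "repeat_fraction": 0.2},
    # Excluded from the estimate: identity is not above 94%.
    {"length": 40000, "identity": 93.0, "excess_depth": False, "repeat_fraction": 0.0},
    {"length": 12000, "identity": 98.0, "excess_depth": True, "repeat_fraction": 0.0},
]
rate_20kb = false_negative_rate(regions, min_length=20000)
rate_10kb = false_negative_rate(regions, min_length=10000)
```

The second record is counted as a false negative purely because of its repeat content, which is the effect the legend describes for duplications made up largely of L1P or HERV sequence.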
Supplementary Figure S4: Validated chimpanzee-specific duplications by
FISH. a, b, c) Three examples of validated chimpanzee duplications where
multiple hybridization signals are observed within chimpanzee metaphase but not
human metaphase chromosomes (see Table 2). Human probe names (RPCI-11
BAC designations) and phylogenetic chromosome nomenclature are indicated.
Supplementary Figure S5. A global view of chimpanzee and human
segmental duplication intervals. The schematic shows the distribution of
chimpanzee-only (red), human-only (blue) and shared duplicated sequence
(black) between human and chimpanzee. Three different perspectives are
shown: a global view of the largest duplication intervals mapped against human
chromosome ideograms (X and Y chromosomes are not shown due to
insufficient coverage), a chromosomal view of human chromosomes 5 and 6, and
two regional views showing a gene-poor region duplicated in chimpanzee in
contrast to the gene-rich spinal muscular atrophy region (SMA) of human
chromosome 5. We determined that there was a 10.3-fold enrichment for
chimpanzee-only (CO) and human-only (HO) duplications to map in proximity to
sequence that was duplicated in both species.
Supplementary Figure S6. Array comparative genomic hybridization
between human and chimpanzee (Clint) genomes. A full-tiling path
microarray of 32,855 human BAC clones 3 was analyzed by array comparative
genomic hybridization using DNA from an anonymous human blood donor as the
reference and Clint chimpanzee DNA as the test DNA. A reverse-label replicate
experiment was performed and the average log2 ratios (T/R) were plotted at the
corresponding coordinate for each BAC. Regions with three or more
consecutive BACs with log2 ratios >0.5 (green) or < -0.5 (pink) are highlighted.
Note the global decrease in chimpanzee signal
intensity for pericentromeric regions when compared to human. The experiment
was repeated with three unrelated chimpanzees with essentially the same result
(data not shown).
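The highlighting rule for consecutive BACs can be sketched as follows. This is a hedged illustration, not the software used for the figure. It assumes that both replicates are already expressed as log2(T/R), that the lower cutoff is the symmetric value -0.5, and that BACs are listed in genome order. The function names and the "gain"/"loss" labels are hypothetical.

```python
def average_replicates(forward, reverse):
    """Average per-BAC log2(T/R) values from two replicates.

    Assumes the reverse-label replicate has already been converted to T/R.
    """
    return [(f + r) / 2.0 for f, r in zip(forward, reverse)]


def flag_runs(log2_ratios, threshold=0.5, min_run=3):
    """Return (start, end, label) for runs of consecutive BACs past the cutoff.

    Indices are positions in the ordered BAC list; end is exclusive.
    """
    def classify(value):
        if value > threshold:
            return "gain"   # more signal in the test (chimpanzee) sample
        if value < -threshold:
            return "loss"   # less signal in the test (chimpanzee) sample
        return None

    # A trailing None closes any run that reaches the last BAC.
    labels = [classify(v) for v in log2_ratios] + [None]
    runs = []
    run_start, run_label = 0, None
    for i, label in enumerate(labels):
        if label != run_label:
            if run_label is not None and i - run_start >= min_run:
                runs.append((run_start, i, run_label))
            run_start, run_label = i, label
    return runs


ratios = [0.1, 0.7, 0.8, 0.6, 0.0, -0.6, -0.9, 0.2, -0.7, -0.8, -0.6]
highlighted = flag_runs(ratios)
```

In this toy input the two BACs at positions 5 and 6 are below the cutoff but are not reported, because the run is shorter than three clones.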
Supplementary Figure S7. Chromosome views of chimpanzee and human
segmental duplications. Four independent duplication analyses were
performed and the duplication coordinates mapped back against human genome
reference sequence for each chromosome. These include: human whole-genome
assembly analysis of duplications (hWGAC, blue track), human whole-genome
shotgun sequence detection analysis of duplications (hWSSD, black
track), chimpanzee whole-genome assembly analysis of duplications (cWGAC,
purple track) and chimpanzee whole-genome shotgun sequence detection
analysis (cWSSD, light blue track) 2,4,5. Chimpanzee tracks were mapped directly
to the human genome (cWSSD) or mapped back by combined liftover and
BLAST sequence similarity searches (cWGAC) (see Methods). In the case of
the latter, if the chimpanzee sequence mapped to two or more regions with an
equivalent score, both were flagged as duplicated. Whole-genome assembly
comparisons map all segmental duplications (>90% identity and >1 kb in length), while
WSSD analyses target all regions (>94% identity and >10 kb in length). Previous
studies have shown that duplications with <96% identity are reliably predicted by WGAC
within draft genome assemblies. Chromosomes are divided into 1 Mb sections
with 100 Mb increments corresponding to genome coordinates (July 2003
Assembly). Above and below the tracks, the average % sequence identity of
cWSSD quality-corrected alignments and depth of coverage (5 kb windows) are
depicted, respectively. The latter provides an approximation of copy-number for
each region. Twenty-four images are depicted, one for each human
chromosome. In addition, we considered separately those regions of the
chimpanzee genome that could not be mapped back to the human genome
assembly and the finished chimpanzee chromosome 22 (corresponding to
human chromosome 21). In these latter cases, coordinates are based on the
chimpanzee genome assemblies.
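The rule for chimpanzee sequences that map back to more than one human location can be sketched as follows. This is a minimal illustration under stated assumptions: hits are hypothetical (region, score) pairs, and "equivalent score" is read as equality within an optional tolerance, which is a choice made for this example rather than a documented parameter.

```python
def flag_equivalent_hits(hits, tolerance=0.0):
    """Return the human regions to flag as duplicated for one chimpanzee sequence.

    hits is a list of (region, score) pairs. If two or more hits share the best
    score (within tolerance), all of them are flagged. Otherwise nothing is
    flagged by this rule.
    """
    if not hits:
        return []
    best = max(score for _, score in hits)
    top = [region for region, score in hits if best - score <= tolerance]
    return top if len(top) >= 2 else []


tied = flag_equivalent_hits([
    ("chr1:100-200", 950),
    ("chr5:300-400", 950),
    ("chr9:1-100", 700),
])
unique = flag_equivalent_hits([("chr1:100-200", 950), ("chr9:1-100", 700)])
```

A sequence with a single best placement is left unflagged here. Whether it counts as duplicated would then depend on the other tracks.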
Supplementary Figure S8. Rationale for selection of binary duplicates: a)
Numerous studies have shown 6 that most segmental duplications emerge as a
result of a complex multistep duplication process. A simple two-step model is
shown whereby an initial duplication event occurs and is then followed, at a
later date, by three secondary (larger) duplications, which generate three
additional paralogues. This creates a total of 10 pairwise alignments
corresponding to only 4 duplication events. The disparity between duplication
events and pairwise alignments becomes greater as more duplication steps are
invoked. Consequently, the number of pairwise alignments is not a surrogate for
the number of duplication events and complicates frequency estimates of gene
conversion and de novo duplication. b) Binary segmental duplications are
defined as duplicates where only one predominant pairwise alignment exists. In
such a scenario, a pairwise alignment is equivalent to an event. This provides a
reasonable estimate of the frequency of gene conversion and de novo
duplication. Within the human genome, multistep and binary duplicates may be
easily distinguished by consideration of the total number of underlying hWGAC
alignments within a region of segmental duplication.
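The arithmetic behind the two-step model can be checked with a short sketch. It assumes that every copy of the originally duplicated segment aligns to every other copy, so n copies give n(n-1)/2 pairwise alignments. The function names are hypothetical.

```python
def pairwise_alignments(copies):
    """Number of pairwise alignments among n copies of a segment: n(n-1)/2."""
    return copies * (copies - 1) // 2


def two_step_model(secondary_events):
    """Return (duplication events, pairwise alignments) for the two-step model.

    The initial duplication yields 2 copies. Each secondary duplication is
    assumed to add exactly one further copy of the original segment.
    """
    copies = 2 + secondary_events
    events = 1 + secondary_events
    return events, pairwise_alignments(copies)


multistep = two_step_model(3)  # the model in panel a
binary = two_step_model(0)     # a binary duplicate as in panel b
```

With three secondary duplications there are five copies, hence 4 events but 10 alignments, matching the legend. With no secondary events there is one event and one alignment. Alignments grow quadratically with copy number while events grow linearly, which is why the disparity widens as more steps are invoked.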
Supplementary Figure S9. FISH analysis of hyperexpanded chimpanzee
segmental duplication: A human fosmid DNA clone (WIBR2-1785A6)
corresponding to the duplicated region was hybridized against metaphase
chromosomes and interphase nuclei from a series of primates, including a) human,
b) common chimpanzee, c) bonobo, d) gorilla, e) orangutan and f) macaque.
References
1. Consortium. Initial sequence of the chimpanzee genome and comparison with the
human genome. Nature (submitted) (2005).
2. Bailey, J. A., Yavor, A. M., Massa, H. F., Trask, B. J. & Eichler, E. E. Segmental
duplications: organization and impact within the current human genome project
assembly. Genome Res 11, 1005-17 (2001).
3. Krzywinski, M. et al. A set of BAC clones spanning the human genome. Nucleic
Acids Res 32, 3651-60 (2004).
4. She, X. et al. Shotgun sequence assembly and recent segmental duplications
within the human genome. Nature 431, 927-30 (2004).
5. Bailey, J. A. et al. Recent segmental duplications in the human genome. Science
297, 1003-7 (2002).
6. Samonte, R. V. & Eichler, E. E. Segmental duplications and the evolution of the
primate genome. Nat Rev Genet 3, 65-72 (2002).