Additional methods and discussion
Generating the Draft Genome Sequence
Screening for contamination in the sequence. Decontamination of the sequence reads was
carried out in multiple stages:
(1) In initial partial assemblies, we looked for unexpected clustering and for plates of reads
whose coverage by the rest of the assembly was anomalous (too low or too high). Such
suspicious plates were aligned against probable contaminants.
(2) After the final assembly, we discarded supercontigs whose reads tended to be plate-mates to
reads residing in the bottom 5% of the assembly. More specifically:
(a) We scored each physical plate by assigning it the ratio
(# of reads on plate whose supercontig has length ≥1 Mb) / (# of reads on plate whose supercontig has length <1 Mb).
(b) We scored each read by assigning it the corresponding plate score.
(c) We scored each supercontig by assigning it the median of its reads' scores.
(d) We discarded supercontigs having score <10. There were 446 such supercontigs, of
which 66% had score 0. They contained a total of 11,850 reads, 81% of which were
accounted for by the 18 largest of the discarded supercontigs, all of which had score 0.
Note: We also checked for single-center contigs, having probability <10–6, but all had
already been discarded. (A sketch of the plate-scoring procedure in step (2) appears at the
end of this subsection.)
(3) Delete supercontigs which have at least seven reads, all from the same library. Delete
supercontigs that together with all supercontigs linked to them have at least seven reads, all
from the same library. Repeat these steps on the assembly, prior to deletion of tiny
supercontigs, and translate back.
(4) Suppose that all the reads from a supercontig come from a single center, and moreover that
all the reads in all the supercontigs it links to come from the same center. Then delete the
supercontig. This was applied to the assembly prior to the deletion of tiny supercontigs, and
the list was then translated to the final (reduced) assembly.
(5) Remove suspected contaminants based on alignments to human sequence. This included
looking for regions of the human genome that had too many mouse reads aligning to them.
Almost all the contamination was removed by (1). The remaining steps removed a total of 1,902
contigs, totaling 3.6 Mb of sequence. Very little was removed by (5), although the method of (5)
was used to tune the methods (2), (3), and (4). In short, the decontamination methods were almost
entirely synthetic: based on internal inconsistencies rather than alignment to specific
contaminants. In spite of these precautionary steps, we note that the unanchored part of the
assembly is necessarily enriched for errors of many kinds.
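To make the plate-scoring heuristic of step (2) concrete, the following is a minimal sketch in Python. It assumes each read record carries a physical-plate identifier and a supercontig assignment; the data layout and the handling of plates with no small-supercontig reads are illustrative choices, not a description of the actual pipeline.

    # Sketch (Python) of the plate-based contamination score of step (2).
    from collections import defaultdict
    from statistics import median

    def plate_scores(reads, supercontig_len, big=1_000_000):
        """reads: list of (read_id, plate_id, supercontig_id)."""
        on_big = defaultdict(int)    # reads per plate in supercontigs >= 1 Mb
        on_small = defaultdict(int)  # reads per plate in supercontigs < 1 Mb
        for _, plate, sc in reads:
            (on_big if supercontig_len[sc] >= big else on_small)[plate] += 1
        # Plate score = (# reads in large supercontigs) / (# in small ones).
        # A plate with no small-supercontig reads is unsuspicious: score it high.
        return {p: on_big[p] / on_small[p] if on_small[p] else float("inf")
                for p in set(on_big) | set(on_small)}

    def suspicious_supercontigs(reads, supercontig_len, cutoff=10):
        scores = plate_scores(reads, supercontig_len)
        per_sc = defaultdict(list)   # supercontig -> its reads' plate scores
        for _, plate, sc in reads:
            per_sc[sc].append(scores[plate])
        # Supercontig score = median of its reads' scores; flag if < cutoff.
        return [sc for sc, s in per_sc.items() if median(s) < cutoff]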
Genome size: Euchromatic genome size was estimated by looking at the scaffolds and captured
gaps, which suggests a genome size of 2.5 Gb. A small fraction of the unanchored part of the
assembly will also contribute to the euchromatic size of the genome. It is currently hard to
estimate exactly how much of the unanchored sequence will fall in uncaptured gaps and at
centromeric and telomeric ends that contribute to the euchromatic part of the genome, but we
believe the majority will fall in captured gaps or in heterochromatic regions. Thus, we suggest
that the genome size is 2.5 Gb or slightly larger.
Comparison to Mural et al. Chromosome 16. Finished sequence used for comparison: B6
BACs AC079043, AC079044, AC083895, AC087541, AC087556, AC087840, AC087899,
AC087900, AC098735, AC098883, and 129 BACs AC000096, AC003060, AC003062,
AC003063, AC003066, AC005816, AC005817, AC006082, AC008019, AC008020, AC010001,
AC012526. We note that the BACs used for evaluation purposes in Mural et al.1 were actually
from Chromosome 6.
Conservation of Synteny Between Mouse and Human Genomes
Identification of orthologous landmarks. Full genomic alignments of the masked mouse
(MGSCv3) and human (NCBI build 30) assemblies were carried out using the PatternHunter
program2.
Only those alignments that were:
(1) high scoring, i.e., scoring ≥40 according to a standard additive scoring scheme (match = +1,
mismatch = –1, gap open = –5, gap extend = –1); and
(2) bidirectionally unique at this scoring threshold
were used to identify orthologous landmarks in both genomes.
Identification of syntenic blocks and segments. We first identified syntenic blocks; from these
we derived the collection of syntenic segments. Geometrically, syntenic blocks correspond to
rectangular regions in the mouse/human dot plots, while segments are curves with clear
directionality within each rectangle. Syntenic blocks are therefore defined by interchromosomal
discontinuities, while syntenic segments are determined by intrachromosomal rearrangements,
typically inversions.
A syntenic block of size X:
(1) is a contiguous region of at least size X along a mouse chromosome that is paired to a
contiguous region in a human chromosome, also of size X or larger;
(2) for which all interruptions by other chromosomal regions (in either genome) are less than size
X. Size can be measured either in terms of genomic extent in bases or as the number of
consecutive orthologous landmarks.
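As an illustration of this definition, the sketch below coalesces an ordered list of anchors along one mouse chromosome, each labeled with its human chromosome, into blocks at a given cutoff. Size is counted in landmarks rather than bases, and the absorption rule is a simplification of the coalescing procedure described next; it is not the actual implementation.

    # Sketch (Python): syntenic blocks at a given cutoff, sizes in landmarks.
    def syntenic_blocks(labels, cutoff):
        # Maximal runs of the same human chromosome.
        runs = []
        for label in labels:
            if runs and runs[-1][0] == label:
                runs[-1][1] += 1
            else:
                runs.append([label, 1])
        # Absorb interruptions shorter than the cutoff into identical flanks.
        merged = True
        while merged:
            merged = False
            for i in range(1, len(runs) - 1):
                left, mid, right = runs[i - 1], runs[i], runs[i + 1]
                if mid[1] < cutoff and left[0] == right[0] != mid[0]:
                    left[1] += mid[1] + right[1]
                    del runs[i:i + 2]
                    merged = True
                    break
        return [(label, n) for label, n in runs if n >= cutoff]

    # e.g. syntenic_blocks(list("AAAAABAAACCCC"), 2) -> [('A', 9), ('C', 4)]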
Our methodology for constructing syntenic blocks constructs low-resolution blocks (large
cutoff) from high-resolution blocks. For example, at the highest resolution possible, every
anchoring alignment is allowed to define or interrupt a syntenic block. To then obtain blocks
defined by at least two consecutive landmarks in both genomes, singletons in either genome
would be identified in the highest-resolution list and absorbed into pre-existing larger blocks.
In a similar manner, our methodology will coalesce smaller blocks into larger blocks for any
size cutoff while keeping segment boundaries as stable as possible. In our algorithm, one genome
is selected as the reference genome and determines the order in which the blocks are listed.
However, the blocks themselves are independent of the choice of reference genome. In fact,
changing reference frame from mouse to human provided a non-trivial consistency check on the
construction of the syntenic blocks.
A syntenic segment of size X in mouse:
(1) is always contained within a syntenic block;
(2) exhibits clear directionality, with at least four successive markers in strictly increasing or
decreasing order in both genomes;
(3) is interrupted only by segments smaller than X in mouse.
Note that there is no size restriction placed on the corresponding human extent.
Intrachromosomal rearrangements within syntenic blocks were grouped into syntenic
segments by aggregating at successively coarser scales. More care is required when coalescing
segments, compared to blocks, to ensure that the resulting segments are truly reciprocal (one
mouse region paired to only one human region and conversely).
When defining segments, we excluded isolated outliers that seemed likely to be
attributable to misassemblies or sequencing errors, a typical case being a single misplaced BAC.
However, the fate of every syntenic landmark, including apparent outliers, was kept as part of a
'syntenic roadmap' to facilitate the coordinated, simultaneous navigation of both genomes.
Further details and dot plots can be found at:
http://www-genome.wi.mit.edu/mouse/synteny/index.html
Estimation of the minimal number of rearrangements. The estimate of the number of
rearrangements is based on the Hannenhalli-Pevzner theory for computing a most parsimonious
(minimum number of inversions) scenario to transform one uni-chromosomal genome into
another. This approach was further extended to find a most parsimonious scenario for multichromosomal genomes under inversions, translocations, fusions, and fissions of chromosomes3,4.
We used a fast implementation of this algorithm (ref. 5), available via the GRIMM web server at:
http://www-cse.ucsd.edu/groups/bioinformatics/GRIMM/index.html) to analyze the human–
mouse rearrangement scenario. Although the algorithm finds a most parsimonious scenario, the
real scenario is not necessarily a most parsimonious one, and the order of rearrangement events
within a most parsimonious scenario often remains uncertain. Availability of three or more
mammalian genomes could remedy some of these limitations and provide a means to infer the
gene order in the mammalian ancestor6.
The key element of the Hannenhalli-Pevzner theory is the notion of the breakpoint graph,
which captures the relationships between different breakpoints (versus the analysis of individual
breakpoints in previous studies). The breakpoint graph provides insights into rearrangements that
may have occurred in the course of evolution. Some of these rearrangements are almost 'obvious',
while others involve long series of interacting breakpoints and provide evidence for extensive
breakpoint re-use in the course of evolution.
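The full Hannenhalli-Pevzner computation is beyond a short sketch, but the breakpoint count that the theory refines is easy to state: with the segments of one genome numbered by their order and orientation in the other, every inversion removes at most two breakpoints, so half the breakpoint count lower-bounds the inversion distance. A minimal illustration (not the GRIMM implementation):

    # Sketch (Python): breakpoint-count lower bound for a signed permutation.
    def breakpoints(signed_perm):
        extended = [0] + list(signed_perm) + [len(signed_perm) + 1]
        return sum(1 for a, b in zip(extended, extended[1:]) if b - a != 1)

    def inversion_lower_bound(signed_perm):
        return (breakpoints(signed_perm) + 1) // 2

    # e.g. [+3, -1, +2] has 4 breakpoints: at least 2 inversions are needed.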
5. Genome Landscape
GC Content. The human genome NCBI build 30 assembly and mouse genome MGSCv3
assembly were taken as genomic sequence. Both include runs of Ns for gaps in the sequence of
known, estimated, or unknown length, including centromeric sequence.
For analyses done on a genome wide or chromosomal scale, GC content was measured as
the total number of G or C bases in the genome or chromosome divided by the total number of
non-N bases in the same sequence.
For windowed analyses (histograms, correlations), each genome sequence was broken
into non-overlapping, abutting windows of fixed size (20 kb for GC distributions, ~100 kb for
syntenic windows, 320 kb for correlation with gene density) starting at the centromere for
acrocentric chromosomes and at the distal end of the p arm for metacentric chromosomes. All
windows are of identical size except the last window on the distal end (or distal q end) of each
chromosome, which contains the remainder bases regardless of number. Windows were analyzed
for GC content without regard to number of non-N bases; however, any window with fewer non-N bases than 50% of the nominal window size was eliminated to prevent artificially high variance
in the distribution. This eliminated no more than 1.7% of the non-null windows (2-7% of
windows depending on organism and window size consisted entirely of Ns as placeholders for
centromeres) or 0.7% of the total non-N bases for any organism/window combination and never
changed the global GC content value by more than 0.01%. The actual average number of non-N
bases per remaining window was 19,155 for mouse and 19,872 for human for 20-kb windows,
and 300,825 and 314,366 respectively for 320-kb windows.
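A minimal sketch of this windowed GC measurement, assuming the chromosome sequence is held as a string; handling of the final partial window is simplified relative to the description above.

    # Sketch (Python): abutting fixed-size windows, GC over non-N bases,
    # windows with fewer non-N bases than half the nominal size dropped.
    def gc_windows(seq, window=20_000):
        out = []
        for start in range(0, len(seq), window):
            w = seq[start:start + window].upper()
            non_n = sum(1 for b in w if b != "N")
            if non_n < window // 2:     # too gappy: would inflate variance
                continue
            gc = sum(1 for b in w if b in "GC")
            out.append((start, gc / non_n))
        return out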
Analysis of GC content of syntenic regions started with the high quality bidirectionally
unique anchors within syntenic segments (see above). Windows were selected as for the single
genome analysis, starting at the centromere of each mouse chromosome. Regions where no clear
synteny was present were skipped. We then selected non-overlapping, abutting windows which
were exactly 100 kb in the mouse sequence and interpolated the equivalent human position using
syntenic anchors and the average anchor spacing over the region. Due to small inversions in the
syntenic anchor order, some regions in the human may overlap. Actual average size of human
windows was 110 kb, as expected from the distribution of syntenic anchors. These regions were
then analyzed for GC content separately in each organism as described above. Pairs of windows
in which either organism had fewer than 50,000 non-N bases were discarded, effectively
eliminating all regions which were 2-fold or more shorter in human than mouse. Regions which
were 2-fold longer in human were not eliminated, but account for only 3% of windows.
For binned analyses of syntenic GC content, one organism was taken as the reference
organism and all of its windows binned in 1% increments centered on an integral percent GC
(i.e., 39.5–40.5). The GC distribution statistics of the second organism were then calculated by
window using all windows syntenic to each bin in the reference. No attempt was made to adjust
for different sizes and fractions of non-N bases in the windows.
For correlation of GC content with gene density, we took the Ensembl sets of mouse and
human gene predictions (mouse release 7.3b.2, July 12, 2002 and human release 8.30.1,
September 2, 2002). This gave us 22,444 mouse genes and 22,920 human genes (60 human genes
from the full ensembl set could not be used because they were predicted on unlocalized contigs
not included in the NCBI genome build 30). These genes were then assigned to the same 320-kb
bins in which GC content had been measured. If a gene spanned more than one bin it was
fractionally assigned proportional to the fraction of its total transcript length which lay within that
bin (so the total of all genes in all bins is the total number of genes, but bins may contain
fractional numbers of genes).
CpG Islands. CpG islands were identified on masked versions of MGSCv3 and NCBI 30 using a
modification of the program used in ref. 7 (K. Worley and L. Hillier, personal communication). This
program uses the definition of CpG islands proposed by Gardiner-Garden and Frommer8 of 200
bp containing ≥50% GC and ≥0.6 ratio of observed to expected CpG sites (based on local GC
content).
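A sketch of the Gardiner-Garden and Frommer test applied to a single candidate window; real island finders, including the program used here, also merge and trim qualifying windows, which this omits.

    # Sketch (Python): >=200 bp, GC >=50%, observed/expected CpG >=0.6,
    # with the expected CpG count (#C * #G) / length from local composition.
    def is_cpg_island_window(seq, min_len=200, min_gc=0.5, min_oe=0.6):
        seq = seq.upper()
        n = len(seq)
        if n < min_len:
            return False
        c, g = seq.count("C"), seq.count("G")
        if (c + g) / n < min_gc:
            return False
        expected = c * g / n
        return expected > 0 and seq.count("CG") / expected >= min_oe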
The calculations were also run independently varying minimum GC from 46–54%, o/e
from 0.4–0.8, and length from 100–400 bp. While parameter shifts in o/e and length requirements
significantly altered the total number of islands found in each organism, there was negligible
effect on the ratio of islands between the two organisms. Changes to the minimum GC resulted in
a very small change in the number of islands found, as the vast majority of islands in both
organisms significantly exceed this threshold.
Expansion ratio. Syntenic windows were determined as above. The ratio mouse/human was
calculated for all windows. Windows with a ratio <0.25 or >4 were excluded from
calculations/plots.
6. Repeats
Additional legend Figure 10. Age distribution of Interspersed Repeats (IRs) in the mouse
and human genomes. Bases covered by each repeat class were sorted by the estimated
substitution level from their respective consensus sequences. Divergence levels from the
RepeatMasker output were adjusted to account for 'mismatches' resulting from ambiguous bases
in the consensus and genomic sequences. Often sequencing gaps represented by strings of 100–
500 Ns were overlapped by the matches, which would lead to huge overestimates of the
divergence levels if not adjusted for. Since CpG->TpG transitions are about 10-fold more likely
to occur than all combined substitutions at another site, repeats with many CpG sites (like Alu)
are more diverged than those of the same age with few CpGs. We estimated the divergence level
excepting CpG->TpG transitions (Drest) from the adjusted observed divergence level (Dobs) and
the CpG frequency in the consensus (Fcg) as Drest = Dobs/(1 + 9Fcg), with a minimum Drest of
Dobs – Fcg. The substitution level K (which includes superimposed substitutions) was calculated
with the simple Jukes-Cantor formula K = –(3/4) ln(1 – (4/3)Drest). Panels (b) and (d) show them grouped
into bins of approximately equal time periods. On average, the substitution level has been 2-fold
higher in the mouse than in the human lineage (Table 6), but currently may differ over 4-fold.
Compared to the previous version, the scale on the x axis in panel (b) is larger, as we estimate in
this paper that the substitution level in mouse since the human-mouse speciation is at least 35%.
Also, the time periods in panels (b) and (d) are smaller, assuming a speciation time of 75–80
Mya, rather than 100 Mya.
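In code, the legend's two-step adjustment reads as follows; the example values are illustrative, not taken from the data.

    # Sketch (Python): CpG-excluded divergence, then Jukes-Cantor correction.
    from math import log

    def substitution_level(d_obs, f_cg):
        """d_obs: adjusted observed divergence; f_cg: CpG frequency in the
        consensus. Returns the Jukes-Cantor substitution level K."""
        d_rest = max(d_obs / (1 + 9 * f_cg), d_obs - f_cg)  # floor at Dobs - Fcg
        return -0.75 * log(1 - (4.0 / 3.0) * d_rest)

    # e.g. substitution_level(0.10, 0.05): d_rest ~ 0.069, K ~ 0.072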
Additional legend Table 6. Divergence levels of 18 families of IRs that shortly predate the
human-mouse speciation. Their copies are found at orthologous sites in mouse and human while
having a relatively low divergence level or representing the youngest members in the
evolutionary tree of LINE1 and MaLR elements. Shown are the number of kilobases matched by
each subfamily (kb), the median divergence (mismatch) level of all copies from the consensus
sequence (div), the interquartile range of these mismatch levels (range), and a Jukes-Cantor
estimate of the substitution level to which the median divergence level corresponds (JC). The two
right columns contain the ratio of the JC substitution level in mouse over human, and an 'adjusted
ratio' of the mouse and human substitution level after subtraction of the approximate fraction
accumulated in the common human-mouse ancestor. Many factors influence these numbers. For
example, AT-rich LINE1 copies appear less diverged than the GC-richer MaLR and DNA
transposon families of the same age, primarily because GC->AT substitutions are more common
than AT->GC substitutions, especially in the AT-rich DNA where most LINE copies reside7. Early
rodent-specific L1 and MaLR subfamilies are not yet defined, so their copies were matched
to the consensus sequences in the table (note that the youngest L1 subfamily, L1MA6, has
a relatively large amount of DNA matched to it). The associated, unduly high mismatch levels (L1 evolves
faster than the neutral rate!) will increase the rodent median and the substitution level ratio. On
the other hand, inaccuracies in the consensus and not-represented minor ancient subfamilies
contribute equally to the observed mismatches in both species and cause the ratio to be smaller.
Three more important factors cause a significant underestimate of the substitution level
in mouse compared to human. First, part of the substitutions in older families has accumulated in
the common ancestor. The difference in substitution level between the family and the least
diverged family in the class estimates this fraction, and is subtracted before calculating the ratio
in the last column of the table. Second, by assuming that all substitutions are equally likely, the
Jukes-Cantor formula significantly underestimates the number of superimposed substitutions at
higher divergence levels. For example, when considering substitution patterns, 30% mismatches
to an average DNA sequence in an average environment correspond to a 41% rather than 37–38%
substitution level. Finally, there undoubtedly is a bias of ascertainment for the least diverged
copies of the repeat family in mouse. Dependent on the length of the match, 30–35% mismatch
level is about the maximum that can be detected by RepeatMasker, so that the more diverged
copies are not tallied. The above suggest that the ratio of substitution rate in the lineages to
human and mouse is at least 2.0-fold.
Higher substitution rate in mouse lineage. We know that human and chimpanzee diverged ~6
Mya and show a 1.25% substitution level9, or 0.21% per My. This rate is likely to be even lower
in the longer-generation-time human branch. Mouse and rat diverged 10–20 Mya (14 Mya is used
by Huchon et al.10) and show a 10–12% substitution level11, giving 0.5–1.2% substitutions per My,
or a 2.5–6-fold difference in substitution rate per year between mouse and human.
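For checking, the per-My arithmetic can be reproduced directly:

    # The arithmetic behind these figures (Python):
    human_rate = 1.25 / 6              # ~0.21% substitutions per My
    mouse_rate_low = 10 / 20           # 10% over 20 My = 0.5% per My
    mouse_rate_high = 12 / 10          # 12% over 10 My = 1.2% per My
    print(mouse_rate_low / human_rate,   # ~2.4, i.e. the quoted ~2.5-fold
          mouse_rate_high / human_rate)  # ~5.8, i.e. the quoted ~6-fold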
Mouse Genes
Gene Build. We used 7 gene-building systems: Ensembl12 with additional proteins from the
Riken cDNA collection that had ORFs > 100 aa (23,026 genes), Fgenes++ (37,793 genes), the
NCBI pipeline, which included Genomescan13 predictions (46,158 genes), Genie14 (18,548
genes), an Ensembl EST-only build (46,646 genes), SGP15 (48,451 genes), Twinscan16 (48,462
genes), and SLAM17 (14,006 genes). These sets shared a core set of around 20,000 genes which
nearly all the methods predicted at least partially, but the methods varied outside this set. After an
intensive examination of the results we rationalised this set by taking the union of Ensembl and Genie
predictions as our starting point, as these methods were most confirmed by the other methods. We
then accepted any transcript predicted by the other methods if more than 80% of the Ensembl or
Genie transcript was contained within it, and substituted the longer transcript. The final set of
transcripts was clustered by single-linkage clustering, on the basis of one overlapping exon, into
“gene sets”, some of which will represent more than one recognisable gene. In our hands this
procedure was a good balance between coverage and specificity, in particular for the downstream
protein analysis. We produced 29,201 transcripts grouped into 22,011 gene sets.
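A minimal sketch of the single-linkage clustering step, assuming transcripts on the same chromosome and strand are given as lists of exon intervals; the union-find bookkeeping is an implementation choice, not a description of the actual pipeline.

    # Sketch (Python): transcripts joined into "gene sets" by single linkage,
    # where two transcripts are linked if any of their exons overlap.
    def exons_overlap(a, b):
        return any(s1 < e2 and s2 < e1 for s1, e1 in a for s2, e2 in b)

    def gene_sets(transcripts):
        """transcripts: dict of id -> list of (start, end) exons."""
        parent = {t: t for t in transcripts}
        def find(t):
            while parent[t] != t:
                parent[t] = parent[parent[t]]
                t = parent[t]
            return t
        ids = list(transcripts)
        for i, t1 in enumerate(ids):
            for t2 in ids[i + 1:]:
                if exons_overlap(transcripts[t1], transcripts[t2]):
                    parent[find(t1)] = find(t2)    # union the two clusters
        clusters = {}
        for t in ids:
            clusters.setdefault(find(t), []).append(t)
        return list(clusters.values())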
Orthology. We took matched Ensembl builds for human (build 7_29a) and mouse (7_3) which
were built from the same set of protein databases. Because of this symmetry, if a gene is
conserved between mouse and human we expect it to be present in either both genomes or
neither. We then took the longest transcript in each gene and used reciprocal best BLAST hits to
identify our starting ortholog pairs. Three ortholog pairs in order within 100 kb of each other
provided initial syntenic regions, which we then grew out along the chromosomes, allowing minor
rearrangements and misplaced genes as long as there was an additional gene pair consistent with
the synteny within 100 kb. This gene-level synteny map was in broad agreement with the DNA-level
synteny map. Into the resulting synteny map we placed cases with ambiguous best hits
where this was resolved by the synteny, and potential duplications. Three classes of genes
remained at the end of the analysis: apparent local duplicates of genes (5,431), unmatched
genes in the mouse genome (1,416), and genes with matches that disagreed with the synteny (2,705).
These cases were subjected to a second round of analysis. For unmatched mouse cases, a second
BLAST search on the human genome was performed, and then the main supporting evidence was
classified for this gene. We discarded genes with a poor (<150 bits) match (this step removes
certain classes of weak prediction artefacts) and genes whose only support was from the early
Riken cDNA project or that contained retroviral elements. The resulting 141 proteins were then manually
examined and showed believable biological signals, such as being involved in olfaction. For the
apparent local duplicates we took a sample of 50 genes and classified them as being fragmented
copies (10%), real genes (47%), local pseudogenes (usually only one or two exons) (36%) or
misprediction/unclassified (17%). For the misplaced genes we also took a sample of 50 genes and
examined them by hand; 76% were clear pseudogenes, 16% apparently real genes, 5% dubious
gene prediction and 3% misassembly (in all cases human). We applied these estimates to the
complete pools in each set to provide the overall numbers of pseudogenes (4,010), gene
prediction artefacts but likely to be real genes (543), gene predictions solely with Riken support
(591), other artefacts (838), real duplicates (2552) and real apparently unsyntenous genes (432).
Pseudogenes. The mouse pseudogene collection was obtained with a combination of sequence
similarity searches and a scaled-up version of a methodology that uses the ratio of non-synonymous to
synonymous substitutions (KA/KS) as an indicator of the absence of functionality. All repeat-masked regions between mouse genes were compared using BLASTx18 against a non-redundant
protein database comprising EMBL CDS translations + PDB + SwissProt + PIR annotations
(NRDB). The regions with significant sequence similarity (BLAST E value < 0.001) to (non-viral
and non-transposon) proteins were aligned to the closest protein match using Genewise19. In order
to confirm consistency and to determine the parentage, each aligned region was again compared
to NRDB and to the complete mouse gene collection. For each remaining region we inferred its
hypothetical 'ancestral' sequence with PAMP (PAML package20) using the cDNAs of the two
closest non-identical proteins of NRDB. The codon-based alignment of each region with its
'ancestral' sequence was then used to calculate the associated KA/KS ratios. The functionality test
was done by comparing with the 'least squares fitting' procedure the resulting KA/KS distribution
with two other KA/KS distributions obtained from reliable functional (2000 non-redundant full
length Riken cDNAs) and pseudogenic (processed and disrupted intergenic regions) training sets.
Estimation of the mammalian gene count from evidence based sets. From 128,847 exons with
support from the Riken set, 106,575 exons were also predicted by the initial gene build, giving an
estimated sensitivity of the Ensembl build at 79%. To estimate the specificity we took the
estimated numbers of pseudogenes and mispredictions from the orthology analysis, and using an
average of 3 exons per gene we estimate the number of miscalled exons at around 15,000, i.e., a
specificity of 92%. We examined other routes to estimate the sensitivity and specificity (e.g., at
the gene level, not the exon level) and all methods suffer from getting a sensible estimate of the
specificity of predictions. In our hands, exon-level calculations are more robust, as the number of
exons per gene varies dramatically across different gene prediction artefacts, making predicted
gene numbers a poor estimator of total genes.
RNA Genes. ncRNA annotation was performed essentially as described7. tRNA genes were
detected using tRNAscan-SE 1.2321. Other genes and pseudogenes were detected using WU-BLAST 2.0 (15 April 2002 release; W. Gish, unpublished).
Mouse Proteome
Amino acid sequences derived from the Mouse Gene Set were compared with protein and family
databases. The comparison with all other known protein sequences used BLASTP18 and the NCBI
non-redundant database (ftp://ftp.ncbi.nlm.nih.gov/blast/db/nr). Homology was inferred for all
alignments with expect values < 10–4. Taxonomic groupings were taken from the NCBI taxonomy
database (http://www.ncbi.nlm.nih.gov/Taxonomy/). Mouse proteins were also compared with
the InterPro classification of domains, motifs and proteins using InterProScan22. An identical
procedure was used for the annotation of human proteins from the Ensembl 4.28.1 set. The
assignment of Gene Ontology (GO) terms 23 to mouse and human proteins was derived from
InterproScan output (ftp://ftp.ebi.ac.uk/pub/databases/interpro/interpro2go).
Sequence comparisons of orthologues used Ensembl versions 6.3a (mouse) and 5.28 (human). All
accessions cited refer to these versions. For each of the 12,845 Ensembl 1:1 orthologous gene
pairs, the mouse and human transcripts which had the highest local alignment scores were
selected, and their protein sequences aligned using BLAST-2-Sequences24 and the BLOSUM80
substitution matrix. Their corresponding protein-coding DNA sequences were subsequently
aligned in accordance with the protein sequence alignments. Mouse-human orthologue pairs were
subdivided into different classes according to predicted subcellular localization25, enzymatic
activity and the presence or absence of InterPro domains. InterPro domains whose predicted loci
in mouse and human did not overlap by more than 80% of the aligned residues were ignored. This
information was collated within a PostgreSQL (http://www.postgresql.org/) relational database
management system (‘Panda’) devised for this project (L. Goodstadt, et al., unpublished). KA and
KS values were estimated using the method of Yang & Nielsen26. KA/KS ratios were only included
for orthologue pairs whose KA and KS were both greater than zero. In calculating the percentage
amino acid identity between two sequences, the number of amino acid identities was divided by
the total number of alignment positions, including positions where one sequence was aligned with
a gap.
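The identity convention of the last sentence, as a small function:

    # Sketch (Python): identities over all alignment columns, gaps included
    # in the denominator but never counted as identities.
    def percent_identity(aln_a, aln_b):
        """Equal-length aligned sequences with '-' for gaps."""
        matches = sum(a == b != "-" for a, b in zip(aln_a, aln_b))
        return 100.0 * matches / len(aln_a)

    # e.g. percent_identity("MK-LV", "MKQLI") -> 60.0 (3 of 5 columns)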
Homologous genes that are clustered in loci within the mouse genome were detected by
comparing the protein sequence of each gene with those of five genes on either side using
BLAST-2-Sequences24 and an expect value threshold of 10–4. For every gene of each cluster,
closely similar human homologues were detected using BLASTP. Subsequently, human
orthologues were assigned using the Ensembl orthologue pair set or by the construction of
dendrograms using Clustal-W and neighbour-joining methods27.
Genome Evolution: Mutation and Selection
Constructing the Alignments. Ancient interspersed repetitive elements (see Table 6) and tandem
repeats were soft-masked in both sequences using RepeatMasker, and lineage-specific repeats
(i.e., interspersed repeats believed to have inserted after the human-mouse split) were physically
removed. Human sequence from the June 2002 assembly (http://genome.ucsc.edu) was divided
into 1.01-Mb pieces starting every megabase in the human assembly, and mouse sequence was
divided into 30-Mb segments. The following nucleotide substitution scores, described by
Chiaromonte et al.28, were used throughout the alignment process:
        A     C     G     T
A      91  -114   -31  -123
C    -114   100  -125   -31
G     -31  -125   100  -114
T    -123   -31  -114    91
For steps that permit gaps in the alignment, a gap of length k was penalized by
subtracting 400 + 30k from the sum of substitution scores. Our alignment program, called blastz,
uses the familiar three-step strategy of Gapped BLAST18: find short near-exact matches,
extend each short match without allowing gaps, and extend each ungapped match that exceeds a
certain threshold by a dynamic-programming procedure that permits gaps.
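The matrix and gap penalty above, applied to a gapped alignment (a sketch; blastz's internal representation differs):

    # Sketch (Python): score an alignment with the substitution matrix above
    # and an affine gap penalty of 400 + 30k for each gap of length k.
    SCORE = {
        "A": {"A": 91, "C": -114, "G": -31, "T": -123},
        "C": {"A": -114, "C": 100, "G": -125, "T": -31},
        "G": {"A": -31, "C": -125, "G": 100, "T": -114},
        "T": {"A": -123, "C": -31, "G": -114, "T": 91},
    }

    def alignment_score(aln_a, aln_b):
        """Equal-length aligned sequences with '-' for gaps."""
        score = gap_run = 0
        for a, b in zip(aln_a.upper(), aln_b.upper()):
            if a == "-" or b == "-":
                gap_run += 1
            else:
                if gap_run:                     # close a gap of length k
                    score -= 400 + 30 * gap_run
                    gap_run = 0
                score += SCORE[a][b]
        if gap_run:
            score -= 400 + 30 * gap_run
        return score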
For short near-exact matches we adapted a very clever idea of Ma et al.2 that looks for
runs of 19 consecutive nucleotides in each sequence, within which the 12 positions indicated by a
'1' in the string
1110100110010101111
are identical. To increase sensitivity, we allow a transition (A-G, G-A, C-T or T-C) in any one of
the 12 positions. Each such match was checked to see whether it could be extended in both directions to
produce a gap-free alignment scoring at least 3000 when the sum of the substitution scores is
multiplied by a factor between 0 and 1 that measures sequence complexity in the aligned region,
as described by Chiaromonte et al.28. The aim of this factor is to make it harder for low-complexity regions (such as CACACA...) to match. Ancestral repeats were then unmasked and
the ungapped alignments extended by a dynamic-programming strategy that permits gaps. These
three steps were applied recursively to fill in gaps between adjacent syntenic aligned regions,
where the separation between alignments was at most 50 kb in each species, but using criteria that
are much more sensitive, e.g., the gap-free alignment that seeds a gapped alignment was required
to score only 2200 and the penalty for low-complexity regions was not imposed. Finally, the local
alignments were filtered such that in the roughly 2.5% of the human genome where there were
multiple regions in mouse aligning to the same region in human, only the most significant
alignment was retained. Source code for our alignment program, blastz, as well as code for
removing lineage-specific repeats and adjusting alignments to refer to the original sequence
positions, can be downloaded from http://bio.cse.psu.edu/. The code that we used to select one
alignment in human regions that aligned to several locations in the mouse genome can be
obtained from Jim Kent. The alignment was built on the 1000 CPU Linux cluster at UCSC. The
alignments can be viewed via the genome browser at http://genome.ucsc.edu.
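A sketch of the spaced-seed hit detection described above, indexing one sequence and scanning the other; generating the one-transition variants explicitly is an illustrative strategy, not necessarily how blastz implements it.

    # Sketch (Python): 19-mer pairs matching at the 12 '1' positions of the
    # seed, allowing one transition (A<->G or C<->T) among those positions.
    SEED = "1110100110010101111"
    ONES = [i for i, c in enumerate(SEED) if c == "1"]
    TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

    def seed_hits(target, query):
        """Yield (i, j) where target[i:i+19] and query[j:j+19] match."""
        n = len(SEED)
        index = {}
        for j in range(len(query) - n + 1):
            key = "".join(query[j + k] for k in ONES)
            index.setdefault(key, []).append(j)
        for i in range(len(target) - n + 1):
            word = [target[i + k] for k in ONES]
            # Exact pattern, plus variants with one transition substituted in.
            keys = {"".join(word)}
            for p, base in enumerate(word):
                for a, b in TRANSITIONS:
                    if base == a:
                        keys.add("".join(word[:p] + [b] + word[p + 1:]))
            for key in keys:
                for j in index.get(key, ()):
                    yield i, j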
Validating the Alignments. Three independent groups ran whole-genome alignments in June
2002. One group used PatternHunter with two weight 11 spaced seeds2. This ran in 28 CPU days.
Another group used the Avid global aligner 29 (Bray et al., unpublished data) anchored to
nucleotide BLAT30 hits (pipeline.lbl.gov). This approach took 20 CPU days. The third group ran
blastz as described above, which took 516 CPU days. We analysed the alignments in detail on
human chromosome 20. To help compare the alignments we took highly conserved, moderately
conserved, and lightly conserved subsets of each alignment and compared them with each other.
PatternHunter2 and blastz are both local aligners. Blastz alignments were generally a nearly
perfect superset of PatternHunter alignments. Blastz covered 96%, 96%, and 99.5% of the highly
conserved, moderately conserved, and lightly conserved PatternHunter alignments. Conversely
PatternHunter covered 89%, 75%, and 26% of the blastz alignments. The Avid/BLAT approach
incorporated a global aligner, and overlaps blastz less extensively. Blastz covered 88%, 85%, and
80% of the Avid alignments at the three levels. Conversely Avid covered 87%, 81%, and 67% of
blastz.
Note that these results were on the April freeze of the human genome, and all groups
have improved their methods since then.
Table SI 7 shows an alignment of a highly conserved coding region. The less conserved regions
are more difficult. Table SI 8 shows an alignment of a transposon relic that was inserted into the
mammalian genome before the mouse/human split. In making the program sensitive enough to
produce alignments such as these, we were concerned that we might also make it generate
alignments between regions that did not share a common ancestor. As a control for this, we
reversed (without complementing) the mouse genome, and repeated the mouse/human alignment.
0.16% of the human genome was covered by alignments of the reversed mouse genome, as
opposed to 40% of the genome covered by the forward alignments.
To get a sense for how much divergence blastz would tolerate while still making accurate
alignments we constructed two synthetic test sets. One test set consists of 1000 base pairs of
mouse DNA mutated by various degrees of base substitutions in silico but no insertions or
deletions (indels). We ran blastz on the original DNA vs. the mutated DNA. For base identities of
60% and more in our tests blastz does not put any insertions or deletions into this alignment.
When the base identity is 55%, blastz inserts an indel every 500 bases. When the base identity is
50% blastz does not produce any alignments in 20 of 20 tests. The second test set consists of the
same DNA mutated by various percentages of deletions, but no substitutions or insertions. When
the deletion rate reaches 3%, blastz introduces one substitution per 300 bases. At a deletion rate
of 4% blastz introduces one substitution per 200 bases. At a deletion rate of 11% blastz
introduces one substitution per 30 bases. At a deletion rate of 20% blastz only creates alignments
in 8 of 20 tests. Genome-wide, the blastz mouse/human alignments have an average base identity
of 70% and an insertion or deletion rate of just under 3%. Overall blastz alignments appear more
likely to diverge from true homology when the indel rate is higher than average than when the
substitution rate is higher than average. However the large majority of the mouse/human
alignments occur at a level of sequence divergence where blastz appears to have a very low error
rate.
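The substitution-only test set can be sketched as follows; uniform random replacement is our assumption of the simplest reading of "various degrees of base substitutions".

    # Sketch (Python): mutate a sequence by substitutions only, no indels.
    import random

    def substitute(seq, rate):
        out = []
        for b in seq:
            if random.random() < rate:
                out.append(random.choice([c for c in "ACGT" if c != b]))
            else:
                out.append(b)
        return "".join(out)

    # e.g. run blastz on (seq, substitute(seq, 0.40)) for ~60% base identity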
Estimating Amount of Human Genome Expected to Align to Mouse. A back-of-the-envelope
estimate that about 42% of the human genome would be expected to align to mouse is given in
the text. Here we present refinements on this calculation. To estimate turnover in the human and
mouse genomes since their common ancestor, we assume that the euchromatic genome size of the
ancestor was similar to the modern human genome at ~2.9 Gb. There would thus have been no
net change in genome size in the human lineage and an ~400 Mb decrease in the mouse lineage.
The change in size within aligning regions of the human genome from the size they were in the
ancestor is small: we estimate that they are 2% smaller in human than they were in the ancestor.
This estimate comes from alignments of ancestral repeat sequences to their consensus sequence.
Such alignments show a net loss of 2% when one subtracts deletions from insertions. Mouse
alignments show a larger net loss of about 5.6%.
Other processes that cause the size to change are large-scale insertions, deletions, and
duplications. The human genome has added ~700 Mb of lineage-specific repeat sequence since
the common ancestor (see section of the paper on repeats). It has also added new sequence by
segmental duplication, which, excluding the lineage-specific insertions within the
duplicated segments, adds another 2% or about 60 Mb. Local tandem duplications are estimated to
have added an additional 50 Mb. This gives a total amount of new DNA from these processes of
about 810 Mb. This is offset by only about 20 Mb of shrinkage in the alignable portions inherited
from the ancestor (anticipating that very roughly half will be found to be inherited). This gives a
net increase of 790 Mb. This increase would have been offset by a roughly comparable amount of
deletion. This implies that ~27% (=790/2900) of the ancestral genome would have been deleted
in the human lineage and about 73% retained.
The mouse genome has added at least 820 Mb of lineage-specific repeat sequence, but
this is likely to be an underestimate due to the difficulty of identifying older repeats in mouse
(section of the paper on repeats). We estimate the actual number is at least 10% higher, or about
900 Mb. The amount of segmental and tandem duplication in mouse is unknown, but it is
conservatively at least 1%, adding another 25 Mb. The alignable portion of the mouse genome
has shrunk by 5.6% as mentioned above, which is about 70 Mb if we assume about half is
retained. This gives a net increase of 900 + 25–70 = 855 Mb. Assuming the ancestor genome was
human-sized, this implies the mouse genome has shrunk by ~400 Mb while gaining 855 Mb.
Hence there would have been ~1255 Mb of deletion from the common ancestral genome. This
would correspond to a deletion rate of ~43.3% (=1255/2900) or a retention rate of 56.7%.
Assuming the deletions fixed in each lineage were random and uncorrelated, then the expected
proportion of the ancestral genome retained in both lineages would be ~41.4% (=73% x 56.7%).
Some of this ancestral genome (about 2% as above) will be duplicated in human and both
copies are expected to align to mouse. This means that the fraction of the human genome
expected to align to mouse will be approximately 2% larger, which is about 42.2%, similar to the
simple back-of-the-envelope estimation given in the text.
If instead we assume that the genome of the common ancestor was about the size of the
mouse genome, then the mouse large-scale deletion rate is estimated at 34.2% (=855/2500) and
the human at 15.6% (=(790-400)/2500), giving a retained proportion of the mouse-sized ancestral
genome of 55.5% (=65.8% x 84.4%). This is 1,388 Mb, which with 2% duplications gives an estimate of
about 49% of the human genome that would be alignable to mouse.
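Both scenarios, reproduced numerically from the figures quoted above:

    # Worked arithmetic (Python) for the two ancestor-size scenarios.
    ancestor = 2900.0                      # Mb, human-sized ancestor
    human_gain = 700 + 60 + 50 - 20        # repeats + duplications - shrinkage
    human_retained = 1 - human_gain / ancestor          # ~0.73
    mouse_gain = 900 + 25 - 70
    mouse_deleted = mouse_gain + 400       # genome also shrank ~400 Mb overall
    mouse_retained = 1 - mouse_deleted / ancestor       # ~0.567
    print(human_retained * mouse_retained * 1.02)       # ~0.42 of human aligns

    # Mouse-sized (2,500 Mb) ancestor instead:
    mouse_ret2 = 1 - 855 / 2500.0
    human_ret2 = 1 - (790 - 400) / 2500.0
    print(mouse_ret2 * human_ret2 * 2500 * 1.02 / 2900)  # ~0.49 of human genome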
Estimating deletion. Comparison of ARs to their consensus sequence also allows a detailed
estimate of the rate of small insertions and deletions (indels) and a rough estimate to the overall
amount of DNA loss. From the consensus alignments it can be estimated that both species show a
net loss of nucleotides due to 1-10 bp indels, since deleted bases outnumber inserted bases by at
least 2-fold. The net loss is about 1.5-2.0% in human, but as much as 4.5-5.6% in mouse,
confirming early observations by Graur et al.31.
To get some idea of larger-scale deletions, one can assume for a subset of IRs (e.g., the
LTRs of retrovirus-like elements) that each observed fragment derives from a once-complete
element in the genome. This is a fairly crude measure (e.g., completely deleted elements cannot
be accounted for), but for 270 LTR elements in human and 150 in mouse, there is a decent linear
correlation between the estimated fraction lost and the substitution level of each family, giving
1.2% and 1.5% sequence loss per % substitution, for human (R2 = 0.39) and mouse (R2 = 0.46)
respectively. This would estimate that since the human-mouse split, at least 20% of Mesozoic
non-functional DNA has been lost in human and over 50% in mouse (Fig SI 6).
In a different approach, we examined 8 orthologous genomic regions,
representing a total genomic extent of 34.7 Mb in the mouse assembly. All of the regions were
chosen from finished areas of the human assembly with clear orthology to the mouse assembly.
Orthologous landmarks for each mouse/human region were identified, providing fine-scale
correspondence between the regions, much in the same way that the global synteny map was constructed.
Consecutive orthologous landmarks were used to define small orthologous windows along
each region. Specifically, we created 4 sets of windows for varying minimum size cutoffs: 0 kb,
1 kb, 2 kb, and 5 kb. Each pair of orthologous landmarks spanning a gap-free region in human
and mouse defines a window in the "0 kb" or no-cutoff set. Larger-cutoff windows were defined
by grouping consecutive landmarks into regions above the cutoff size in both genomes.
Within each window, ancient repeats were identified in both species using the methods
described elsewhere in the paper. The identification of repeats as ancestral is done independently
for each species and relies on the postprocessing of RepeatMasker output. We attempted to pair
each ancient repeat fragment found in one species with a corresponding fragment in the other
species using two different pairing rules:
(1) pair all ancient repeat fragments belonging to the same family; and
(2) pair those repeats from the same family and occurring in the same orientation with respect to
the repeat consensus sequence.
The repeat family and relative orientation come from RepeatMasker annotations. As a result,
each ancient repeat base is categorized uniquely as 'human only', 'mouse only', or 'common'.
Pairing occurs only within the small orthologous windows.
The overall rate of deletion of ancient repeats in the human lineage is estimated as the fraction of
total mouse ancient repeats (mouseOnly + common) that is present only in mouse:
(human deletion rate) = mouseOnly / (common + mouseOnly)
See Table SI 5 for a summary of results.
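A sketch of pairing rule (1) and the resulting estimator; this counts repeat fragments rather than bases, for brevity.

    # Sketch (Python): family-only pairing within one orthologous window.
    from collections import Counter

    def pair_by_family(mouse_repeats, human_repeats):
        """Lists of repeat family names within one orthologous window."""
        m, h = Counter(mouse_repeats), Counter(human_repeats)
        common = sum(min(m[f], h[f]) for f in m.keys() & h.keys())
        return common, sum(m.values()) - common, sum(h.values()) - common

    def human_deletion_rate(common, mouse_only):
        # Repeats seen only in mouse were, by inference, deleted in human.
        return mouse_only / (common + mouse_only)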
As there are a limited number of repeat family types, and in particular LINE and SINE
elements make up a large fraction of the identified ancient repeats, there is a possibility of
mispairing repeats, particularly for larger windows. To gauge the extent of this background, or
random pairing, we have carried out negative control experiments in which the human repeat
annotations for each window are replaced by the annotations from a different (nonsyntenic) area
of the human genome. The anchors and window structure stays fixed, as does the mouse repeat
population; only the human repeat pattern changes. In the absence of any random pairing, the
human deletion rate estimated from the synthetic regions would be 100%.
Instead we find estimated deletion rates of ~70% for the family-only pairing and ~75%
for the family-and-orientation pairing. This clearly indicates that the pairing observed in the
orthologous regions is not attributable to statistical fluctuation.
Collecting 4D Sites. We collected a set of codons on the June 2002 assembly of the human
genome defined by BLAT30 alignments of 9,562 RefSeq cDNAs to the genome sequence that
passed certain quality checks at the genome level. In particular, we checked that (1) the human
CDS began with a start codon on the genome, ended with a stop codon, and had no in-frame stop
codons, (2) the human introns were GT/AG, GC/AG, or AT/AC, and (3) that there was blastz-aligned mouse sequence at this genomic location identified by BLAT that had no in-frame stop
codons except in the last 20 codons of the human gene. For every gene that passed our quality
checks, we extracted the aligned human and mouse bases from the third position of any 4-fold
degenerate codon where the first two bases were conserved between human and mouse, i.e., at
sites marked 'x' in the codons GCx (ALA), CCx (PRO), TCx (SER), ACx (THR), CGx (ARG),
GGx (GLY), CTx (LEU), and GTx (VAL). This formed our collection of 4D sites.
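A sketch of the 4D-site extraction over a pair of codon-aligned, gap-free coding sequences:

    # Sketch (Python): third positions of four-fold degenerate codons where
    # the first two aligned bases agree between the species.
    FOURFOLD = {"GC", "CC", "TC", "AC", "CG", "GG", "CT", "GT"}  # ALA..VAL

    def collect_4d_sites(human_cds, mouse_cds):
        sites = []
        for i in range(0, len(human_cds) - 2, 3):
            h, m = human_cds[i:i + 3], mouse_cds[i:i + 3]
            if h[:2] == m[:2] and h[:2] in FOURFOLD:
                sites.append((h[2], m[2]))
        return sites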
Regional Estimates of Conservation Levels. To define the conservation score S for a given
small window (50 bp or 100 bp), we need to calculate the fraction of bases that are identical
between human and mouse from ancestral repeat (AR) sites around the given window. We do this
by locating 6,000 aligned AR sites around the window, i.e., rather than using a fixed-size region in
the human genome around the window, we use a variable-sized region with a fixed number of
sites in it. This helps to control the sample variance in the calculation. The average size of the
surrounding region ends up being roughly 100 kb. The number 6,000 was chosen to minimize the
empirical variance in the scores for all 100 bp windows in ancestral repeats genome-wide.
Obtaining Smooth Densities and Mixture Decomposition of Conservation Score. We employ
Gaussian kernel smoothers to produce the smooth density functions Sneutral and Sgenome, using
scores from 50bp windows with at least 15 aligned bases (only windows in ancestral repeats for
Sneutral and all such windows for Sgenome). To decompose Sgenome into the mixture p0*Sneutral + (1 – p0)*Sselected,
we determine the score at which the minimum of Sgenome/Sneutral is achieved, and set p0 to
this minimum value.
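The decomposition in code, assuming both smoothed densities have been evaluated on a common grid of scores:

    # Sketch (Python/NumPy): p0 = min of the density ratio; Sselected follows.
    import numpy as np

    def decompose(s_genome, s_neutral):
        mask = s_neutral > 0
        p0 = float(np.min(s_genome[mask] / s_neutral[mask]))
        s_selected = (s_genome - p0 * s_neutral) / (1 - p0)
        return p0, s_selected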
Measuring GC Content. To define the GC content of a 5-Mb window in the human genome, we
used only the sites in the window that were aligned to mouse, and counted the fraction that were
G or C. To define the difference between the human and mouse GC contents in the window, we
performed the same calculation on the aligned mouse sites and subtracted the result. As an
alternative to this method, we also tried using all human bases in the window, and all mouse bases
in a comparably sized region around the aligned mouse bases. The results were qualitatively
similar (data not shown).
Measuring Recombination Rate. The deCode genetic map markers32 were mapped to the June
assembly of the human genome by BLAT30. Placements with monotonically increasing genetic
distance across the chromosome were found for 4,930 of 5,136 markers. Each base in the genome
between two consecutive markers was assigned a recombination rate by linearly interpolating the
genetic distances of these markers. The recombination rate of a 5 Mb region was calculated as the
average of the recombination rates for each base in the region. Visual comparison to the plots
given in association with ref. 32 confirmed that this method gave similar results to the spline method
used in that study.
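A sketch of the interpolation and averaging, with markers given as (base position, cumulative genetic distance) pairs:

    # Sketch (Python): per-base rate by linear interpolation between markers.
    def interval_rates(markers):
        """markers: (bp_position, genetic_distance_cM), sorted by position.
        Returns (start_bp, end_bp, cM_per_Mb) per inter-marker interval."""
        return [(p1, p2, (g2 - g1) / (p2 - p1) * 1e6)
                for (p1, g1), (p2, g2) in zip(markers, markers[1:])]

    def mean_rate(intervals, start, end):
        """Average per-base rate over [start, end), weighted by overlap."""
        total = 0.0
        for p1, p2, rate in intervals:
            overlap = max(0, min(end, p2) - max(start, p1))
            total += overlap * rate
        return total / (end - start)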
Genetic Variation Among Strains
WGS paired-end reads from 4-kb plasmids were generated from three strains chosen based on
recommendations from the Mouse Liaison Group. 119,232, 68,160 and 38,400 attempted reads
were produced from 129S1/SvImJ, C3H/HeJ and BALB/cByJ respectively. Vector- and quality-trimmed
reads were assessed for 20-base windows having an average quality score below Phred
20. The longest block of sequence between any two such windows, if >250 bp, was passed on to SNP
discovery. SNP detection was performed by SSAHA-SNP33 using 500-bp windows and a
maximum of 10 SNPs per window. The numbers of reads passing our quality thresholds, SSAHA-SNP,
and post-SSAHA filtering for repetitive sequence were 67,974, 34,949, and 19,686 for
strains 129S1/SvImJ, C3H/HeJ and BALB/cByJ respectively.
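A sketch of the quality screen, under our reading that a read's usable block is the longest stretch not covered by any failing 20-base window:

    # Sketch (Python): mark bases in any 20-base window whose mean Phred
    # quality is below 20, then keep the longest clean run if >= 250 bp.
    def longest_quality_block(quals, win=20, min_q=20, min_len=250):
        bad = [False] * len(quals)
        for i in range(len(quals) - win + 1):
            if sum(quals[i:i + win]) / win < min_q:
                for j in range(i, i + win):
                    bad[j] = True
        best_len, best_span, run = 0, None, 0
        for j, flag in enumerate(bad + [True]):    # sentinel flushes last run
            if not flag:
                run += 1
            else:
                if run > best_len:
                    best_len, best_span = run, (j - run, j)
                run = 0
        return best_span if best_len >= min_len else None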
References
1. Mural, R. J. et al. A comparison of whole-genome shotgun-derived mouse chromosome 16 and the human genome. Science 296, 1661-1671 (2002).
2. Ma, B., Tromp, J. & Li, M. PatternHunter: faster and more sensitive homology search. Bioinformatics 18, 440-445 (2002).
3. Hannenhalli, S. & Pevzner, P. in Proceedings of the 36th IEEE Symposium on Foundations of Computer Science 581-592 (IEEE, Milwaukee, WI, 1995).
4. Tesler, G. Efficient algorithms for multichromosomal genome rearrangements. J Comput Syst Sci (in press) (2002).
5. Tesler, G. GRIMM: genome rearrangements web server. Bioinformatics 18, 492-493 (2002).
6. Bourque, G. & Pevzner, P. A. Genome-scale evolution: reconstructing gene orders in the ancestral species. Genome Res 12, 26-36 (2002).
7. IHGSC. Initial sequencing and analysis of the human genome. Nature 409, 860-921 (2001).
8. Gardiner-Garden, M. & Frommer, M. CpG islands in vertebrate genomes. J Mol Biol 196, 261-282 (1987).
9. Ebersberger, I., Metzler, D., Schwarz, C. & Paabo, S. Genomewide comparison of DNA sequences between humans and chimpanzees. Am J Hum Genet 70, 1490-1497 (2002).
10. Huchon, D. et al. Rodent phylogeny and a timescale for the evolution of Glires: evidence from an extensive taxon sampling using three nuclear genes. Mol Biol Evol 19, 1053-1065 (2002).
11. Bulmer, M., Wolfe, K. H. & Sharp, P. M. Synonymous nucleotide substitution rates in mammalian genes: implications for the molecular clock and the relationship of mammalian orders. Proc Natl Acad Sci USA 88, 5974-5978 (1991).
12. Hubbard, T. et al. The Ensembl genome database project. Nucleic Acids Res 30, 38-41 (2002).
13. Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J Mol Biol 268, 78-94 (1997).
14. Kulp, D., Haussler, D., Reese, M. G. & Eeckman, F. H. Integrating database homology in a probabilistic gene structure model. Pac Symp Biocomput, 232-244 (1997).
15. Guigó, R. et al. Comparison of mouse and human genomes yields more than 1000 additional mammalian genes. submitted companion (2002).
16. Flicek, P., Keibler, E., Hu, P., Korf, I. & Brent, M. Comparative gene prediction in mouse and human: from whole-genome shotgun reads to global synteny map. Genome Res submitted companion (2002).
17. Pachter, L., Alexandersson, M. & Cawley, S. Applications of generalized pair hidden Markov models to alignment and gene finding problems. J Comput Biol 9, 389-399 (2002).
18. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389-3402 (1997).
19. Birney, E. & Durbin, R. Using GeneWise in the Drosophila annotation experiment. Genome Res 10, 547-548 (2000).
20. Yang, Z. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 13, 555-556 (1997).
21. Lowe, T. M. & Eddy, S. R. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25, 955-964 (1997).
22. Zdobnov, E. M. & Apweiler, R. InterProScan — an integration platform for the signature-recognition methods in InterPro. Bioinformatics 17, 847-848 (2001).
23. The Gene Ontology Consortium. Creating the gene ontology resource: design and implementation. Genome Res 11, 1425-1433 (2001).
24. Tatusova, T. A. & Madden, T. L. BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol Lett 174, 247-250 (1999).
25. Mott, R., Schultz, J., Bork, P. & Ponting, C. P. Predicting protein cellular localization using a domain projection method. Genome Res 12, 1168-1174 (2002).
26. Yang, Z. & Nielsen, R. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol Biol Evol 17, 32-43 (2000).
27. Thompson, J. D., Higgins, D. G. & Gibson, T. J. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22, 4673-4680 (1994).
28. Chiaromonte, F., Yap, V. B. & Miller, W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput, 115-126 (2002).
29. Couronne, O. et al. Strategies and tools for whole genome alignments. Genome Res submitted companion (2002).
30. Kent, W. J. BLAT — the BLAST-like alignment tool. Genome Res 12, 656-664 (2002).
31. Graur, D., Shuali, Y. & Li, W. H. Deletions in processed pseudogenes accumulate faster in rodents than in humans. J Mol Evol 28, 279-285 (1989).
32. Kong, A. et al. A high-resolution recombination map of the human genome. Nat Genet 31, 241-247 (2002).
33. Ning, Z., Cox, A. J. & Mullikin, J. C. SSAHA: a fast search method for large DNA databases. Genome Res 11, 1725-1729 (2001).