Additional methods and discussion

Generating the Draft Genome Sequence

Screening for contamination in the sequence. Decontamination of the sequence reads was carried out in multiple stages:

(1) In initial partial assemblies, we looked for unexpected clustering and for plates of reads whose coverage by the rest of the assembly was anomalous (too low or too high). Such suspicious plates were aligned against probable contaminants.

(2) After the final assembly, we discarded supercontigs whose reads tended to be plate-mates of reads residing in the bottom 5% of the assembly. More specifically: (a) We scored each physical plate by assigning it the ratio (# of reads on the plate whose supercontig has length ≥1 Mb) / (# of reads on the plate whose supercontig has length <1 Mb). (b) We scored each read by assigning it the corresponding plate score. (c) We scored each supercontig by assigning it the median of its reads' scores. (d) We discarded supercontigs having score <10. There were 446 such supercontigs, of which 66% had score 0. They contained a total of 11,850 reads, 81% of which were accounted for by the 18 largest of the discarded supercontigs, all of which had score 0. Note: we also checked for single-center contigs having probability <10⁻⁶, but all were already discarded.

(3) We deleted supercontigs having at least seven reads, all from the same library, as well as supercontigs that, together with all supercontigs linked to them, have at least seven reads, all from the same library. These steps were applied to the assembly prior to the deletion of tiny supercontigs, and the resulting list was translated back to the final assembly.

(4) If all the reads from a supercontig came from a single center, and moreover all the reads in all the supercontigs it links to came from that same center, the supercontig was deleted. This was applied to the assembly prior to the deletion of tiny supercontigs, and the list was then translated to the final (reduced) assembly.
(5) Remove suspected contaminants based on alignments to human sequence. This included looking for regions of the human genome that had too many mouse reads aligning to them.

Almost all the contamination was removed by (1). The remaining steps removed a total of 1,902 contigs, totaling 3.6 Mb of sequence. Very little was removed by (5), although the method of (5) was used to tune methods (2), (3), and (4). In short, the decontamination methods were almost entirely intrinsic: based on internal inconsistencies rather than alignment to specific contaminants. In spite of these precautionary steps, we note that the unanchored part of the assembly is necessarily enriched for errors of many kinds.

Genome size. Euchromatic genome size was estimated by looking at the scaffolds and captured gaps, which suggests a genome size of 2.5 Gb. A small fraction of the unanchored part of the assembly will also contribute to the euchromatic size of the genome. It is currently hard to estimate exactly how much of the unanchored sequence will fall in uncaptured gaps and at centromeric and telomeric ends that contribute to the euchromatic part of the genome, but we believe the majority will fall in captured gaps or in heterochromatic regions. Thus, we suggest that the genome size is 2.5 Gb or slightly larger.

Comparison to Mural et al. Chromosome 16. Finished sequence used for comparison: B6 BACs AC079043, AC079044, AC083895, AC087541, AC087556, AC087840, AC087899, AC087900, AC098735, AC098883, and 129 BACs AC000096, AC003060, AC003062, AC003063, AC003066, AC005816, AC005817, AC006082, AC008019, AC008020, AC010001, AC012526. We note that the BACs used for evaluation purposes in Mural et al.1 were actually from Chromosome 6.

Conservation of Synteny Between Mouse and Human Genomes

Identification of orthologous landmarks. Full genomic alignments of the masked mouse (MGSCv3) and human (NCBI build 30) assemblies were carried out using the PatternHunter program2.
Only those alignments that were (1) high scoring, i.e., scoring ≥40 according to a standard additive scoring scheme (match = +1, mismatch = −1, gap open = −5, gap extend = −1), and (2) bidirectionally unique at this scoring threshold were used to identify orthologous landmarks in both genomes.

Identification of syntenic blocks and segments. We first identified syntenic blocks; from these we derived the collection of syntenic segments. Geometrically, syntenic blocks correspond to rectangular regions in the mouse/human dot plots, while segments are curves with clear directionality within each rectangle. Syntenic blocks are therefore defined by interchromosomal discontinuities, while syntenic segments are determined by intrachromosomal rearrangements, typically inversions. A syntenic block of size X: (1) is a contiguous region of at least size X along a mouse chromosome that is paired to a contiguous region in a human chromosome, also of size X or larger; and (2) has all interruptions by other chromosomal regions (in either genome) smaller than size X. Size can be measured either in terms of genomic extent in bases or as the number of consecutive orthologous landmarks. Our methodology constructs low-resolution blocks (large cutoff) from high-resolution blocks. For example, at the highest resolution possible, every anchoring alignment is allowed to define or interrupt a syntenic block. To then obtain blocks defined by at least two consecutive landmarks in both genomes, singletons in either genome would be identified in the highest-resolution list and absorbed into pre-existing larger blocks. In a similar manner, our methodology will coalesce smaller blocks into larger blocks for any size cutoff while keeping segment boundaries as stable as possible. In our algorithm, one genome is selected as the reference genome and determines the order in which the blocks are listed.
However, the blocks themselves are independent of the choice of reference genome. In fact, changing reference frame from mouse to human provided a non-trivial consistency check on the construction of the syntenic blocks. A syntenic segment of size X in mouse: (1) is always contained within a syntenic block; (2) exhibits clear directionality, with at least four successive markers in strictly increasing or decreasing order in both genomes; and (3) is interrupted only by segments smaller than X in mouse. Note that there is no size restriction placed on the corresponding human extent. Intrachromosomal rearrangements within syntenic blocks were grouped into syntenic segments by aggregating at successively coarser scales. More care is required when coalescing segments, compared to blocks, to ensure that the resulting segments are truly reciprocal (one mouse region paired to only one human region, and conversely). When defining segments, we excluded isolated outliers that seemed likely to be attributable to misassemblies or sequencing errors, a typical case being a single misplaced BAC. However, the fate of every syntenic landmark, including apparent outliers, was kept as part of a 'syntenic roadmap' to facilitate the coordinated, simultaneous navigation of both genomes. Further details and dot plots can be found at: http://www-genome.wi.mit.edu/mouse/synteny/index.html

Estimation of the minimal number of rearrangements. The estimate of the number of rearrangements is based on the Hannenhalli-Pevzner theory for computing a most parsimonious (minimum number of inversions) scenario to transform one uni-chromosomal genome into another. This approach was further extended to find a most parsimonious scenario for multichromosomal genomes under inversions, translocations, fusions, and fissions of chromosomes3,4.
We used a fast implementation of this algorithm5 (available via the GRIMM web server at: http://www-cse.ucsd.edu/groups/bioinformatics/GRIMM/index.html) to analyze the human–mouse rearrangement scenario. Although the algorithm finds a most parsimonious scenario, the real scenario is not necessarily a most parsimonious one, and the order of rearrangement events within a most parsimonious scenario often remains uncertain. Availability of three or more mammalian genomes could remedy some of these limitations and provide a means to infer the gene order in the mammalian ancestor6. The key element of the Hannenhalli-Pevzner theory is the notion of the breakpoint graph, which captures the relationships between different breakpoints (versus the analysis of individual breakpoints in previous studies). The breakpoint graph provides insights into rearrangements that may have occurred in the course of evolution. Some of these rearrangements are almost 'obvious', while others involve long series of interacting breakpoints and provide evidence for extensive breakpoint re-use in the course of evolution.

5. Genome Landscape

GC Content. The human genome NCBI build 30 assembly and mouse genome MGSCv3 assembly were taken as genomic sequence. Both include runs of Ns for gaps in the sequence of known, estimated, or unknown length, including centromeric sequence. For analyses done on a genome-wide or chromosomal scale, GC content was measured as the total number of G or C bases in the genome or chromosome divided by the total number of non-N bases in the same sequence. For windowed analyses (histograms, correlations), each genome sequence was broken into non-overlapping, abutting windows of fixed size (20 kb for GC distributions, ~100 kb for syntenic windows, 320 kb for correlation with gene density), starting at the centromere for acrocentric chromosomes and at the distal end of the p arm for metacentric chromosomes.
All windows are of identical size except the last window on the distal end (or distal q end) of each chromosome, which contains the remainder bases regardless of number. Windows were analyzed for GC content without regard to number of non-N bases; however, any window with fewer non-N bases than 50% of the nominal window size was eliminated, to prevent artificially high variance in the distribution. This eliminated no more than 1.7% of the non-null windows (2–7% of windows, depending on organism and window size, consisted entirely of Ns as placeholders for centromeres) or 0.7% of the total non-N bases for any organism/window combination, and never changed the global GC content value by more than 0.01%. The actual average number of non-N bases per remaining window was 19,155 for mouse and 19,872 for human for 20-kb windows, and 300,825 and 314,366 respectively for 320-kb windows. Analysis of GC content of syntenic regions started with the high-quality, bidirectionally unique anchors within syntenic segments (see above). Windows were selected as for the single-genome analysis, starting at the centromere of each mouse chromosome. Regions where no clear synteny was present were skipped. We then selected non-overlapping, abutting windows which were exactly 100 kb in the mouse sequence and interpolated the equivalent human position using syntenic anchors and the average anchor spacing over the region. Due to small inversions in the syntenic anchor order, some regions in the human may overlap. The actual average size of the human windows was 110 kb, as expected from the distribution of syntenic anchors. These regions were then analyzed for GC content separately in each organism as described above. Pairs of windows in which either organism had fewer than 50,000 non-N bases were discarded, effectively eliminating all regions which were 2-fold or more shorter in human than mouse. Regions which were 2-fold longer in human were not eliminated, but account for only 3% of windows.
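The windowed GC measurement and the 50%-non-N elimination rule described above can be sketched as follows; the function name and the convention of marking eliminated windows as None are our own choices, and remainder-window handling is simplified.

```python
def windowed_gc(seq, window=20_000, min_frac=0.5):
    """GC content in non-overlapping, abutting windows. Windows with fewer
    non-N bases than min_frac * (nominal window size) are eliminated
    (returned as None), as in the text."""
    out = []
    for start in range(0, len(seq), window):
        w = seq[start:start + window].upper()
        non_n = sum(1 for c in w if c != 'N')
        if non_n < min_frac * window:
            out.append(None)  # e.g. centromeric placeholder runs of Ns
        else:
            out.append(sum(1 for c in w if c in 'GC') / non_n)
    return out
```

GC content is computed over non-N bases only, matching the genome-wide definition in the text.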
For binned analyses of syntenic GC content, one organism was taken as the reference organism and all of its windows were binned in 1% increments centered on an integral percent GC (e.g., 39.5–40.5). The GC distribution statistics of the second organism were then calculated by window, using all windows syntenic to each bin in the reference. No attempt was made to adjust for different sizes and fractions of non-N bases in the windows. For correlation of GC content with gene density, we took the Ensembl sets of mouse and human gene predictions (mouse release 7.3b.2, July 12, 2002, and human release 8.30.1, September 2, 2002). This gave us 22,444 mouse genes and 22,920 human genes (60 human genes from the full Ensembl set could not be used because they were predicted on unlocalized contigs not included in NCBI genome build 30). These genes were then assigned to the same 320-kb bins in which GC content had been measured. If a gene spanned more than one bin, it was fractionally assigned to each bin in proportion to the fraction of its total transcript length lying within that bin (so the total of all genes in all bins is the total number of genes, but bins may contain fractional numbers of genes).

CpG Islands. CpG islands were identified on masked versions of MGSCv3 and NCBI build 30 using a modification of the program used previously7 (K. Worley and L. Hillier, personal communication). This program uses the definition of CpG islands proposed by Gardiner-Garden and Frommer8: 200 bp containing ≥50% GC and a ≥0.6 ratio of observed to expected CpG sites (based on local GC content). The calculations were also run independently, varying minimum GC from 46–54%, o/e from 0.4–0.8, and length from 100–400 bp. While parameter shifts in o/e and length requirements significantly altered the total number of islands found in each organism, there was negligible effect on the ratio of islands between the two organisms.
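A minimal sketch of the Gardiner-Garden and Frommer criterion for a single window (the function name is ours; the published program scans the genome and merges qualifying windows, which is omitted here):

```python
def is_cpg_island(seq, min_gc=0.5, min_oe=0.6):
    """Gardiner-Garden & Frommer test for one window (e.g. 200 bp):
    GC fraction >= 50% and observed/expected CpG >= 0.6, where
    expected CpG = (#C * #G) / length."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count('C'), seq.count('G')
    if n == 0 or (c + g) / n < min_gc:
        return False
    expected = c * g / n
    return expected > 0 and seq.count('CG') / expected >= min_oe
```

The min_gc and min_oe parameters correspond directly to the thresholds that were varied in the robustness checks above.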
Changes to the minimum GC resulted in a very small change in the number of islands found, as the vast majority of islands in both organisms significantly exceed this threshold.

Expansion ratio. Syntenic windows were determined as above. The ratio mouse/human was calculated for all windows. Windows with a ratio <0.25 or >4 were excluded from calculations/plots.

6. Repeats

Additional legend, Figure 10. Age distribution of interspersed repeats (IRs) in the mouse and human genomes. Bases covered by each repeat class were sorted by the estimated substitution level from their respective consensus sequences. Divergence levels from the RepeatMasker output were adjusted to account for 'mismatches' resulting from ambiguous bases in the consensus and genomic sequences. Often, sequencing gaps represented by strings of 100–500 Ns were overlapped by the matches, which would lead to huge overestimates of the divergence levels if not adjusted for. Since CpG->TpG transitions are about 10-fold more likely to occur than all combined substitutions at another site, repeats with many CpG sites (like Alu) are more diverged than those of the same age with few CpGs. We estimated the divergence level excepting CpG->TpG transitions (Drest) from the adjusted observed divergence level (Dobs) and the CpG frequency in the consensus (Fcg) by Drest = Dobs/(1 + 9Fcg), with a minimum Drest of Dobs − Fcg. The substitution level K (which includes superimposed substitutions) was calculated with the simple Jukes-Cantor formula K = −3/4 ln(1 − 4/3 Drest). Panels (b) and (d) show the repeats grouped into bins of approximately equal time periods. On average, the substitution level has been 2-fold higher in the mouse than in the human lineage (Table 6), but currently it may differ over 4-fold. Compared to the previous version, the scale on the X-axis in panel (b) is larger, as we estimate in this paper that the substitution level in mouse since the human-mouse speciation is at least 35%.
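The two formulas above can be written out directly. This is a sketch in our own notation; in particular, our reading of the 'minimum Drest' clause as a floor of Dobs − Fcg is an assumption.

```python
from math import log

def non_cpg_divergence(d_obs, f_cg):
    """Divergence excluding CpG->TpG transitions: Drest = Dobs / (1 + 9*Fcg),
    floored at Dobs - Fcg (our reading of the 'minimum Drest' clause)."""
    return max(d_obs / (1 + 9 * f_cg), d_obs - f_cg)

def jukes_cantor(d):
    """Jukes-Cantor substitution level K = -3/4 * ln(1 - 4/3 * d),
    which adds back superimposed substitutions."""
    return -0.75 * log(1 - 4.0 * d / 3.0)
```

For example, a 30% observed divergence corresponds to a Jukes-Cantor substitution level of about 38%, consistent with the 37–38% figure quoted later in this section.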
Also, the time periods in panels (b) and (d) are smaller, assuming a speciation time of 75–80 Mya rather than 100 Mya.

Additional legend, Table 6. Divergence levels of 18 families of IRs that shortly predate the human-mouse speciation. Their copies are found at orthologous sites in mouse and human, while having a relatively low divergence level or representing the youngest members in the evolutionary tree of LINE1 and MaLR elements. Shown are the number of kilobases matched by each subfamily (kb), the median divergence (mismatch) level of all copies from the consensus sequence (div), the interquartile range of these mismatch levels (range), and a Jukes-Cantor estimate of the substitution level to which the median divergence level corresponds (JC). The two right columns contain the ratio of the JC substitution level in mouse over human, and an 'adjusted ratio' of the mouse and human substitution levels after subtraction of the approximate fraction accumulated in the common human-mouse ancestor. Many factors influence these numbers. For example, AT-rich LINE1 copies appear less diverged than the GC-richer MaLR and DNA transposon families of the same age, primarily because GC->AT substitutions are more common than AT->GC substitutions, especially in the AT-rich DNA where most LINE copies reside7. Early rodent-specific L1 and MaLR subfamilies are not yet defined, so their copies were matched to the consensus sequences in the table (note that the youngest L1 subfamily, L1MA6, has a relatively large amount of DNA matched to it). The associated, unduly high mismatch levels (L1 evolves faster than the neutral rate!) will increase the rodent median and the substitution level ratio. On the other hand, inaccuracies in the consensus and unrepresented minor ancient subfamilies contribute equally to the observed mismatches in both species and cause the ratio to be smaller. Three more important factors cause a significant underestimate of the substitution level in mouse compared to human.
First, part of the substitutions in older families accumulated in the common ancestor. The difference in substitution level between the family and the least diverged family in the class estimates this fraction, and it is subtracted before calculating the ratio in the last column of the table. Second, by assuming that all substitutions are equally likely, the Jukes-Cantor formula significantly underestimates the number of superimposed substitutions at higher divergence levels. For example, when considering substitution patterns, 30% mismatches to an average DNA sequence in an average environment correspond to a 41% rather than a 37–38% substitution level. Finally, there undoubtedly is an ascertainment bias for the least diverged copies of a repeat family in mouse. Depending on the length of the match, a 30–35% mismatch level is about the maximum that can be detected by RepeatMasker, so the more diverged copies are not tallied. The above suggests that the ratio of substitution rates in the lineages to human and mouse is at least 2.0-fold.

Higher substitution rate in mouse lineage. Human and chimpanzee separated ~6 Mya and show a 1.25% substitution level9, or 0.21% per My. This is likely to be even lower in the longer-generation-time human branch. Mouse and rat parted ways 10–20 Mya (14 Mya is used by Huchon et al.10) and show a 10–12% substitution level11, giving 0.5–1.2% substitutions per My, or a 2.5–6-fold difference in substitution rate per year between mouse and human.

Mouse Genes

Gene Build. We used 7 gene-building systems: Ensembl12 with additional proteins from the Riken cDNA collection which had ORFs > 100 aa (23,026 genes), Fgenes++ (37,793 genes), the NCBI pipeline, which included Genomescan13 predictions (46,158 genes), Genie14 (18,548 genes), an Ensembl EST-only build (46,646 genes), SGP15 (48,451 genes), Twinscan16 (48,462 genes), and SLAM17 (14,006 genes).
These sets shared a core set of around 20,000 genes which nearly all the methods predicted at least partially, but the methods varied outside this set. After an intensive examination of the results, we rationalised this set by taking the union of the Ensembl and Genie predictions as our starting point, as these methods were the most consistently confirmed by the other methods. We then accepted any transcript predicted by the other methods if more than 80% of the Ensembl or Genie transcript was contained within it, and substituted the longer transcript. The final set of transcripts was clustered by single-linkage clustering, on the basis of one overlapping exon, into "gene sets", some of which will represent more than one recognisable gene. In our hands this procedure was a good balance between coverage and specificity, in particular for the downstream protein analysis. We produced 29,201 transcripts grouped into 22,011 gene sets.

Orthology. We took matched Ensembl builds for human (build 7_29a) and mouse (7_3) which were built from the same set of protein databases. Because of this symmetry, we expect a gene that is conserved between mouse and human to be present in either both builds or neither. We then took the longest transcript in each gene and used reciprocal best hits via BLAST to identify our starting ortholog pairs. Three ortholog pairs in order within 100 kb of each other provided initial syntenic regions, which we then grew out along the chromosomes, allowing minor rearrangements and misplaced genes as long as there was an additional gene pair consistent with the synteny within 100 kb. This gene-level synteny map was in broad agreement with the DNA-level synteny map. Into the resulting synteny map we placed cases with ambiguous best hits, where this was resolved by the synteny, and potential duplications.
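The reciprocal-best-hit step above can be sketched as follows, assuming a precomputed all-vs-all table of BLAST scores (the data structure and function name here are ours, not the consortium pipeline's):

```python
def reciprocal_best_hits(scores):
    """scores: dict mapping (mouse_gene, human_gene) -> BLAST bit score
    (a hypothetical precomputed table). Returns the sorted pairs in which
    each gene is the other's single best hit."""
    best_mh, best_hm = {}, {}  # best hit per mouse gene / per human gene
    for (m, h), s in scores.items():
        if s > best_mh.get(m, ('', -1))[1]:
            best_mh[m] = (h, s)
        if s > best_hm.get(h, ('', -1))[1]:
            best_hm[h] = (m, s)
    return sorted((m, h) for m, (h, _) in best_mh.items()
                  if best_hm[h][0] == m)
```

Pairs whose best hits are not mutual (e.g. a human gene preferred by two mouse genes) fall out of the set and are exactly the ambiguous cases resolved later by synteny.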
Three classes of genes remained at the end of the analysis: apparent local duplicates of genes (5,431), unmatched genes in the mouse genome (1,416), and genes with matches that disagreed with the synteny (2,705). These cases were subjected to a second round of analysis. For the unmatched mouse cases, a second BLAST search against the human genome was performed, and the main supporting evidence for each gene was classified. We discarded genes with a poor (<150 bits) match (this step removes certain classes of weak prediction artefacts) and genes whose only support was from the early Riken cDNA project or which contained retroviral elements. The resulting 141 proteins were then manually examined and showed believable biological signals, such as involvement in olfaction. For the apparent local duplicates we took a sample of 50 genes and classified them as fragmented copies (10%), real genes (47%), local pseudogenes (usually only one or two exons) (36%), or mispredictions/unclassified (17%). For the misplaced genes we also took a sample of 50 genes and examined them by hand; 76% were clear pseudogenes, 16% apparently real genes, 5% dubious gene predictions, and 3% misassembly (in all cases human). We applied these estimates to the complete pools in each set to provide the overall numbers of pseudogenes (4,010), gene prediction artefacts that are nevertheless likely to be real genes (543), gene predictions solely with Riken support (591), other artefacts (838), real duplicates (2,552), and real, apparently non-syntenic genes (432).

Pseudogenes. The mouse pseudogene collection was obtained with a combination of sequence-similarity searches and a scale-up of a methodology that uses the ratio of non-synonymous versus synonymous substitutions (KA/KS) as an indicator of the absence of functionality. All repeat-masked regions between mouse genes were compared using BLASTx18 against a non-redundant protein database comprising EMBL CDS translations + PDB + SwissProt + PIR annotations (NRDB).
The regions with significant sequence similarity (BLAST E value < 0.001) to (non-viral and non-transposon) proteins were aligned to the closest protein match using Genewise19. In order to confirm consistency and to determine parentage, each aligned region was again compared to NRDB and to the complete mouse gene collection. For the remaining regions we inferred a hypothetical 'ancestral' sequence with PAMP (PAML package20), using the cDNAs of the two closest non-identical proteins in NRDB. The codon-based alignment of each region with its 'ancestral' sequence was then used to calculate the associated KA/KS ratios. The functionality test was done by comparing, with a least-squares fitting procedure, the resulting KA/KS distribution with two other KA/KS distributions obtained from reliable functional (2,000 non-redundant full-length Riken cDNAs) and pseudogenic (processed and disrupted intergenic regions) training sets.

Estimation of the mammalian gene count from evidence-based sets. Of 128,847 exons with support from the Riken set, 106,575 were also predicted by the initial gene build, giving an estimated sensitivity of the Ensembl build of 79%. To estimate the specificity, we took the estimated numbers of pseudogenes and mispredictions from the orthology analysis and, using an average of 3 exons per gene, estimate the number of miscalled exons at around 15,000, i.e., a specificity of 92%. We examined other routes to estimate the sensitivity and specificity (e.g., at the gene level rather than the exon level), and all methods suffer from the difficulty of obtaining a sensible estimate of the specificity of predictions. In our hands, exon-level calculations are more robust, as the number of exons per gene varies dramatically across different gene prediction artefacts, making predicted gene numbers a poor estimator of total genes.

RNA Genes. ncRNA annotation was performed essentially as described7. tRNA genes were detected using tRNAscan-SE 1.2321.
Other genes and pseudogenes were detected using WUBLAST 2.0 (15 April 2002 release; W. Gish, unpublished).

Mouse Proteome

Amino acid sequences derived from the Mouse Gene Set were compared with protein and family databases. The comparison with all other known protein sequences used BLASTP18 and the NCBI non-redundant database (ftp://ftp.ncbi.nlm.nih.gov/blast/db/nr). Homology was inferred for all alignments with expect values < 10⁻⁴. Taxonomic groupings were taken from the NCBI taxonomy database (http://www.ncbi.nlm.nih.gov/Taxonomy/). Mouse proteins were also compared with the InterPro classification of domains, motifs, and proteins using InterProScan22. An identical procedure was used for the annotation of human proteins from the Ensembl 4.28.1 set. The assignment of Gene Ontology (GO) terms23 to mouse and human proteins was derived from InterProScan output (ftp://ftp.ebi.ac.uk/pub/databases/interpro/interpro2go). Sequence comparisons of orthologues used Ensembl versions 6.3a (mouse) and 5.28 (human). All accessions cited refer to these versions. For each of the 12,845 Ensembl 1:1 orthologous gene pairs, the mouse and human transcripts with the highest local alignment scores were selected, and their protein sequences were aligned using BLAST-2-Sequences24 and the BLOSUM80 substitution matrix. Their corresponding protein-coding DNA sequences were subsequently aligned in accordance with the protein sequence alignments. Mouse-human orthologue pairs were subdivided into different classes according to predicted subcellular localization25, enzymatic activity, and the presence or absence of InterPro domains. InterPro domains whose predicted loci in mouse and human did not overlap by more than 80% of the aligned residues were ignored. This information was collated within a PostgreSQL (http://www.postgresql.org/) relational database management system ('Panda') devised for this project (L. Goodstadt et al., unpublished).
KA and KS values were estimated using the method of Yang & Nielsen26. KA/KS ratios were only included for orthologue pairs whose KA and KS were both greater than zero. In calculating the percentage amino acid identity between two sequences, the number of amino acid identities was divided by the total number of alignment positions, including positions where one sequence was aligned with a gap. Homologous genes that are clustered in loci within the mouse genome were detected by comparing the protein sequence of each gene with those of the five genes on either side, using BLAST-2-Sequences24 and an expect value threshold of 10⁻⁴. For every gene of each cluster, closely similar human homologues were detected using BLASTP. Subsequently, human orthologues were assigned using the Ensembl orthologue pair set or by the construction of dendrograms using Clustal-W and neighbour-joining methods27.

Genome Evolution: Mutation and Selection

Constructing the Alignments. Ancient interspersed repetitive elements (see Table 6) and tandem repeats were soft-masked in both sequences using RepeatMasker, and lineage-specific repeats (i.e., interspersed repeats believed to have inserted after the human-mouse split) were physically removed. Human sequence from the June 2002 assembly (http://genome.ucsc.edu) was divided into 1.01-Mb pieces starting every megabase in the human assembly, and mouse sequence was divided into 30-Mb segments. The following nucleotide substitution scores, described by Chiaromonte et al.28, were used throughout the alignment process:

        A     C     G     T
A      91  -114   -31  -123
C    -114   100  -125   -31
G     -31  -125   100  -114
T    -123   -31  -114    91

For steps that permit gaps in the alignment, a gap of length k was penalized by subtracting 400 + 30k from the sum of substitution scores.
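The scoring scheme can be encoded directly. This sketch (helper names are our own) scores an ungapped alignment column by column and computes the affine gap charge:

```python
# Chiaromonte et al. substitution scores, with the text's gap penalty 400 + 30k.
SCORES = {
    ('A', 'A'): 91,   ('A', 'C'): -114, ('A', 'G'): -31,  ('A', 'T'): -123,
    ('C', 'A'): -114, ('C', 'C'): 100,  ('C', 'G'): -125, ('C', 'T'): -31,
    ('G', 'A'): -31,  ('G', 'C'): -125, ('G', 'G'): 100,  ('G', 'T'): -114,
    ('T', 'A'): -123, ('T', 'C'): -31,  ('T', 'G'): -114, ('T', 'T'): 91,
}

def gap_penalty(k):
    """Penalty subtracted for a gap of length k."""
    return 400 + 30 * k

def ungapped_score(a, b):
    """Sum of substitution scores over an ungapped alignment of equal length."""
    return sum(SCORES[(x, y)] for x, y in zip(a.upper(), b.upper()))
```

Note that transitions (A/G, C/T) are penalized far less (−31) than transversions, reflecting their higher frequency in neutral divergence.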
Our alignment program, called blastz, uses the familiar three-step strategy of Gapped Blast18: find short near-exact matches, extend each short match without allowing gaps, and extend each ungapped match that exceeds a certain threshold by a dynamic-programming procedure that permits gaps. For short near-exact matches we adapted a very clever idea of Ma et al.2 that looks for runs of 19 consecutive nucleotides in each sequence within which the 12 positions indicated by a '1' in the string 1110100110010101111 are identical. To increase sensitivity, we allow a transition (A-G, G-A, C-T or T-C) in any one of the 12 positions. Each such match was checked to see whether it could be extended in both directions to produce a gap-free alignment scoring at least 3000 when the sum of the substitution scores is multiplied by a factor between 0 and 1 that measures sequence complexity in the aligned region, as described by Chiaromonte et al.28. The aim of this factor is to make it harder for low-complexity regions (such as CACACA...) to match. Ancestral repeats were then unmasked, and the ungapped alignments were extended by a dynamic-programming strategy that permits gaps. These three steps were applied recursively to fill in gaps between adjacent syntenic aligned regions, where the separation between alignments was at most 50 kb in each species, but using criteria that are much more sensitive; e.g., the gap-free alignment that seeds a gapped alignment was required to score only 2200, and the penalty for low-complexity regions was not imposed. Finally, the local alignments were filtered such that in the roughly 2.5% of the human genome where multiple regions in mouse aligned to the same region in human, only the most significant alignment was retained. Source code for our alignment program, blastz, as well as code for removing lineage-specific repeats and adjusting alignments to refer to the original sequence positions, can be downloaded from http://bio.cse.psu.edu/.
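The spaced-seed hit test described above can be sketched as follows (the function is ours; blastz itself finds seed hits by hashing rather than by pairwise window comparison):

```python
SEED = "1110100110010101111"  # 12 '1' (match) positions in a 19-bp window
CARE = [i for i, c in enumerate(SEED) if c == '1']
TRANSITIONS = {('A', 'G'), ('G', 'A'), ('C', 'T'), ('T', 'C')}

def seed_hit(w1, w2):
    """True if two 19-mers agree at all 12 '1' positions, allowing at most
    one transition among them, as described in the text."""
    transitions = 0
    for i in CARE:
        if w1[i] == w2[i]:
            continue
        if (w1[i], w2[i]) in TRANSITIONS:
            transitions += 1
            if transitions > 1:
                return False
        else:
            return False  # any transversion at a '1' position kills the hit
    return True
```

Mismatches at the seven '0' positions are ignored entirely, which is what makes a spaced seed more sensitive than a contiguous 12-mer at the same specificity.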
The code that we used to select one alignment in human regions that aligned to several locations in the mouse genome can be obtained from Jim Kent. The alignment was built on the 1,000-CPU Linux cluster at UCSC. The alignments can be viewed via the genome browser at http://genome.ucsc.edu.

Validating the Alignments. Three independent groups ran whole-genome alignments in June 2002. One group used PatternHunter with two weight-11 spaced seeds2; this ran in 28 CPU days. Another group used the Avid global aligner29 (Bray et al., unpublished data) anchored to nucleotide BLAT30 hits (pipeline.lbl.gov); this approach took 20 CPU days. The third group ran blastz as described above, which took 516 CPU days. We analysed the alignments in detail on human chromosome 20. To help compare the alignments, we took highly conserved, moderately conserved, and lightly conserved subsets of each alignment and compared them with each other. PatternHunter2 and blastz are both local aligners. Blastz alignments were generally a nearly perfect superset of PatternHunter alignments: blastz covered 96%, 96%, and 99.5% of the highly conserved, moderately conserved, and lightly conserved PatternHunter alignments, while conversely PatternHunter covered 89%, 75%, and 26% of the blastz alignments. The Avid/BLAT approach incorporated a global aligner, and its overlap with blastz was less extensive: blastz covered 88%, 85%, and 80% of the Avid alignments at the three levels, while conversely Avid covered 87%, 81%, and 67% of blastz. Note that these results were on the April freeze of the human genome, and all groups have improved their methods since then. Table SI 7 shows an alignment of a highly conserved coding region. The less conserved regions are more difficult. Table SI 8 shows an alignment of a transposon relic that was inserted into the mammalian genome before the mouse/human split.
In making the program sensitive enough to produce alignments such as these, we were concerned that the program might also generate alignments between regions that did not share a common ancestor. As a control, we reversed (without complementing) the mouse genome and repeated the mouse/human alignment. Only 0.16% of the human genome was covered by alignments of the reversed mouse genome, as opposed to 40% of the genome covered by the forward alignments. To get a sense of how much divergence blastz would tolerate while still making accurate alignments, we constructed two synthetic test sets. One test set consists of 1000 base pairs of mouse DNA mutated to various degrees by base substitutions in silico, with no insertions or deletions (indels). We ran blastz on the original DNA vs. the mutated DNA. At base identities of 60% or more, blastz introduced no insertions or deletions into this alignment in our tests. When the base identity is 55%, blastz inserts an indel every 500 bases. When the base identity is 50%, blastz does not produce any alignments in 20 of 20 tests. The second test set consists of the same DNA mutated by various percentages of deletions, but no substitutions or insertions. When the deletion rate reaches 3%, blastz introduces one substitution per 300 bases. At a deletion rate of 4%, blastz introduces one substitution per 200 bases; at a deletion rate of 11%, one substitution per 30 bases. At a deletion rate of 20%, blastz creates alignments in only 8 of 20 tests. Genome-wide, the blastz mouse/human alignments have an average base identity of 70% and an insertion or deletion rate of just under 3%. Overall, blastz alignments appear more likely to diverge from true homology when the indel rate is higher than average than when the substitution rate is higher than average. However, the large majority of the mouse/human alignments occur at a level of sequence divergence where blastz appears to have a very low error rate.
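The first synthetic test set above (substitutions only, no indels) can be built with a few lines of code. This is a sketch under our own assumptions, not the script used for the tests; note that because a substituted base never equals the original, a substitution rate of 40% yields roughly 60% base identity.

```python
# Sketch of building a substitution-only test set and measuring identity.
# Illustrative only; not the script used for the blastz tests.
import random

def substitute(seq, rate, rng):
    """Return seq with each base replaced, with probability `rate`,
    by one of the three other bases (no insertions or deletions)."""
    return "".join(
        rng.choice([x for x in "ACGT" if x != b]) if rng.random() < rate else b
        for b in seq
    )

def identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

rng = random.Random(0)                                 # fixed seed for repeatability
original = "".join(rng.choice("ACGT") for _ in range(1000))
mutated = substitute(original, 0.40, rng)              # target ~60% identity
```

The deletion-only test set would be built analogously, dropping each base with a chosen probability instead of substituting it.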
Estimating Amount of Human Genome Expected to Align to Mouse. A back-of-the-envelope estimate that about 42% of the human genome would be expected to align to mouse is given in the text. Here we present refinements of this calculation. To estimate turnover in the human and mouse genomes since their common ancestor, we assume that the euchromatic genome size of the ancestor was similar to the modern human genome at ~2.9 Gb. There would thus have been no net change in genome size in the human lineage and an ~400 Mb decrease in the mouse lineage. The change in size within aligning regions of the human genome from their ancestral size is small: we estimate that they are 2% smaller in human than they were in the ancestor. This estimate comes from alignments of ancestral repeat sequences to their consensus sequence, which show a net loss of 2% when one subtracts insertions from deletions. The corresponding mouse alignments show a larger net loss of about 5.6%. Other processes that cause the size to change are large-scale insertions, deletions, and duplications. The human genome has added ~700 Mb of lineage-specific repeat sequence since the common ancestor (see the section of the paper on repeats). It has also added new sequence by segmental duplication, which, excluding the lineage-specific insertions within the duplicated segments, adds another 2%, or about 60 Mb. Local tandem duplications are estimated to have added an additional 50 Mb. This gives a total of about 810 Mb of new DNA from these processes. This is offset by only about 20 Mb of shrinkage in the alignable portions inherited from the ancestor (anticipating that very roughly half will be found to be inherited), giving a net increase of 790 Mb. Since there was no net change in human genome size, this increase must have been offset by a roughly comparable amount of deletion. This implies that ~27% (=790/2900) of the ancestral genome was deleted in the human lineage and about 73% retained.
The mouse genome has added at least 820 Mb of lineage-specific repeat sequence, but this is likely to be an underestimate due to the difficulty of identifying older repeats in mouse (see the section of the paper on repeats). We estimate the actual number is at least 10% higher, or about 900 Mb. The amount of segmental and tandem duplication in mouse is unknown, but it is conservatively at least 1%, adding another 25 Mb. The alignable portion of the mouse genome has shrunk by 5.6% as mentioned above, which is about 70 Mb if we assume about half is retained. This gives a net increase of 900 + 25 - 70 = 855 Mb. Assuming the ancestral genome was human-sized, this implies the mouse genome has shrunk by ~400 Mb while gaining 855 Mb. Hence there would have been ~1255 Mb of deletion from the common ancestral genome. This corresponds to a deletion rate of ~43.3% (=1255/2900), or a retention rate of 56.7%. Assuming the deletions fixed in each lineage were random and uncorrelated, the expected proportion of the ancestral genome retained in both lineages would be ~41.4% (=73% x 56.7%). Some of this ancestral genome (about 2%, as above) will be duplicated in human, and both copies are expected to align to mouse. The fraction of the human genome expected to align to mouse is therefore approximately 2% larger, or about 42.2%, similar to the simple back-of-the-envelope estimate given in the text. If instead we assume that the genome of the common ancestor was about the size of the mouse genome, then the mouse large-scale deletion rate is estimated at 34.2% (=855/2500) and the human rate at 15.6% (=(790-400)/2500), giving a proportion of the mouse-sized ancestral genome retained in both lineages of 55.5% (=65.8% x 84.4%). This is 1388 Mb, which with 2% duplications gives an estimate of about 49% of the human genome that would be alignable to mouse.
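The two ancestral-genome scenarios above reduce to simple arithmetic, reproduced here as a sketch (sizes in Mb; variable names are ours):

```python
# Reproduce the retention arithmetic for both ancestral-genome scenarios.
# All figures (790, 1255, 855 Mb, etc.) come from the text above.

def retained_fraction(deleted_mb, ancestor_mb):
    """Fraction of the ancestral genome retained in one lineage."""
    return 1.0 - deleted_mb / ancestor_mb

# Scenario 1: human-sized ancestor (~2,900 Mb).
human_ret = retained_fraction(790, 2900)        # ~0.73
mouse_ret = retained_fraction(1255, 2900)       # ~0.567
both = human_ret * mouse_ret                    # ~0.414, retained in both lineages
aligned_human = both * 1.02                     # ~2% human duplications align twice

# Scenario 2: mouse-sized ancestor (~2,500 Mb).
mouse_ret2 = retained_fraction(855, 2500)       # ~0.658
human_ret2 = retained_fraction(790 - 400, 2500) # ~0.844
both_mb = mouse_ret2 * human_ret2 * 2500        # ~1,388 Mb retained in both
aligned_human2 = both_mb * 1.02 / 2900          # fraction of the human genome
```

Running this gives ~42.2% of the human genome expected to align under the human-sized-ancestor assumption and ~49% under the mouse-sized-ancestor assumption, matching the figures in the text.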
Estimating Deletion. Comparison of ARs to their consensus sequence also allows a detailed estimate of the rate of small insertions and deletions (indels) and a rough estimate of the overall amount of DNA loss. From the consensus alignments it can be estimated that both species show a net loss of nucleotides due to 1-10 bp indels, since deleted bases outnumber inserted bases by at least 2-fold. The net loss is about 1.5-2.0% in human, but as much as 4.5-5.6% in mouse, confirming early observations by Graur et al.31. To get some idea of larger-scale deletions, one can assume for a subset of IRs (e.g., the LTRs of retrovirus-like elements) that each observed fragment derives from an element that was once complete in the genome. This is a fairly crude measure (e.g., completely deleted elements cannot be accounted for), but for 270 LTR elements in human and 150 in mouse there is a decent linear correlation between the estimated fraction lost and the substitution level of each family, giving 1.2% and 1.5% sequence loss per % substitution for human (R2 = 0.39) and mouse (R2 = 0.46), respectively. This suggests that since the human-mouse split, at least 20% of Mesozoic non-functional DNA has been lost in human and over 50% in mouse (Fig SI 6). In a different approach, we examined 8 orthologous genomic regions, representing a total genomic extent of 34.7 Mb in the mouse assembly. All of the regions were chosen from finished areas of the human assembly with clear orthology to the mouse assembly. Orthologous landmarks for each mouse/human region were identified, providing fine-scale correspondence between the regions, much in the same way the global synteny map was constructed. Consecutive orthologous landmarks were used to define small orthologous windows along each region. Specifically, we created 4 sets of windows with varying minimum size cutoffs: 0 kb, 1 kb, 2 kb, and 5 kb.
Each pair of orthologous landmarks spanning a gap-free region in human and mouse defines a window in the "0 kb" (no-cutoff) set. Larger-cutoff windows were defined by grouping consecutive landmarks into regions above the cutoff size in both genomes. Within each window, ancient repeats were identified in both species using the methods described elsewhere in the paper. The identification of repeats as ancestral is done independently for each species and relies on post-processing of RepeatMasker output. We attempted to pair each ancient repeat fragment found in one species with a corresponding fragment in the other species using two different pairing rules: (1) pair all ancient repeat fragments belonging to the same family; and (2) pair only those repeats from the same family that occur in the same orientation with respect to the repeat consensus sequence. The repeat family and relative orientation come from RepeatMasker annotations. As a result, each ancient repeat base is categorized uniquely as "human only", "mouse only", or "common". Pairing occurs only within the small orthologous windows. The overall rate of deletion of ancient repeats in human is estimated as the fraction of total mouse ancient repeats (mouseOnly + common) that is present only in mouse (mouseOnly):

(human deletion rate) = mouseOnly / (common + mouseOnly)

See Table SI 5 for a summary of results. As there are a limited number of repeat family types, and in particular LINE and SINE elements make up a large fraction of the identified ancient repeats, there is a possibility of mispairing repeats, particularly for larger windows. To gauge the extent of this background, or random, pairing, we carried out negative-control experiments in which the human repeat annotations for each window were replaced by the annotations from a different (nonsyntenic) area of the human genome. The anchors and window structure stay fixed, as does the mouse repeat population; only the human repeat pattern changes.
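The estimator above can be sketched as follows. This is illustrative only: pairing is counted per repeat fragment rather than per base for simplicity, and the data structures are ours, not those of the RepeatMasker-based pipeline.

```python
# Sketch of the window-restricted pairing and deletion-rate estimate.
# Counts fragments, not bases, for simplicity; illustrative only.

def deletion_rate(windows):
    """Estimate the human deletion rate from ancient-repeat pairing.

    `windows` is a list of (human_repeats, mouse_repeats) pairs, one per
    orthologous window; each side is a list of (family, orientation)
    tuples. Pairing happens only within a window, using the
    family-and-orientation rule (rule 2 in the text).
    """
    mouse_only = common = 0
    for human_reps, mouse_reps in windows:
        unpaired_human = list(human_reps)
        for rep in mouse_reps:
            if rep in unpaired_human:     # same family and orientation
                unpaired_human.remove(rep)
                common += 1
            else:
                mouse_only += 1           # no human partner: deleted in human
    return mouse_only / (common + mouse_only)
```

A mouse repeat with no human partner in its window is taken as evidence that the human copy was deleted, which is exactly the mouseOnly / (common + mouseOnly) ratio in the formula above.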
In the absence of any random pairing, the human deletion rate estimated from these control regions would be 100%. Instead we find estimated deletion rates of ~70% for the family-only pairing and ~75% for the family-and-orientation pairing. This clearly indicates that the pairing observed in the orthologous regions is not attributable to statistical fluctuation. Collecting 4D Sites. We collected a set of codons on the June 2002 assembly of the human genome, defined by BLAT30 alignments of 9,562 RefSeq cDNAs to the genome sequence, that passed certain quality checks at the genome level. In particular, we checked that (1) the human CDS began with a start codon on the genome, ended with a stop codon, and had no in-frame stop codons; (2) the human introns were GT/AG, GC/AG, or AT/AC; and (3) there was blastz-aligned mouse sequence at the genomic location identified by BLAT that had no in-frame stop codons except in the last 20 codons of the human gene. For every gene that passed our quality checks, we extracted the aligned human and mouse bases from the third position of any 4-fold degenerate codon where the first two bases were conserved between human and mouse, i.e., at sites marked 'x' in the codons GCx (ALA), CCx (PRO), TCx (SER), ACx (THR), CGx (ARG), GGx (GLY), CTx (LEU), and GTx (VAL). This formed our collection of 4D sites. Regional Estimates of Conservation Levels. To define the conservation score S for a given small window (50 bp or 100 bp), we need to calculate the fraction of bases that are identical between human and mouse at ancestral repeat (AR) sites around the given window. We do this by locating 6,000 aligned AR sites around the window; i.e., rather than using a fixed-size region in the human genome around the window, we use a variable-sized region with a fixed number of sites in it. This helps to control the sample variance in the calculation. The average size of the surrounding region ends up being roughly 100 kb.
The number 6,000 was chosen to minimize the empirical variance in the scores for all 100 bp windows in ancestral repeats genome-wide. Obtaining Smooth Densities and Mixture Decomposition of Conservation Score. We employ Gaussian kernel smoothers to produce the smooth density functions Sneutral and Sgenome, using scores from 50 bp windows with at least 15 aligned bases (only windows in ancestral repeats for Sneutral, and all such windows for Sgenome). To decompose Sgenome into a mixture p0*Sneutral + (1-p0)*Sselected, we determine the score at which the minimum of Sgenome/Sneutral is achieved, and set p0 to this minimum value. Measuring GC Content. To define the GC content of a 5-Mb window in the human genome, we used only the sites in the window that were aligned to mouse, and counted the fraction that were G or C. To define the difference between the human and mouse GC contents in the window, we performed the same calculation on the aligned mouse sites and subtracted the result. As an alternative to this method, we also tried using all human bases in the window and all mouse bases in a comparably sized region around the aligned mouse bases. The results were qualitatively similar (data not shown). Measuring Recombination Rate. The deCODE genetic map markers32 were mapped to the June assembly of the human genome by BLAT30. Placements with monotonically increasing genetic distance across the chromosome were found for 4,930 of 5,136 markers. Each base in the genome between two consecutive markers was assigned a recombination rate by linearly interpolating the genetic distances of these markers. The recombination rate of a 5 Mb region was calculated as the average of the recombination rates of the bases in the region. Visual comparison with the plots published in ref. 32 confirmed that this method gave results similar to the spline method used in that study.
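The marker-interpolation step for recombination rate can be sketched as follows. This assumes marker base positions and cumulative genetic distances are already mapped and monotonic; because the interpolation is piecewise linear, averaging the per-base rate over a region telescopes to the total genetic distance spanned by the region divided by its length.

```python
# Sketch of per-region recombination rate from a genetic map.
# Assumes monotonic marker positions (bp) and cumulative distances (cM).
import numpy as np

def mean_recomb_rate(start, end, marker_pos, marker_cm):
    """Average recombination rate (cM/Mb) over [start, end).

    Linearly interpolating genetic distance between flanking markers and
    averaging the per-base rate reduces, by telescoping, to the genetic
    distance across the region divided by its physical length.
    """
    cm_start, cm_end = np.interp([start, end], marker_pos, marker_cm)
    return (cm_end - cm_start) / (end - start) * 1e6   # cM per Mb
```

For a 5 Mb region this gives the same value as explicitly assigning each base an interpolated rate and averaging, at a fraction of the cost.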
Genetic Variation Among Strains WGS paired-end reads from 4-kb plasmids were generated from three strains chosen based on recommendations from the Mouse Liaison Group. 119,232, 68,160 and 38,400 attempted reads were produced from 129S1/SvImJ, C3H/HeJ and BALB/cByJ, respectively. Vector- and quality-trimmed reads were assessed for 20-base windows having an average quality score below Phred 20. The longest block of sequence between any two such windows, if >250 bp, was passed on to SNP discovery. SNP detection was performed by SSAHA-SNP33 using 500 bp windows and a maximum of 10 SNPs per window. The numbers of reads passing our quality thresholds, SSAHA-SNP, and post-SSAHA filtering for repetitive sequence were 67,974, 34,949, and 19,686 for strains 129S1/SvImJ, C3H/HeJ and BALB/cByJ, respectively. References 1. Mural, R. J. et al. A comparison of whole-genome shotgun-derived mouse chromosome 16 and the human genome. Science 296, 1661-1671 (2002). 2. Ma, B., Tromp, J. & Li, M. PatternHunter: faster and more sensitive homology search. Bioinformatics 18, 440-445 (2002). 3. Hannenhalli, S. & Pevzner, P. in Proceedings of the 36th IEEE Symposium on Foundations of Computer Science 581-592 (IEEE, Milwaukee, WI, 1995). 4. Tesler, G. Efficient algorithms for multichromosomal genome rearrangements. J Comput Syst Sci (in press) (2002). 5. Tesler, G. GRIMM: genome rearrangements web server. Bioinformatics 18, 492-493 (2002). 6. Bourque, G. & Pevzner, P. A. Genome-scale evolution: reconstructing gene orders in the ancestral species. Genome Res 12, 26-36 (2002). 7. IHGSC. Initial sequencing and analysis of the human genome. Nature 409, 860-921 (2001). 8. Gardiner-Garden, M. & Frommer, M. CpG islands in vertebrate genomes. J Mol Biol 196, 261-282 (1987). 9. Ebersberger, I., Metzler, D., Schwarz, C. & Paabo, S. Genomewide comparison of DNA sequences between humans and chimpanzees. Am J Hum Genet 70, 1490-1497 (2002). 10. Huchon, D. et al. Rodent phylogeny and a timescale for the evolution of Glires: evidence from an extensive taxon sampling using three nuclear genes. Mol Biol Evol 19, 1053-1065 (2002). 11. Bulmer, M., Wolfe, K. H. & Sharp, P. M. Synonymous nucleotide substitution rates in mammalian genes: implications for the molecular clock and the relationship of mammalian orders. Proc Natl Acad Sci USA 88, 5974-5978 (1991). 12. Hubbard, T. et al. The Ensembl genome database project. Nucleic Acids Res 30, 38-41 (2002). 13. Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J Mol Biol 268, 78-94 (1997). 14. Kulp, D., Haussler, D., Reese, M. G. & Eeckman, F. H. Integrating database homology in a probabilistic gene structure model. Pac Symp Biocomput, 232-244 (1997). 15. Guigó, R. et al. Comparison of mouse and human genomes yields more than 1000 additional mammalian genes. Submitted companion (2002). 16. Flicek, P., Keibler, E., Hu, P., Korf, I. & Brent, M. Comparative gene prediction in mouse and human: from whole-genome shotgun reads to global synteny map. Genome Res, submitted companion (2002). 17. Pachter, L., Alexandersson, M. & Cawley, S. Applications of generalized pair hidden Markov models to alignment and gene finding problems. J Comput Biol 9, 389-399 (2002). 18. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389-3402 (1997). 19. Birney, E. & Durbin, R. Using GeneWise in the Drosophila annotation experiment. Genome Res 10, 547-548 (2000). 20. Yang, Z. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 13, 555-556 (1997). 21. Lowe, T. M. & Eddy, S. R. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25, 955-964 (1997). 22. Zdobnov, E. M. & Apweiler, R. InterProScan — an integration platform for the signature-recognition methods in InterPro. Bioinformatics 17, 847-848 (2001). 23. The Gene Ontology Consortium. Creating the gene ontology resource: design and implementation. Genome Res 11, 1425-1433 (2001). 24. Tatusova, T. A. & Madden, T. L. BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol Lett 174, 247-250 (1999). 25. Mott, R., Schultz, J., Bork, P. & Ponting, C. P. Predicting protein cellular localization using a domain projection method. Genome Res 12, 1168-1174 (2002). 26. Yang, Z. & Nielsen, R. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol Biol Evol 17, 32-43 (2000). 27. Thompson, J. D., Higgins, D. G. & Gibson, T. J. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22, 4673-4680 (1994). 28. Chiaromonte, F., Yap, V. B. & Miller, W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput, 115-126 (2002). 29. Couronne, O. et al. Strategies and tools for whole-genome alignments. Genome Res, submitted companion (2002). 30. Kent, W. J. BLAT — the BLAST-like alignment tool. Genome Res 12, 656-664 (2002). 31. Graur, D., Shuali, Y. & Li, W. H. Deletions in processed pseudogenes accumulate faster in rodents than in humans. J Mol Evol 28, 279-285 (1989). 32. Kong, A. et al. A high-resolution recombination map of the human genome. Nat Genet 31, 241-247 (2002). 33. Ning, Z., Cox, A. J. & Mullikin, J. C. SSAHA: a fast search method for large DNA databases. Genome Res 11, 1725-1729 (2001).