Supplementary Information

Generation of the chromosome sequences

Both chromosomes were sequenced using a clone-by-clone shotgun sequencing strategy1 supported by the BAC-based whole genome physical map2. The quality of the chromosome 2 and 4 sequences was determined to exceed the 99.99% accuracy standard established by the International Human Genome Sequencing Consortium (IHGSC) for sequencing the human genome3. In addition to the centromeric gaps, there are 17 gaps in the chromosome 2 sequence and 12 gaps in the chromosome 4 sequence. Sequences extend into the centromere and reach the p-arm telomeric sequence on both chromosomes. On 2q, the sequence reaches within 200 kb of the telomere; on 4q, it reaches within 10 kb of the telomere4. Based on the size estimates of remaining gaps, the available sequence represents greater than 99.6% of the total euchromatic sequence (Supplementary Table 1). Attempts were made to close all remaining gaps, first using probes derived from BAC-end sequences from clones flanking each gap and from available chimpanzee sequence (PanGSC, in preparation) against all available libraries (100-fold clone coverage of the genome, Supplementary Methods below), and second using sequence placement of fosmid paired end reads (Consortium, 2004). From the fosmid end placements, 76 clones were selected, which added just over 500 kb of sequence (now included in build35/hg17). Remaining gaps were sized using FISH and using comparative placement to mouse, rat and chimp. In only one case was additional chimpanzee sequence detectable that fell in the human gap region. Some of the remaining gaps are associated with segmental duplications (60%), local complex repetitive structures (20%), local increases in G+C content of 2 to 10% (50%), and locally high regions of G+C content (35%, with 10% >=55% G+C). The rapid local increases in GC content flanking gaps are consistent with the notion that GC-rich regions are difficult to clone5.
The integrity of the sequence and its assembly was tested using a variety of methods. First, as each clone was finished, an in silico digest of the sequence was compared for verification purposes to restriction digests of the clone DNA. Second, we checked the fully assembled sequence by performing in silico digests of clone-sized fragments across the chromosome against the underlying fingerprint data used to construct the physical map. In this way, we directly confirmed greater than 99.99% of the testable bands. Third, while not providing conclusive evidence of a “problem” and many times only identifying polymorphism, we examined BACs, fosmid and plasmid6 paired end sequences to assess consistency. We searched for two or more ends clustered and missing, or spanning inappropriate distances. Based on BAC placements, eight suspicious areas were identified, but fosmid and plasmid data through those regions were supportive of the genome sequence. For the 13 areas with inconsistent fosmid pairing data, we sequenced each of them and only two suggested modification of the genome sequence. Five were conclusively determined to be polymorphic (Supplementary Table 9). Finally, we compared the order of the placement of BAC ends against the chromosome 2 and 4 sequences to the order of the BACs within the fingerprint map2. No inconsistencies were found. Orthology to mouse The relationship between the human chromosome 2 and 4 sequences and the mouse genome could be readily defined for ~94% and ~97% of the sequenced chromosomes, respectively. The largest defined segment (44.1 Mb) on chromosome 2 includes two distal-less homeobox containing genes (DLX1 and DLX2) and the HOXD gene cluster. The largest defined segment (21.7 Mb) on chromosome 4 includes a replication-independent member of the histone 2A family, H2AFZ. In mice, this particular histone functions in embryonic development and lack of H2AFZ leads to embryonic lethality. 
Supplementary mRNA information

As described in the text, we examined discrepancies between our genomic sequence and known mRNAs that created frameshifts and/or truncations in the genomic sequence. In cases where there was no support for the mRNA sequence variant, we attempted to construct a gene that avoided the sequence discrepancy. In cases where the site was determined to be an error in our BAC sequence (updates to our BAC sequence have also been submitted) or to be polymorphic, and where we could not create a protein without the base change, the amino acids for the following GenBank7 gi identifiers were added to the chromosome gene index, even though the BAC-based sequence did not support those translations. The following mRNAs were not included in the gene index for the reasons indicated: 19923278 (missing nucleotide causes frameshift, confirmed BAC error), 21314676 (missing nucleotide causes frameshift, confirmed BAC error), 23510369 (extra nucleotide causes frameshift, BAC sequence supported by PCR), 24850108 (two missing nucleotides cause frameshift, confirmed BAC error), 27735096 (extra nucleotide causes frameshift, repetitive region could not be sampled by PCR for verification), 4557854 (several missing nucleotides cause frameshift, BAC sequence supported by PCR screening), 6006032 (several missing nucleotides cause frameshift, BAC sequence supported by PCR screening), 8394043 (several missing nucleotides cause frameshift, BAC sequence supported by PCR screening), 11496888 (frameshift causes internal stop, ESTs suggest it is a genomic sequencing error), 23468353 (frameshift, confirmed BAC error), and 16876981 (frameshift, confirmed BAC error). One mRNA, gi17572816, spans from chr4 to chr4_random. In the hg17/build35/May, 2004 release of the chromosomal sequence, this problem has been fixed. One mRNA, gi21314625, was found to be incomplete (80 bases missing in the middle of the gene, a confirmed deletion in the BAC sequence; the missing bases are found in an overlapping BAC).
This has been corrected in the May, 2004, release. Two additional mRNAs (gi9507164 and gi5031798), as well as the TWIST2 gene (gi|17981707|ref|NM_057179.1|, gi|21620050|gb|BC033168.1|, gi|17389790|gb|BC017907.1|), which lies in the gap between NT_005120 and NT_022173, spanned a sequence gap with some exons falling in the gap and were not included in the gene set. For gi9507164, although the gap still exists in the May, 2004, release, we have extended sequence into that gap and can now account for all exonic sequence. Based on placements of mRNAs against chromosome 2 and 4, only one possible deletion was detected, and that one was on chromosome 4. The mRNA, BC029568, appears to indicate a deletion in AC018687. It was found completely in AC018692, a finished clone that was not used in build34. Basepairs 1-176 of that mRNA are not in AC018687. The data in AC018692 is currently represented (in build35) in chr4_random. In the next release of the human genome, we will update the main chromosome 4 sequence to include the data from AC018692. Additionally, alignment with the mRNA NM_005277 indicates a 156 and a 236 basepair deletion in the clone AC093819.3. These were verified by alignment with an overlapping BAC clone and have been correctly updated in GenBank. As described in the main text, but here provided in greater detail, we also investigated 161 mRNA regions where the matched genomic sequence contained substitution or insertion/deletion differences from the mRNA, 83 of which caused a frameshift and/or truncation of the protein product. To determine the origin of the difference, we re-sequenced the region of interest in a panel of 24 diverse individuals8, in the starting BAC and in some cases in overlapping BACs. A total of 78 substitution/triplet insertion discrepancies were examined (only 2 of the 78 were triplet insertions; one was polymorphic and in the other case all individuals agreed with the BAC).
In eight cases, primers could not be chosen because the sequence was too repetitive. In eight cases, all genomic samples agreed with the BAC, suggesting an error in the mRNA or a highly rare polymorphism. In two cases, all individuals matched the mRNA, but the sequence of the control DNA matched the BAC and other underlying BACs could not be sampled. Only one was determined to be an error in the BAC. A total of 44 were found to be polymorphic in the population of 24 individuals. An additional 15 were found not to be polymorphic within the panel of 24 individuals; of those, 11 were polymorphic within the RP11 library, and in four cases all overlapping RP11 clones sampled agreed with the BAC sequence. Of the 83 possible frameshifts, two were too repetitive to obtain data. Eight were found to be errors in the genomic sequence. In 69 of the 83 frameshift cases, the sequence in the 24 genomic samples agreed with the sequence of the BAC, again suggesting errors in the underlying mRNA. Of those, 54 were simple single base insertion or deletion errors in the mRNA that shifted or truncated the reading frame in the cDNA relative to the genomic sequence. The other 15 had multiple base insertion/deletion differences such that the frame was eventually restored. In such cases, comparison with the related mouse sequence confirmed the genomic translation, which matched mouse more closely than did the original mRNA translation. Most interestingly, four were determined to be polymorphic in the population (Table 1; Supplementary Table 5). The first polymorphic example is found on chromosome 2 in the protein PMS1 (postmeiotic segregation increase 1), identified by its homology to a yeast protein involved in DNA mismatch repair. Some cases of hereditary nonpolyposis colorectal cancer are associated with mutations in this gene.
This particular single base deletion event in the BAC with respect to the mRNA causes a change of frame (4 amino acids before the stop codon) and extension of the open reading frame by an additional 15 amino acids. Another PMS1 mRNA (BC036376) avoids the genomic region of this frameshift with an alternative splicing event 71 basepairs upstream. Similarities to predicted mouse proteins end at amino acid 149, and this frameshift discrepancy is at amino acid 163. The polymorphic frameshift in the final exon of the PAS domain containing serine/threonine kinase (PASK/NM_015148) leads to a longer open reading frame. The amino acids extending past the stop indicated by NM_015148, however, do not match the orthologous predicted gene in the mouse genome. The third polymorphism was identified in ancient ubiquitous protein 1 (AUP1/NM_012103), one of three isoforms and referred to as the longest isoform at 476 amino acids. The frameshift modifies the protein beginning at position 147 and then truncates the protein with a stop codon at amino acid 149. The similarity to mouse corresponds with an alternate isoform and does not extend through this region (ends at aa 113). A fourth polymorphism, an insertion of four basepairs in the genomic sequence relative to the mRNA, was identified on chromosome 4 in TSARG2 (testis and spermatogenesis cell related protein 2) and causes truncation of the protein at amino acid 281 out of 305 (the frameshift occurs at amino acid position 279). Two underlying mRNAs agree with the genomic data and read through the region of this 4 basepair deletion without a frameshift, confirming the polymorphism. The similarity to the related mouse protein (ENSMUST00000033917) extends through the region of the frameshift (to amino acid 287) indicated by the mRNA sequence.
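The distinction drawn above between triplet insertions, frameshifting insertions/deletions, and multiple insertions/deletions that eventually restore the frame can be illustrated with a minimal sketch. The function names and the signed-length convention (positive for insertions, negative for deletions, relative to the mRNA) are assumptions made for this illustration and are not taken from the analysis pipeline.

```python
from typing import Sequence


def net_frame_offset(indel_lengths: Sequence[int]) -> int:
    """Net reading-frame offset (0, 1 or 2) after a series of signed indels."""
    return sum(indel_lengths) % 3


def classify_indels(indel_lengths: Sequence[int]) -> str:
    """Classify the combined effect of the indels found in one coding region."""
    if not indel_lengths:
        return "no indel"
    if all(n % 3 == 0 for n in indel_lengths):
        # e.g. a triplet insertion: amino acids are added or removed, frame is kept
        return "in-frame"
    if net_frame_offset(indel_lengths) == 0:
        # individual events shift the frame, but the frame is eventually restored
        return "frame restored"
    return "frameshift"


if __name__ == "__main__":
    print(classify_indels([3]))      # triplet insertion
    print(classify_indels([4]))      # four-basepair insertion
    print(classify_indels([-1]))     # single-base deletion
    print(classify_indels([1, -1]))  # compensating events
```

A classification of this kind only describes the reading frame; whether a given event is a BAC error, an mRNA error or a polymorphism still requires the resequencing described above.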
In addition to the frameshifts detected by comparisons of the mRNA with the genomic DNA, we also detected possible frameshifts by comparing all genomic coding regions to the random genomic shotgun data available through the HAPMAP project6. We detected 28 additional potential high quality frameshifts in coding regions of REFSEQ9 or MGC10 mRNAs on chromosome 2, only 6 of which were annotated in dbSNP11. On chromosome 4, we detected 15 possible frameshifts in coding regions of REFSEQ mRNAs, only one of which was annotated in dbSNP. In all cases but one, the frameshift led to truncation of the protein. This suggests an additional four possible polymorphic frameshifts or BAC errors where the chimp sequence agreed with the mRNA, each of which will require further validation. Supplemental non-coding RNA information The lower than expected density of ncRNA genes on these two chromosomes may reflect the fact that many ncRNA genes are strongly clustered, such as the large array of 18 cysteine tRNAs at 7q36 (Hillier et al., 2003), the tandem array of 5-25 U2 snRNAs at 17q21 (Pavelitz et al., 1999), or the complex array of ~30 U1 snRNA genes and tRNAs at 1p36 (Lindgren et al. 1985). Chromosomes 2 and 4 appear to lack any large clusters of this sort. There are seven small clusters of 2-3 genes each within <100 kb of each other: two tRNA pairs, one pair of miRNAs, and four sets of snoRNAs in the introns of coding genes (2 in EF-1-beta, 3 in nucleolin, 2 in APG16L, and 2 in ribosomal protein S3a). Supplemental Segmental Duplication Analysis Segmental duplications have long been noted for their potential role in the evolution of new genes12. To examine the transcriptional and coding potential of duplicated regions, we analyzed a non-overlapping set of known genes with or without additional spliced EST. For each group, we categorized every exon as unique or duplicated on the basis of its overlap with duplicated sequence (Supplementary Table 10). 
About 5.49% (1162 out of 21164) of exons of chromosome 2 are duplicated, while only 2.47% (277 out of 11213) of exons of chromosome 4 are duplicated. Our analysis shows that the relative number of transcribed exons is slightly greater for duplicated DNA when compared with non-duplicated DNA on chromosome 2 and 4. These results support a previous observation13 that recently duplicated regions are rich in genes/transcripts.

Supplementary Methods

Physical Mapping

The human libraries listed here were probed for filling gaps in the chromosome 2 and 4 maps. In addition, we have probed against filters from the RP43 chimpanzee library.

Library                                              Estimated Coverage
RPCI-4 (RPCI Male PAC)                               4x
RPCI-5 (RPCI Male PAC)                               6x
RPCI-11 Segments 1-5 (Male BAC)                      32.2x14,15
RPCI-13 Segments 1-4 (Female BAC)                    21.8x
Genome Systems BAC library 1                         10x
Cal Tech Segments A, B                               13x16
Cal Tech D                                           17x16
Broad/MIT fosmid library (H_AA)                      15x
MRC Geneservice single chromosome cosmids/fosmids:
  LL02NC02 (chromosome 2)                            4x
  LL02NC03 (chromosome 2)                            2.6x
  LA04NC01 (chromosome 4)                            3x

RP libraries: http://www.chori.org/bacpac/
Cal Tech libraries: http://informa.bio.caltech.edu/idx_www_tree.html
Genome Systems libraries: http://www.genomesystems.com/
MRC libraries: http://www.hgmp.mrc.ac.uk/geneservice/reagents/products/descriptions

Sequence and Assembly Validation

Each chromosome sequence was segmented into 150-kb fragments, each with 40% overlap with the previous simulated clone. An in silico HindII digest (ISD) of these simulated clones was performed; bands less than 600 bp were removed from consideration to be consistent with the fingerprint data. Fingerprint data from the human genome physical mapping effort2 were converted from mobilities to sizes so that clones from the fingerprinting effort could then be directly compared to the chromosome fragments. The ISD of each 150-kb simulated clone (2,662 clones for chromosome 2; 2,094 for chromosome 4) was then compared to the overlapping clones in the fingerprint map.
We required that each band in the simulated clone be confirmed by no more than two of the overlapping fingerprints. A band was considered confirmed if its size in the simulated clone differed by less than 3% from the size of a band in the fingerprints under consideration. For bands between 600 and 700 bp (N=1828 bands in chromosome 2 are in this range; N=1438 bands in chromosome 4) and for bands greater than 11 kb (N=2119 chromosome 2; N=1424 chromosome 4), we allowed the discrepancy to be <4% given the slightly higher variability of the fingerprint gels in those ranges17. However, only 405 and 226 additional bands on chromosome 2 and 4, respectively, were confirmed using the 4% threshold rather than the 3% threshold. An ISD of the full chromosome 2 sequence revealed 71,331 total bands, of which 58,741 were greater than 600 bp. For chromosome 4, there were 58,729 total bands and 48,002 bands greater than 600 bp. Using the above method on chromosome 2, only 6 bands from 5 clones (2 of the clones were at contig ends) were not confirmed. For chromosome 4, there were only 6 unmatched bands (from 12 clones, 7 of which were at contig ends). In this way, 99.99% of the testable fingerprint bands were confirmed. Additionally, the order of the in silico digest placements in the FPC database was consistent with the assembly position, with the only minor discrepancies found adjacent to contig ends where sequence had been used to resolve the final assembly ordering.

Tiling Path Verification, Overlap Polymorphisms and Assaying mRNA/genomic discrepancies

To evaluate clone overlaps where the rate of difference between overlapping clone sequences was higher than 1 in 1000 bases, a PCR product encompassing the differences was sequenced from each BAC in the overlap region and from each of a panel of 24 ethnically diverse genomic DNA samples8.
If the 24 samples showed allelic variation, the overlap was judged to be correct, but if the 24 samples yielded persistent heterozygosity, the sequence was determined to be derived from a repeated sequence, with sequence differences between the copies. To investigate discrepancies between mRNA and genomic sequence, PCR products were generated and resequenced, and the polymorphic bases were examined in the DNA from 24 individuals, the original BAC and in some cases other BACs. Initially, we tested 21 and 32 overlaps for chromosome 2 and 4, respectively, that contained at least three differences per kb. There were 3,547 differences in 955 kb or 3.7 differences per kb on average. There were 8,785 differences in 2,186 kb or 4.0 differences per kb on average. The highest rate of apparent variation identified was 5.6 per kb on chromosome 2 and 6.3 per kb on chromosome 4. We further evaluated the overlap regions of 2,718 human clones (from chromosomes 2, 4 and 7) by aligning each overlap using cross_match (P. Green, unpublished) with the following parameters: -minmatch 50 -minscore 2000 -discrep_lists -tags -penalty -1. We counted the number of phred18 high quality (base quality value >19) substitution, insertion and deletion events (where each insertion or deletion event, regardless of the number of basepairs inserted or deleted, was counted as one polymorphic event). We counted events in 5 kb non-overlapping windows. Of those 2,718 overlaps (8,309 windows), 678 had at least 5 kb of overlap and more than 3 “polymorphic events” in at least one of the 5 kb windows. The greatest number of events per window was 136. We identified 24 different regions where there were at least 3 consecutive windows with more polymorphic events than two standard deviations from the mean. When considering substitution + insertion + deletion events, the mean number of polymorphic events per 5 kb window was 5.23 (s.d. 6.44). There were 19 overlaps with at least 3 consecutive windows of greater than 2 standard deviations (>18).
When considering only substitution events, the mean number of events per 5 kb window was 4.20 (s.d. 5.57). There were 21 overlaps with at least 3 consecutive windows greater than 2 standard deviations (>15). When considering only insertion + deletion events, the mean number of events per 5 kb window was 1.03 (s.d. 1.48). There were 4 overlaps with at least 3 consecutive windows of greater than 2 standard deviations. We examined these regions for features that may indicate balancing selection. Of the 24, 15 were in the region of a gene.

Sequencing in other primates

We attempted to amplify across each exon in each gene of interest from the following primates: Homo sapiens, Celebes crested macaque, Sumatran orangutan, Gorilla gorilla, black-handed spider monkey (New World monkey), and chimpanzee (Pan troglodytes). Primers were chosen in highly conserved human/chimp intronic regions directly flanking the exons. Multiple primers were chosen to increase the possibility of getting a successful product. For primates where we were not able to amplify a given exon, new primers were chosen using conservation data from the sequences of other monkeys where amplification had been successful. For mRNAs with no similarity to mouse (even after translating mouse genomic sequence in all 6 frames) or to anything in the non-redundant protein database, we resequenced six: a) NM_022492, exon 4 of 9, b) BC021739, exon 2 of 5, c) NM_153031, exon 1 of a 2 exon gene, d) NM_152997, exon 3 of a 5 exon gene, e) NM_178497, exon 2 of a 2 exon gene, and f) NM_005750, exon 1 of a 2 exon gene. As described in the text, only NM_153031 did not have an open reading frame in all of the sequenced exons. For our detailed examination of NM_153031, we sequenced exon 1, the only coding exon (501 bp of coding nucleotides) of a 2 exon gene (NM_153031/gi 23308528/hypothetical protein FLJ32063). The chimp PCR gave slightly larger products than the other primates.
Only chimp had an open reading frame throughout the exon and was 99% identical at the nucleotide level to the mRNA. In both orangutan (97% identical) and gorilla (87% identical), there is an early frameshift (and several other frameshifts later in the exon). In monkey (92% identical), a frameshift occurred about halfway through the coding region, followed by other frameshifts. If the frameshifts are artificially removed, the percent identities at the amino acid level are chimp 99%, gorilla 77%, monkey 83% and orangutan 95%. Again, if frameshifts are artificially removed, the Ka/Ks data suggest these genes are evolving at varying rates (human/chimp=0.342; human/gorilla=0.881; human/monkey=0.563; human/orangutan=0.562). Based on these data, this protein appears to be functional only in human and chimp (if it is functional at all) and not in the more distant primates.

Sequence-Map Comparisons

Locations of STS markers are determined using a combination of three methods. First, all available complete marker sequences are aligned using BLAT19 with parameter -ooc=11.ooc. These are further filtered to include only the best alignments with at least 60% coverage. Second, fasta sequences created using primer sequence information are aligned using BLAT with parameters -tileSize=10 -ooc=10.ooc -minMatch=1 -minScore=1 -minIdentity=75. These are then filtered based on the number of mismatches and deviance from the reported product size. For cases with no mismatches, the size is allowed to deviate up to 200 bases. Similarly, combinations of one mismatch/150 bases, two mismatches/50 bases, and three mismatches/25 bases are used. Third, e-PCR20 is run using primer information with parameters N=1 M=50 W=5, where N is the number of allowed mismatches, M is the allowed deviance from the reported product size, and W is the word size. The results of the three methods are combined with preference being given to full sequence alignments.
That is, in cases where full sequence alignments are found, these are the only placements reported; otherwise primer-based locations are reported. In build34, placements were located for 406 of 407 deCODE21 markers on chr2, and 299 of 302 markers on chr4. The missing markers are D2S2355 (chr2 37.04), D4S2969 (chr4 66.38), D4S2627 (chr4 91.86), and D4S1635 (chr4 80.96). The deCODE markers provide a reasonable test of the long range consistency of the construction of the chromosomal sequence in that they are distributed fairly evenly across the chromosome. On chromosome 2, 406 markers are distributed over 243,387,553 bp (density = 1 per 599,477 bp), evenly distributed except for the centromere (approx. 8.9e+07 to 9.7e+07) with 0 markers and the region between 4.7e+07 and 5.6e+07, which has only 4 markers. For chromosome 4, 299 markers are distributed over 191,713,478 bp (density = 1 per 641,182 bp), evenly distributed except for 4.5e+07 to 5.2e+07, which has 4 markers, and 1.25e+08 to 1.35e+08, which has 4 markers. There is only one case where the positions of the deCODE markers relative to the sequence suggest an inversion: [AFM112YD4 genetic position (chr2, 258.19, 0) physical position (chr2, 241488777) / AFM356TE5 genetic position (chr2, 256.32, 0) physical position (chr2, 241560233)]. As described in the main text, <1% of the genetic markers could not be found on the chromosome 2/4 sequence. We categorized those 78 markers having sequence that could not be localized. For example, 43 were largely repetitive, making them difficult to localize reliably by computational methods. Additionally, 23 were localized to chromosomes other than chromosome 2 or 4. A total of six fell in regions where we currently have remaining gaps in the chromosome sequence. Finally, six fall in regions where we do not have any corresponding sequence gaps, suggesting polymorphisms, unknown gaps in coverage, or clerical errors during genetic mapping.
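The primer-based filtering and the preference rule described under Sequence-Map Comparisons can be sketched as follows. This is a minimal illustration, not the code used for the analysis: the function names are hypothetical, the size limits are treated as inclusive (an assumption, since the text says only "up to"), and placements with more than three mismatches are assumed to be rejected.

```python
from typing import Dict, List

# Allowed deviation (in bases) from the reported product size,
# keyed by the number of primer mismatches, as listed in the text.
MAX_SIZE_DEVIATION: Dict[int, int] = {0: 200, 1: 150, 2: 50, 3: 25}


def accept_primer_placement(mismatches: int, reported_size: int, observed_size: int) -> bool:
    """Return True if a primer-based placement passes the mismatch/size filter."""
    limit = MAX_SIZE_DEVIATION.get(mismatches)
    if limit is None:
        # Assumption: more than three mismatches is never accepted.
        return False
    return abs(observed_size - reported_size) <= limit


def report_placements(full_sequence_hits: List[str], primer_hits: List[str]) -> List[str]:
    """Prefer full-sequence alignments; fall back to primer-based locations."""
    return list(full_sequence_hits) if full_sequence_hits else list(primer_hits)


if __name__ == "__main__":
    print(accept_primer_placement(0, 250, 430))  # deviation of 180 with no mismatch
    print(accept_primer_placement(2, 250, 320))  # deviation of 70 with two mismatches
    print(report_placements([], ["chr2:37040000"]))
```

The placement strings in the usage lines are made-up labels for illustration only.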
GC content GC content was measured in 20kb windows across the chromosomes. As described in the paper, but provided with more detail here, the GC content of chromosomes 2 (40.2%) and 4 (38.2% overall, with the q-arm 37.4%) is lower than that of the genome as a whole (41%). On chromosome 2, the region of lowest GC content is 40 kb of 29.9% containing no known genes. The regions of highest GC content are near the q-telomere on both chromosomes. Adjacent to sequence gaps on chromosome 2 and 4, we find rapid increases in GC content. On chromosome 4, 19% of the chromosome has a GC content of less than 35% as compared to 9% in the genome as a whole, yet one of the highest GC content windows in the genome at 72.7% is also located on chromosome 4. This 20 kb region is only 25% repetitive and contains the double homeobox 4 gene (DUX4), which may play a role in facioscapulohumeral muscular dystrophy22. Also related to the overall low GC content, 2.3% of chromosome 4 has a GC content of >=50%, while 7.4% of the genome has a GC content of >=50%. CpG Islands CpG islands23 were predicted by searching the sequence one base at a time, scoring each dinucleotide (+17 for CG and -1 for others) and identifying maximally scoring segments. Each segment was then evaluated to determine GC content (>=50%), length (>200bp), and ratio of observed proportion of CG dinucleotides to the expected proportion on the basis of the GC content of the segment (>0.6), using a modification of a program developed by G. Miklem (personal communication). A CpG island was defined to overlap a predicted gene if it fell within 5 kb upstream or 1 kb downstream of the predicted first exon. A 200kb window size was used on repeat-masked DNA for this analysis. Analysis of the repeat-masked chromosome 2 and 4 sequences revealed 1,662 (7/Mb) and 1,004 (5.4/Mb, the lowest density of any human chromosome) CpG islands with an average length of ~800 bp. 
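The dinucleotide scoring and the three acceptance criteria described above can be illustrated with a short sketch. It is not the modified program referred to in the text: the function names are hypothetical, the expected CG proportion is assumed here to be (GC fraction / 2) squared, and only the single best-scoring segment is located rather than all maximally scoring segments.

```python
from typing import Tuple


def best_scoring_segment(seq: str, cg_score: int = 17, other_score: int = -1) -> Tuple[int, int, int]:
    """Return (score, start, end) of the best-scoring run of dinucleotides.

    Each dinucleotide scores +17 if it is CG and -1 otherwise; `end` is exclusive.
    """
    seq = seq.upper()
    best = (0, 0, 0)
    run_score, run_start = 0, 0
    for i in range(len(seq) - 1):
        if run_score <= 0:
            # A non-positive running score cannot help; start a new run here.
            run_score, run_start = 0, i
        run_score += cg_score if seq[i:i + 2] == "CG" else other_score
        if run_score > best[0]:
            best = (run_score, run_start, i + 2)
    return best


def cpg_stats(seq: str) -> Tuple[float, float]:
    """Return (GC fraction, observed/expected CG ratio) for a segment of length >= 2."""
    seq = seq.upper()
    n = len(seq)
    gc = (seq.count("G") + seq.count("C")) / n
    cg = sum(1 for i in range(n - 1) if seq[i:i + 2] == "CG")
    observed = cg / (n - 1)
    expected = (gc / 2) ** 2  # assumption: C and G equally frequent within the segment
    return gc, (observed / expected if expected > 0 else 0.0)


def is_cpg_island(seq: str) -> bool:
    """Apply the thresholds from the text: length >200 bp, GC >=50%, obs/exp >0.6."""
    if len(seq) <= 200:
        return False
    gc, ratio = cpg_stats(seq)
    return gc >= 0.5 and ratio > 0.6


if __name__ == "__main__":
    print(best_scoring_segment("AAAA" + "CG" * 5 + "TTTT"))
    print(is_cpg_island("CG" * 150))
```

Other definitions of the expected CG proportion exist (for example, one based on the separate C and G counts), and the choice changes which borderline segments pass the 0.6 threshold.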
Of the known mRNAs localized to chromosomes 2 and 4 (see below), the 5’ ends of 69.5% and 66.5%, respectively, were within 5 kb upstream to 1 kb downstream of a CpG island. For the full gene sets presented here, the number with overlapping CpG islands was 67% (chromosome 2) and 64% (chromosome 4), higher than the reported figure of 60% for the genome as a whole24.

Identifying Known Genes

We built a non-redundant set of human mRNAs using the human division of REFSEQ25 and the human division of the Mammalian Full Length cDNA collection26 (October, 2003). We removed redundancy between the two gene sets, retaining the largest mRNA in cases where an identical shorter mRNA was detected. Each mRNA was searched against a soft-masked (see repeat masking methods) chromosome 2 and 4 sequence using BLAT19 and the following parameters: -q=rna -trimHardA -fine. Those mRNAs matching with >=30% coverage were re-aligned using SPIDEY27, BLAT19, EXALIN (Zhang M, Yeh RT, Gish W, manuscript submitted) and EST_GENOME28 to their matching chromosomal region(s) with an additional 100,000 basepairs added to the start and end of the genomic sequence. Parameters used for each application were as follows (strand assigned by first BLAT round where needed):

SPIDEY: -S <strand> -d T
BLAT: -q=rna -trimHardA -fine -ooc=11.ooc
EXALIN: -p HumanSPM.par -s 2 -i <strand>
EST_GENOME: -match 5 -mismatch 11 -gap_penalty 11 -intron_penalty 130 -splice_penalty 50

We sorted the output from each program to find the best alignments based on these criteria: (a) coverage/identity of the mRNA, (b) coverage/identity of the coding sequence, and (c) percentage of canonical splice sites (GT/AG, CT/AC or GC/AG). In addition to the criteria used to determine candidate genes on chr 2/4, each mRNA was searched against the entire human genome (build34) to find matches with >=60% coverage. Those mRNAs were then aligned using SPIDEY to each of these regions to verify that the best hit for each mRNA truly was to chromosome 2 or 4.
Further, all matches giving a single exon hit on chromosome 2/4 and having a multi-exon hit on another chromosome were classified as potential pseudogenes (chromosome 2: 193 REFSEQ + 292 MGC, chromosome 4: 157 REFSEQ + 188 MGC). All mRNAs which aligned with >97% mRNA coverage and identity, 100% protein coverage/identity and 100% of canonical splice sites were considered “good” genes and required no further manual review (chromosome 2: 714 REFSEQ + 484 MGC, chromosome 4: 430 REFSEQ + 257 MGC). The remaining mRNA alignments were subjected to full manual review (chromosome 2: 534 REFSEQ + 313 MGC, chromosome 4: 252 REFSEQ + 135 MGC). All discrepancies were manually evaluated both in the genomic and mRNA sequences. To facilitate the manual review process, EAnnot (see detailed methods description below) provided information about internal stops, frameshifts and mismatches of the gene models with the translation of the underlying mRNA. Also, based on other mRNA and EST information from GenBank, EAnnot reported the position and possible source of differences: polymorphisms, genomic sequence errors and underlying mRNA errors. If there were ESTs or mRNAs matching both the underlying mRNA and the genomic sequence, the difference was assumed to be due to a polymorphism. If the genomic sequence was different from the underlying mRNA but there were other mRNAs or ESTs that supported the genomic sequence, the difference was attributed to an error in the underlying mRNA sequence or a possible polymorphism. If there were no mRNAs or ESTs to support the genomic sequence, or if the source of the difference could not be determined with confidence, the problematic region was sent for screening in a panel of 24 individuals to determine the source of the discrepancy. Gene models were manually adjusted based on genomic sequence data.
These changes included extending or truncating translation by finding small 5' exons and using alternative stop sites based on genomic sequence data, adjusting untranslated regions and correcting mis-identified pseudogenes. If needed, splice sites were adjusted to keep translation in frame, and regions around non-canonical splice sites were checked for the presence of alternative canonical splice sites that would not change the CDS. As a part of the checking process, the annotated CDS was translated and searched against the translated mRNAs for verification. After creation of the final non-redundant set of mRNAs, a summary of the performance of each of the alignment tools was created. Each genomic region (extended by 2 kb in each direction) was given to the tools along with the mRNA sequence identical to that region. In 75% of the cases on chromosome 2 and 81% of the cases on chromosome 4, the SPIDEY gene prediction was identical to the final mapping of the mRNA. Further, 84% of the exons on chromosome 2 and 83% of the exons on chromosome 4 were predicted exactly. For EST_GENOME, 79% on chromosome 2 and 82% on chromosome 4 of the full genes, and 95% of the exons on both chromosomes 2 and 4, were predicted identically to the final mRNA. For EXALIN, 83% on chromosome 2 and 88% on chromosome 4 of the full genes, and 91% on chromosome 2 and 92% on chromosome 4 of the exons, were predicted identically to the final mRNA. For BLAT, 92% on chromosome 2 and 94% on chromosome 4 of the exons were predicted identically to the final mRNA. Additionally, with BLAT, 94% of the mRNAs on chromosome 2 and 95% on chromosome 4 were aligned identically over greater than 99% of their length. Only genes with identities to known mRNAs have been entered into the GenBank annotation for each clone. Genes in other classes are available on our web site but were not submitted to the public databases. To create a non-redundant set of known genes, first, we removed identical predictions.
Next, we manually examined each locus for any remaining redundant forms that were not removed automatically. When two forms had the same CDS but differed in the length of their UTRs, we kept the longest one. All known genes with multiple variants were verified by their entries in LocusLink.

Known genes without similarity to mouse

A set of 102 genes predicted from the known mRNAs, 65% of which were single-exon genes, did not have any similarity to the mouse protein database. Of these genes, 66 were found to have some similarity to mouse or rat when instead searching the entire translated genomic sequences. Only three of these, however, were convincing similarities, and many of the 66 may well only be conserved UTR. In the end, 36 had no match at all to mouse or rat sequences and are described in the main text.

Identifying Frameshifts in Known Genes using HAPMAP

HAPMAP6 reads were aligned to the human genome (hg16/build34) using SSAHA29. Only best-in-genome alignments with greater than or equal to 97% identity were retained. We identified those high quality insertion/deletion SNPs (quality >=25 and average quality >=30) that overlapped CDS exons and manually reviewed the alignments. We detected 27 potential frameshifts in coding regions of REFSEQ mRNAs on human chromosome 2, only 6 of which were annotated in dbSNP. On chromosome 4, we detected 15 possible frameshifts in coding regions of REFSEQ mRNAs, only one of which was annotated in dbSNP. Only in the case of FLJ38379 did the frameshift result in a “longer” open reading frame. In all other cases, the frameshift resulted in a truncated protein.

Alternative splicing evaluation and polyA signal identification

A tool, EAnnot30, has been developed to incorporate and cluster vertebrate mRNA and EST data as well as polyA signal identification to predict gene structures, identify polymorphism data, and integrate available data to annotate the resulting genes.
For this analysis, EAnnot was used for polyA identification and quantification of alternative splicing events.

EAnnot gene boundary determination. EAnnot begins with correcting mapping gaps, repetitive mappings, misidentification of splice sites, and other errors. EAnnot then gives the strand assignments to spliced mRNAs and ESTs based on splicing consensus provided by SplicePredictor31 (http://ftp.zmdb.iastate.edu). EAnnot also utilizes information from dbEST32 regarding the clonal origin of the EST that is critical for the positive identification of ESTs linked to the same clone (clone linking). Gene boundaries are further determined by sequence overlaps, strands, and clone linking. The number of gene boundaries reflects the number of genes. Gene prediction proceeds within each gene boundary.

EAnnot sequence clustering. EAnnot sorts mRNAs into mRNA clusters based on splicing patterns. EAnnot then compares ESTs with mRNA clusters to identify ESTs with novel splicing patterns (alternatively spliced ESTs). These alternatively spliced ESTs are further grouped into EST clusters. Non-redundant mRNA clusters and novel EST clusters are used to generate gene models. For newly identified gene boundaries (loci) without mRNA information, all ESTs are clustered.

PolyA identification. All ESTs and mRNAs (spliced and non-spliced) are examined for the presence of a stretch of As at the 3' end of the sequence or a stretch of Ts at the 5' end of the sequence. The presence of at least 8 As or Ts among the last or first 10 bases of the sequence is required for further analysis. If at least 8 As or Ts are found, EAnnot will continue to count As immediately before the last 10 bps or Ts immediately after the first 10 bps until a different nucleotide is found. EAnnot requires at least 5 As or Ts at either end of the sequence not mapped to genomic sequence.
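As an illustration only, the tail-counting rule described so far (at least 8 As among the last 10 bases, then extension through the preceding run of As, with the mirrored rule for Ts at the 5' end) could be sketched in Python as follows. This is not EAnnot code: the function names are hypothetical, and counting only the matching bases inside the 10-base window is an assumption about how the tail is tallied.

```python
def polya_tail_length(seq, window=10, min_hits=8):
    """Count As in a candidate 3' polyA tail; return 0 if the window rule fails."""
    seq = seq.upper()
    tail = seq[-window:]
    if len(tail) < window or tail.count("A") < min_hits:
        return 0
    count = tail.count("A")  # assumption: only As inside the window are counted
    i = len(seq) - window - 1
    # Keep counting As immediately before the window until another base appears.
    while i >= 0 and seq[i] == "A":
        count += 1
        i -= 1
    return count


def polyt_head_length(seq, window=10, min_hits=8):
    """Mirror of polya_tail_length for a stretch of Ts at the 5' end."""
    seq = seq.upper()
    head = seq[:window]
    if len(head) < window or head.count("T") < min_hits:
        return 0
    count = head.count("T")
    i = window
    # Keep counting Ts immediately after the window until another base appears.
    while i < len(seq) and seq[i] == "T":
        count += 1
        i += 1
    return count
```

A sequence that fails the 8-of-10 test returns 0 and would be dropped before the later checks on unmapped bases and polyA signals.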
In addition, the difference between the number of As or Ts at either end of the sequence and the number of As or Ts not mapped to the genomic sequence should not exceed 8. If all conditions listed above are satisfied, polyA signals (AATAAA or ATTAAA) will be searched within 50 bps upstream from the polyadenylation site. If the polyA sites and the polyA signals are found, they are assigned in pairs.

EST comparisons

All human expressed sequence tags (ESTs)32 in GenBank (July 15, 2003) were searched against NCBI build34 of the human genome using BLAT. A total of 4,643,853 human ESTs met the following criteria: EST length >50 bp, fewer than 50 alignments in the genome, <=5% substitutions, <=4% insertions, >=80% of the EST aligned. Of those 4.6 million ESTs, 262,813 (6% of the ESTs) found their best similarity to chromosome 2 and 139,791 (3% of the ESTs) found their best match to chromosome 4. The ESTs were associated with their overlapping known or predicted gene but were not integrated with mRNA data or with other gene predictions to produce new gene structures because of the ambiguity inherent in the data, i.e. it is impossible to be certain which transcript/isoform a given EST is associated with. However, the ESTs were clustered with the mRNA data to provide overall estimates for alternative splicing (see above).

Homology-based prediction of predicted protein coding genes and pseudogenes

GENEWISE33

Pseudogenes and predicted genes were identified using similarity to known proteins following the protocol originally developed for the annotation of mammalian pseudogenes5,34,35. In brief, candidate (pseudo)genic regions were first identified in chromosomes 2 and 4 by comparing all DNA fragments between common repeats with a non-redundant protein database (RefSeq + SwissProt36 + trEMBL36) using BLASTx29 (Evalue cutoff = 0.001).
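To make the first step concrete, the extraction of "DNA fragments between common repeats" could look roughly like the Python sketch below. It assumes that repeats have already been hard-masked as runs of N (for example by a repeat masker), which the text does not state; the function name and the `min_len` parameter are hypothetical.

```python
import re


def inter_repeat_fragments(masked_seq, min_len=1):
    """Yield (start, end, fragment) for each maximal run free of N.

    Coordinates are 0-based and end-exclusive. Assumes repeats were
    hard-masked as N before calling (an assumption, not from the source).
    """
    for match in re.finditer(r"[^Nn]+", masked_seq):
        if match.end() - match.start() >= min_len:
            yield match.start(), match.end(), match.group()
```

Each fragment returned this way would then be the query for the translated protein search, with its coordinates kept so that matches can be placed back on the chromosome.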
In order to find and delimit regions corresponding to independent genes or pseudogenes, the intervals containing all DNA regions in the same chromosome that had common matching known proteins were reanalyzed separately with BLAST2GENE37. Regions with repetitive, viral, coiled coil and LCR fragments were also filtered out of the candidate set. Finally, to obtain the corresponding (pseudo)gene structure and information about in-frame truncations present in pseudogenes, each of these regions was further compared against the closest protein sequence using GENEWISE33 (Score > 20). This procedure led us to the identification of an initial set of 2945 chromosome 2 and 1738 chromosome 4 candidate (pseudo)genes. In parallel, we also applied the same similarity search through the remaining human and the complete mouse genomes, which led to the identification of 43,247 (including chromosomes 2 and 4) and 43,715 (pseudo)genes for each of the organisms respectively. Fragmentation was removed from the GENEWISE set using manual inspection of cases where multiple neighboring GENEWISE predictions had similarities to multiple regions of the same protein. As human-mouse orthology relations are expected to be detectable only for functional, and not for neutrally evolving regions, we have used this basic criterion to differentiate between genes and pseudogenes. For this analysis, to identify the orthologous regions of mouse (Mm3) and human (hg16/build34) and thus the mouse and human (pseudo)gene sets, we have applied a two-step approach, based on sequence comparison and evaluation of genomic position. First, we have compared with BLASTP the translations of all human and mouse predicted regions (stop codons and frameshifts were circumvented for this comparison) and extracted up to 23,000 best reciprocal mouse-human hits that we take as candidate orthologous pairs.
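The idea of a best reciprocal hit can be illustrated with a minimal Python sketch. It assumes the BLASTP results have already been reduced to a score per (query, subject) pair, with higher scores being better; the data layout and function names are hypothetical and are not taken from the pipeline described here.

```python
def best_hits(scores):
    """Map each query to its best-scoring subject.

    scores: dict of (query, subject) -> score, higher is better (assumed).
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(human_vs_mouse, mouse_vs_human):
    """Return sorted (human, mouse) pairs that are each other's best hit."""
    h2m = best_hits(human_vs_mouse)
    m2h = best_hits(mouse_vs_human)
    return sorted((h, m) for h, m in h2m.items() if m2h.get(m) == h)
```

Pairs produced this way are only candidates; whether a pair is accepted as orthologous depends on further filtering.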
We accepted as reliable ortholog pairs only best reciprocal hits that had other neighboring best reciprocal hits in both organisms. With this filter we excluded those pairs detected where recently formed processed pseudogenes (almost identical to their parental gene) were likely to be involved (the majority of these pairs were formed by a single exon in one organism and an intron-containing prediction in the other). Having fixed the accepted pairs of orthologs, we compared with BLASTP the unassigned human predictions in between with the corresponding unassigned mouse predictions in the orthologous region. Ka/Ks analysis was performed using the PAML package38. We inferred the ancestral sequence of each of the target sequences (A) with the PAMP program using a protein-based DNA multiple alignment (CLUSTALW39). As Ka/Ks values normally cannot be directly used for the distinction of genes and pseudogenes, in order to evaluate the pattern of evolution and hence the functionality of a given candidate sequence set, we need to compare its distribution of Ka/Ks values with benchmark distributions obtained from reliable functional and pseudogenic regions. The associated error (calculated to be 3%) comes from an internal cross-validation performed with benchmark distributions and reflects its discriminative power (see refs 5,34).

TWINSCAN40-42

The TWINSCAN (http://www.genes.cs.wustl.edu) program is designed to analyze eukaryotic genomic sequences by simultaneously modeling both gene structure and evolutionary conservation. Here, genes were predicted for human chromosomes 2 and 4 using the entire mouse genome sequence41 as the informant using TWINSCAN version 1.2 available from the authors' web site. The resulting set for chromosome 2 was composed of 1925 predictions with an average of 8.85 exons per gene. The chromosome 4 set consisted of 1172 predictions with 7.65 exons per gene. These sets were then filtered for inclusion in the final gene set.
Initially, each TWINSCAN prediction was compared to a mouse protein database using WU-BLASTP and the following parameters:

M=BLOSUM80 Q=12 R=4 topcomboN=1 -sort_by_totalscore -sort_by_subjectlength E=0.01 hspmax=0 B=1e9 V=1e9 Y=1e7 Z=1e7 restest

The mouse protein database was composed of mouse ENSEMBL predictions43 (6 May 2003, v. 18.30.1), and TWINSCAN predictions for all of mouse (v1.2 against human build 34/hg16 available from the authors' web site). Similarities within a P-value exponent range of 9 from the most significant similarity were examined to find an unprocessed (multi-exon) mouse prediction (preference given to those similarities in the orthologous region of mouse as defined by the mouse/human orthology maps available from http://www.genome.ucsc.edu; on human chromosomes 2 and 4 there are 48 and 32 identifiable segments, respectively, sharing the same order of highly conserved sequences in the two species). This information was also used to indicate processed pseudogenes when single exon gene predictions found similarities with multi-exon gene predictions in mouse, and the aligned region in the human prediction spanned an intron in the multi-exon mouse gene prediction. Because of differing fragmentation in the human and mouse predicted gene sets, the aligned region of the mouse protein (rather than the entire protein) was then compared back to a human protein database also using WU-BLASTP and the parameters listed above for the human to mouse comparison. In summary, we reduced the number of false positives and pseudogenes by requiring that the predicted genes have a highly significant match in the mouse gene set in the orthologous region of mouse where assigned, and in turn that the matching mouse gene have as its best match the original chromosome 2 or 4 predicted gene. In detail, the human protein database was composed of the human ENSEMBL predictions (NCBI build34/hg16, 4 Nov 2003 v 18.34.1) and human chromosome 2 and 4 known and predicted genes.
Each similarity within a p-value exponent of 9 of the best hit was examined to look for a protein which overlapped the original human chromosome 2/4 gene prediction. If one was found, the prediction was defined as reciprocal or reciprocal best. If there was not a similarity within a p-value exponent of 9 of the best similarity, but there was a similarity at >e-49 (or more significant than 1e-29 if the best hit was less than 1e-49), that original human chromosome 2/4 prediction was defined as a "reciprocal friend". The categories of reciprocal best and reciprocal friend were further subdivided into those where the aligned regions contained an intron versus those which did not. While these “reciprocal friends” have some likelihood of being unprocessed pseudogenes, information derived from the search of these proteins against the rest of the human genome and Ka/Ks information was used to estimate the likelihood that each of these is a pseudogene. Comparisons of the predicted proteins to mouse protein databases were also used to identify processed pseudogenes (single exon genes whose similarity spanned an intron in a multi-exon human gene) and to identify possible duplication events and gene families. Single exon genes with similarities to multi-exon mouse genes, where the region of similarity spanned an intron in the mouse gene, were identified as possible pseudogenes. Each predicted gene was also compared to a database of human proteins to aid in identifying processed pseudogenes and to identify the closest relative of each protein to aid in identification of unprocessed pseudogenes. Therefore, single exon genes with matches to multi-exon genes in either the human or mouse genomes, predictions showing signs of nonfunctionality, and those that produced L1/reverse transcriptase were also removed from the set.
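The "reciprocal best" test described above can be sketched in Python under some simplifying assumptions: each back-search hit is reduced to an integer P-value exponent (for example -50 for P = 1e-50) plus the human chromosomal span of the matched protein, all hits lie on the same chromosome as the original prediction, and coordinates are half-open. The `Hit` type and function names are hypothetical, and the "reciprocal friend" thresholds are deliberately left out.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    name: str
    p_exponent: int  # e.g. -50 for P = 1e-50 (lower is more significant)
    start: int       # span of the matched human protein on the chromosome
    end: int


def hits_near_best(hits, window=9):
    """Keep hits whose P-value exponent is within `window` of the best hit."""
    if not hits:
        return []
    best = min(h.p_exponent for h in hits)
    return [h for h in hits if h.p_exponent - best <= window]


def is_reciprocal_best(hits, pred_start, pred_end, window=9):
    """True if any near-best hit overlaps the original prediction's span."""
    return any(
        h.start < pred_end and pred_start < h.end
        for h in hits_near_best(hits, window)
    )
```

Using a window of exponents rather than the single top hit tolerates the differing fragmentation of the human and mouse prediction sets mentioned earlier.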
After categorization of each of the human chromosome 2/4 predictions, those predictions were filtered to form a non-redundant set in a method similar to that used for human chromosome 7. First, all known genes were considered as absolute; only identical versions of known genes and identical subsets of known genes were removed. From this base, all predictions which overlapped known genes were removed. If there were no known genes in a region, a hierarchical method for removing predictions was used. Any GENEWISE predicted protein coding gene which did not overlap a known gene was kept as the next set of “reliable” predictions. Finally, any “reciprocal best” TWINSCAN prediction was kept that did not overlap either the known gene or GENEWISE sets.

Annotation of RANBP2 region

We are particularly interested in defining the evolutionary mechanism of the RanBP2 gene cluster, located in the chromosome 2 region (87050000-113380000). We manually reviewed annotation of the genes surrounding each of the cluster forms. In doing this, we defined two main regions of interest: 1- chr2:87050000-88235000 (region1), 2- chr2:106600000-113380000 (region2). At this point, we set up a procedure to proceed with manual annotation:
1) take the known gene set for region1/region2. In case of overlapping matches, take the best in coverage, with preference to SwissProt;
2) Blast2seq against region1/region2 using the protein sequence;
3) Manually verify each match with E<0.001 and annotate in region1/region2 only the reliable matches;
4) map ESTs for each predicted gene.
Of course, applying this procedure we obtained matches involving genes but also some involving fragments. All of these were checked for overlap with the pseudogene sets and/or merged with the existing gene set.
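For reference, the hierarchical construction of the non-redundant gene set described at the start of this section (known genes first, then GENEWISE predictions, then "reciprocal best" TWINSCAN predictions) can be sketched as follows. This is a simplified Python illustration: genes are reduced to hypothetical `(chrom, start, end, label)` tuples with half-open coordinates, strand is ignored, and the removal of identical versions and subsets of known genes is not modeled.

```python
def overlaps(a, b):
    """True if two (chrom, start, end, label) features share any bases."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def hierarchical_set(known, genewise, twinscan_reciprocal_best):
    """Combine three prediction tiers, higher tiers taking precedence."""
    kept = list(known)  # tier 1: known genes are always kept
    # Tier 2: GENEWISE predictions that do not overlap a known gene.
    kept.extend(g for g in genewise if not any(overlaps(g, k) for k in known))
    reference = list(kept)
    # Tier 3: reciprocal-best TWINSCAN predictions that overlap neither tier.
    kept.extend(
        t for t in twinscan_reciprocal_best
        if not any(overlaps(t, r) for r in reference)
    )
    return kept
```

Under these assumptions a lower-tier prediction is discarded as soon as it shares a single base with a higher-tier gene on the same chromosome.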
Protein Similarity Searching

After creation of the chromosome 2 and 4 gene sets, the predicted amino acid sequences were each compared to the August 22, 2004, NCBI non-redundant protein database using WUBLAST and the following parameters:

topcomboN=1 -sort_by_totalscore -sort_by_subjectlength E=1e-6 hspmax=0 B=1e9 V=1e9

At this threshold (E=1e-6), 98% of the final chromosome 2 and 4 gene sets found some similarity in the non-redundant protein database.

Comparative Data

Protein sets for Tetraodon nigroviridis44 (assembly version 7; annotation release 25.2.2, Feb 9, 2004; Genoscope data available via http://www.genoscope.cns.fr/externe/tetraodon/Ressource.html), Fugu rubripes45,46 (Fugu build 2c; annotation release 25.2c.1 from July 5, 2004; http://fugu.hgmp.mrc.ac.uk), and Danio rerio (sequencing data were generated by the Zebrafish Sequencing Group at the Sanger Centre and data are available from http://www.sanger.ac.uk/Projects/D_rerio ; assembly release Zv4; annotation release 25.4.1 from Feb 9, 2004) were downloaded from http://www.ensembl.org. The Ciona intestinalis47 protein set (release 1.0 April 2002) was downloaded from http://genome.jbgi-psf.org/ciona/ . Each human chr 2/4 protein was compared to the relevant databases using BLASTP with the following settings (topcomboN=1 E=1e-04 hspmax=0 -filter=seg+xnu B=1e9 V=1e9).

Interpro

Interpro48 (http://www.ebi.ac.uk/interpro) combines sequence-pattern information from available databases and integrates motif and domain assignments into a hierarchical classification system. Comparisons were made with Interpro 8.0 using Interproscan49 v3.3. In the Interpro analyses, 994 (74%) of the chromosome 2 proteins and 584 (73%) of the chromosome 4 proteins can be classified at a threshold of E=0.01. Of those, 662 on chromosome 2 (67%) and 395 on chromosome 4 (68%) are multi-domain. The Pfam analyses for Table 2 involved only those matches meeting the threshold of E=0.01.
Segmental Duplication Analysis

We used a BLAST-based detection scheme53 to identify all pairwise similarities representing duplicated regions (≥1 kb and ≥90% identity) within the finished sequence of chromosome 2 and chromosome 4 and compared to all other chromosomes in the NCBI genome assembly (build 34). Highly identical duplications were confirmed by a second detection strategy which assays for excess random-read coverage across the genome13. While there was, in general, very good correspondence between these two methods, one 200 kb region (chr4:191166956-191387833) was identified by whole-genome shotgun sequence detection that could not be confirmed by whole-genome analysis of the human genome (build34). The program Parasight (J. Bailey, unpublished) was used to generate images of pairwise alignments. We also analyzed pairwise alignments for percent identity and the number of aligned bases. Satellite repeats were detected using RepeatMasker (version: 2002/05/15) on slow settings (A Smit and P Green, unpublished).

References

1. The International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860-921 (2001).
2. The International Human Genome Mapping Consortium. A physical map of the human genome. Nature 409, 934-941 (2001).
3. Felsenfeld, A., Peterson, J., Schloss, J. & Guyer, M. Assessing the quality of the DNA sequence from the Human Genome Project. Genome Res. 9, 1-4 (1999).
4. Riethman, H. C. et al. Integration of telomere sequences with the draft human genome sequence. Nature 409, 948-951 (2001).
5. Hillier, L. W. et al. The DNA sequence of human chromosome 7. Nature 424, 157-64 (2003).
6. The International HapMap Consortium. The International HapMap Project. Nature 426, 789-796 (2003).
7. Benson, D. A., et al. Genbank: update. Nucleic Acids Res. 32, D23-6 (2004).
8. Sachidanandam, R., et al. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms.
Nature 409, 928-933 (2001).
9. Pruitt, K. D., Tatusova, T. & Maglott, D. R. NCBI Reference Sequence project: update and current status. Nucleic Acids Res 31, 34-7 (2003).
10. Strausberg, R. L. et al. Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences. Proc Natl Acad Sci U S A 99, 16899-903 (2002).
11. Wheeler, D. L. et al. Database resources of the National Center for Biotechnology Information: update. Nucleic Acids Res 32, D35-40 (2004).
12. Ohno, S., Wolf, U. & Atkin, N. B. Evolution from fish to mammals by gene duplication. Hereditas 59, 169-87 (1968).
13. Bailey, J. A. et al. Human-specific duplication and mosaic transcripts: the recent paralogous structure of chromosome 22. Am J Hum Genet 70, 83-100 (2002).
14. Osoegawa, K., et al. A Bacterial Artificial Chromosome library for sequencing the complete human genome. Genome Res. 11, 483-496 (2001).
15. Osoegawa, K., et al. An improved approach for construction of Bacterial Artificial Chromosome libraries. Genomics 52, 1-8 (1998).
16. Kim, U. J., et al. Construction and characterization of a human bacterial artificial chromosome library. Genomics 34, 213-218 (1996).
17. Marra, M. A., et al. A map for sequence analysis of the Arabidopsis thaliana genome. Nature Genet. 22, 265-270 (1999).
18. Ewing, B., Hillier, L., Wendl, M. C. & Green, P. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res 8, 175-185 (1998).
19. Kent, W. J. et al. The human genome browser at UCSC. Genome Res 12, 996-1006 (2002).
20. Schuler, G. D. Sequence mapping by electronic PCR. Genome Res 7, 541-550 (1997).
21. Kong, A. et al. A high-resolution recombination map of the human genome. Nat Genet 31, 241-247 (2002).
22. Gabriels, J. et al. Nucleotide sequence of the partially deleted D4Z4 locus in a patient with FSHD identifies a putative gene within each 3.3 kb element. Gene 236, 232-235 (1999).
23. Gardiner-Garden, M. & Frommer, M. CpG islands in vertebrate genomes.
J. Mol. Biol. 196, 261-282 (1987).
24. Antequera, F. & Bird, A. Number of CpG islands and genes in human and mouse. Proc. Natl. Acad. Sci. USA 90, 11995-11999 (1993).
25. Pruitt, K. D. & Maglott, D. R. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 29, 137-140 (2001).
26. Mammalian Gene Collection Program Team. Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences. Proc. Natl. Acad. Sci. USA 99, 16899-16903 (2002).
27. Wheelan, S. J., Church, D. M. & Ostell, J. M. Spidey: a tool for mRNA-to-genomic alignments. Genome Res 11, 1952-1957 (2001).
28. Mott, R. EST_GENOME: a program to align spliced DNA sequences to unspliced genomic DNA. Comput. Appl. Biosci. 13, 477-478 (1997).
29. Ning, Z., Cox, A. J. & Mullikin, J. C. SSAHA: a fast search method for large DNA databases. Genome Res 11, 1725-1729 (2001).
30. Ding, L. et al. EAnnot: a genome annotation tool using experimental evidence. Genome Res 14, 2503-2509 (2004).
31. Usuka, J., Zhu, W. & Brendel, V. Optimal spliced alignment of homologous cDNA to a genomic DNA template. Bioinformatics 16, 203-211 (2000).
32. Boguski, M. S., Lowe, T. M. & Tolstoshev, C. M. dbEST--database for "expressed sequence tags". Nature Genet. 4, 332-333 (1993).
33. Birney, E. & Durbin, R. Using GeneWise in the Drosophila annotation experiment. Genome Res. 10, 547-548 (2000).
34. Torrents, D., Suyama, M., Zdobnov, E. & Bork, P. A genome-wide survey of human pseudogenes. Genome Res 13, 2559-67 (2003).
35. RGSP-Consortium. Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428, 493-521 (2004).
36. O'Donovan, C., et al. High-quality protein knowledge resource: SWISS-PROT and TrEMBL. Brief Bioinform. 3, 275-284 (2002).
37. Suyama, M., Torrents, D. & Bork, P. BLAST2GENE: a comprehensive conversion of BLAST output into independent genes and gene fragments. Bioinformatics, Epub ahead of print (2004).
38. Yang, Z.
PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13, 555-556 (1997).
39. Thompson, J. D., Higgins, D. G. & Gibson, T. J. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight choice. Nucleic Acids Res. 22, 4673-4680 (1994).
40. Flicek, P., Keibler, E., Hu, P., Korf, I. & Brent, M. R. Leveraging the mouse genome for gene prediction in human: From whole-genome shotgun reads to a global synteny map. Genome Res. 13, 46-54 (2003).
41. Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520-562 (2002).
42. Korf, I., Flicek, P., Duan, D. & Brent, M. R. Integrating genomic homology into gene structure prediction. Bioinformatics 17, S1:140-148 (2001).
43. Hubbard, T., et al. The Ensembl genome database project. Nucleic Acids Res. 30, 38-41 (2002).
44. Roest-Crollius, H. et al. Estimate of human gene number provided by genome-wide analysis using Tetraodon nigroviridis DNA sequence. Nature Genet. 25, 235-238 (2000).
45. Elgar, G., et al. Generation and analysis of 25Mb of genomic DNA from the pufferfish Fugu rubripes by sequence scanning. Genome Res. 9, 960-971 (1999).
46. Aparicio, S. et al. Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science 297, 1301-1310 (2002).
47. Dehal, P. et al. The draft genome of Ciona intestinalis: insights into chordate and vertebrate origins. Science 298, 2157-2167 (2002).
48. Mulder, N. J., et al. InterPro: an integrated documentation resource for protein families, domains and functional sites. Brief Bioinform. 3, 225-235 (2002).
49. Zdobnov, E. M. & Apweiler, R. InterProScan--an integration platform for the signature-recognition methods in InterPro. Bioinformatics 17, 847-848 (2001).
50. Schwartz, S., et al. Human-Mouse alignments with BLASTZ. Genome Res. 13, 103-107 (2003).
51. Kolbe, D. et al.
Regulatory potential scores from genome-wide three-way alignments of human, mouse, and rat. Genome Res 14, 700-707 (2004).
52. Chang, C.-C. & Lin, C.-J. LIBSVM: a library for support vector machines (2001). Software available at http://www.csie.ntu.edu.tw/~cjilin/libsvm/
53. Bailey, J. A., Yavor, A. M., Massa, H. F., Trask, B. J. & Eichler, E. E. Segmental duplications: organization and impact within the current human genome project assembly. Genome Res 11, 1005-17 (2001).
54. Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 431, 931-945 (2004).

Supplemental Legends

Supplementary Figure 1 Comparison of mapped positions of deCODE STSs and their location within the (a) chromosome 2 and (b) chromosome 4 sequences. The apparent break at 90 Mb (a) and 50 Mb (b) reflects the position of the centromere.

Supplementary Figure 2 Sequence identity distribution of segmental duplications. For all pairwise alignments, the total number of aligned bases was calculated and binned based on percent sequence identity. Sequence identity distributions for interchromosomally (red) and intrachromosomally (blue) duplicated bases are shown. Assuming a clock-like rate for nuclear DNA substitution, percent similarity is linear with evolutionary time, providing a surrogate for the age of the duplication or gene conversion event.

Supplementary Figure 3 Pattern of recent segmental duplications on chromosome 2 and chromosome 4. Large (>10 kb) highly-similar (>95%) intrachromosomal (blue) and interchromosomal (red) segmental duplications are shown for chromosome 2 (a) and chromosome 4 (b). Chromosome 2 (a) and chromosome 4 (b) are drawn at a greater scale with respect to other human chromosomes. The ancestral telomere-telomere fusion region (A) and the likely ancestral centromere region (B) are shown for chromosome 2.
A 750 kb tandem intrachromosomal duplication cluster in chromosome 4 (C) demarcates a tandem cluster of microsomal UDP-glucuronosyltransferase genes. Two subtelomeric duplication clusters (D and E) are displayed in chromosome 4: (D) an interstitial 1.1 Mb region of subtelomeric duplication identified within 4q26; (E) sixteen duplications ranging in size from 10-20kb distributed among nine subtelomeric regions (chromosomes 1, 2, 5-8, 10, 16, 19, and 20).

Supplementary Figures 4 and 5 Location and % identity of segmental duplications on chromosome 2 (Supplementary Figure 4) and 4 (Supplementary Figure 5). Interchromosomal (red bars) and intrachromosomal (blue bars) duplications are represented along the horizontal line (0.2 Mb increments). The % identity of each pairwise alignment is shown along the vertical axis with colors representing different chromosomes (color key shown at the bottom of each figure). Black bars above the horizontal line represent alpha satellite sequences. Light green bars above the horizontal line represent regions detected by the whole-genome shotgun sequence detection strategy (WSSD). A ~200 kb duplication block near the FSHD (facioscapulohumeral muscular dystrophy) region of 4q35 (chr4:191166956-191387833) was detected by WSSD but not by the whole-genome analysis method. The ancestral telomere-telomere fusion region (A) and the likely ancestral centromere region (B) are shown for chromosome 2. A 750 kb tandem intrachromosomal duplication cluster in chromosome 4 (C) demarcates a tandem cluster of microsomal UDP-glucuronosyltransferase genes. Two subtelomeric duplication clusters (D and E) are displayed in chromosome 4.

Supplementary Table 1 Contiguous sequence accessions and lengths for chromosome 2 and 4 sequences. Gap sizes, as estimated by FISH and by orthologous mouse and rat sequences, are also included.

Supplementary Table 2 Brief summary of repetitive content of chromosomes 2 and 4.
Supplementary Table 3 Recombination rates for human chromosomes 2 and 4 based on deCODE and Genethon genetic maps.

Supplementary Table 4 Known mRNAs that had no match at all to the mouse or rat genomes or their protein sets. The chromosome, gi identifier for the mRNA, number of coding exons, amino acid length, interpro domains, comments, and information on gi’s for which we obtained sequences for the mRNA in multiple primates are provided.

Supplementary Table 5 Additional details to support Table 1 in main text related to polymorphic mRNAs detected on human chromosomes 2 and 4.

Supplementary Table 6 Fraction of duplicated basepairs. The percentages of non-redundant duplications (>90% sequence identity; >1 kb in length) were based on the total genome size 2,865,069,170, chromosome 2 size 237,541,603 and chromosome 4 size 186,841,959.

Supplementary Table 7 Accessions identifying regions containing highly polymorphic overlap regions. The chromosome and starting and ending positions of those highly polymorphic overlap regions are provided for human genome build35/hg17. General observations of sequence features in the region including gene content are also provided.

Supplementary Table 8 Positions of possible human deletions based on fosmid-end placements.

Supplementary Table 9 Polymorphisms detected from fosmid paired-end placements and verified by fosmid clone-based sequencing. The Broad Institute library fosmid name, fosmid clone accession, and a description of each polymorphism are provided.

Supplementary Table 10 Distribution of non-redundant known genes and spliced ESTs within duplicated regions of chromosomes 2 and 4. The non-redundant transcript set was screened for confirmatory transcriptional support by best genomic placement of ESTs. Spliced ESTs and transcripts were required to show evidence of splicing (two or more exons). In addition, each gene feature was binned as duplicated if at least 50 bp overlapped a duplicated region.
Thus, exons shorter than 50 bp were not included in this analysis. Known gene + EST is composed of non-redundant known gene and spliced EST transcription. Duplicated known genes and their coordinates are listed.
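The binning rule used for Supplementary Table 10 (a feature counts as duplicated when at least 50 bp of it overlap a duplicated region) can be expressed as a short Python sketch. Half-open coordinates on a single chromosome are assumed, as is the choice that the 50 bp must come from one duplicated region rather than being summed over several; the function names are hypothetical.

```python
def overlap_bp(f_start, f_end, d_start, d_end):
    """Number of bases shared by two half-open intervals (0 if disjoint)."""
    return max(0, min(f_end, d_end) - max(f_start, d_start))


def is_duplicated(feature, dup_regions, min_overlap=50):
    """True if the feature overlaps any single duplicated region by >= min_overlap bp."""
    f_start, f_end = feature
    return any(
        overlap_bp(f_start, f_end, d_start, d_end) >= min_overlap
        for d_start, d_end in dup_regions
    )
```

A 40 bp exon can never reach the threshold, which is why exons shorter than 50 bp drop out of the analysis.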