Supplementary Methods

The high-resolution metric FISH map of chromosome 19

A complete EcoRI restriction map spanning the entire length of the chromosome, excluding the centromere, provided the foundation for sequencing human chromosome 19. Initially, over 14,000 chromosome 19-specific cosmids were randomly fingerprinted using a high-resolution, fluorescence-based approach: a series of restriction digests using two 6-cutters (EcoRI, BglII) followed by three 4-cutters (DdeI, HaeIII, HinfI) generated 50-400 bp fluorescently labeled fragments that were sized on PE/ABD Model 370A or 373 DNA Sequencers to create a unique fingerprint for each clone [1]. Likelihood ratios representing the probability of overlap between fingerprints were generated for each cosmid pair and used in an automated contig assembly algorithm to computationally establish islands of clonal continuity. These cosmid contigs were validated and refined by manual assembly of restriction maps using fluorescently labeled complete-digest EcoRI restriction fragments (0.5 to >40 kb) sized with a Model 362 GeneScanner [2].

A unique component of the chromosome 19 mapping effort was the construction of a high-resolution, metric FISH map. Individual clones representing cosmid contigs were iteratively mapped by fluorescence in situ hybridization (FISH) through a series of chromatin targets of increasing resolution. The chromosomal location of selected cosmids was first established with one-color FISH to metaphase chromosomes [3]; binned clones were then ordered relative to each other with two- and three-color FISH to metaphase and interphase nuclei [4]. Finally, contigs were oriented and distances between clones were estimated using three-color FISH against high-resolution pronuclear chromatin targets derived from fusing human sperm with hamster eggs [5, 6]. The metrically resolved scaffold framework of 216 cosmid reference points defined contig order and orientation, gap locations and gap sizes to efficiently direct map closure.
Contig extension and gap closure were achieved by hybridization screening of end-labeled, inter-Alu, oligo and overgo probes against high-density arrayed cosmid, fosmid, PAC and BAC libraries, supplemented by limited STS screening of P1s. Hybridization-identified clones were digested and incorporated into restriction maps to verify extensions and gap closures. These maps were used to select sequencing tiling path clones, validate the identity of clones queued for sequencing and target sequence closure efforts.

Extensive in-house and collaborative efforts yielded a number of studies characterizing chromosome 19 content. These include, but are not limited to, the following: characterization of disease-related genes such as those for myotonic dystrophy [7] and congenital nephrotic syndrome (CNF) [8]; characterization of the DNA repair enzymes XRCC1 [9], ERCC1 and ERCC2 [10]; definition of the gene family regions for the carcinoembryonic antigen and pregnancy-specific glycoproteins [11] and for the kallikreins [12]; and the evolution of the cytochrome P450 (CYP) gene family [13].

The mapping history reviewed here gave the chromosome 19 effort several distinctive attributes. Because the restriction map was completed early relative to the decline in sequencing costs, the tiling path was originally picked somewhat parsimoniously, so that numerous small gaps (<2 kb) were purposely left to be closed by alternative methods such as PCR walking. However, the underlying map defined the order and orientation of all sequencing contigs from the start and readily identified substrates for sequence gap closure. Furthermore, because of the pronuclear-based scaffold FISH map, the metric for all gaps, including recalcitrant gaps that required extensive hybridization in alternative libraries, was predetermined.
Finally, the monochromosomal source for the cosmid library, which represents much of the tiling path, provided a single haplotype path in highly duplicated or repeat-rich regions, thus bypassing many of the problems inherent to closing these regions using clones from different haplotypes.

Additional tiling set information

Both gaps in the p arm of chromosome 19 lie in regions where genomic DNA appears to be unstable in the cloning vectors currently available. BAC and fosmid clones were identified that span the gaps, but the sequenced clones were inconsistent with one another and so were not included in the tiling set. Size estimates for these gaps could not be obtained from the mouse draft assembly because of breaks in synteny, and FISH sizing was considered unreliable because of the unstable nature of the clones. The gaps were arbitrarily sized at 5 kb.

To ensure specificity in both subtelomeric regions of the chromosome, a tiling path was generated using chromosome 19-specific cosmids. The distance from the most distal cosmid to the true telomere is estimated to be 11 kb for the p-telomere and 5 kb for the q-telomere (Harold Riethman, pers. comm.). A 19p-telomere-containing ‘half-YAC’ [14] has been identified but has been too unstable to sequence.

The boundary between euchromatin and heterochromatin at the centromere was identified by the presence of centromere-specific alpha satellite repeats. On the p-ward side, although it is not included in the current tiling set, AC136499 (NT_078103) probably extends further towards the centromere, as indicated by its overlap with the chromosome-specific cosmid AC020949 and by FISH signal. The placement is tenuous, however, because of the large repeat structure of the clone and the failure to locate a clone that spans to AC073541.

Sequencing and finishing methods

BAC DNA was hydrodynamically sheared using a Hydroshear Instrument (GeneMachines, San Carlos, CA), size selected (3-4 kb) and subcloned into the plasmid vector pUC18.
Randomly selected plasmid subclones were sequenced in both directions using universal primers and BigDye Terminator chemistry to an average sequence depth of 8x. Sequences were then assembled and edited using the Phred/Phrap/Consed suite of programs [15-17]. Following manual inspection of the assembled sequences, clones were finished by resequencing plasmid subclones and by walking on plasmid subclones or the large insert clone using custom primers. All finishing reactions were performed using dGTP BigDye Terminator chemistry (Applied Biosystems, Foster City, CA). Recalcitrant areas or hard gaps were closed with additional sequence data derived from sequencing with additives, transposon sequencing, small insert shatter libraries or PCR. Finished clones contain no gaps and are estimated to contain less than one error per 10,000 base pairs. For clones that had a very high repeat content or that showed considerable bias when cloned into the pUC-derived vector, additional 10 kb libraries were constructed in an alternate, low-copy-number vector.

Supplementary Finishing Information

The tiling set of chromosome 19 consists of 860 finished clones. Of these, 550 were drafted at the Joint Genome Institute and finished at the Stanford Human Genome Center, while 310 were drafted and finished at Lawrence Livermore National Laboratory. The following clones were drafted and/or finished elsewhere.
AC018725  University of Wisconsin, Madison, WI, USA
AC067968  University of Wisconsin, Madison, WI, USA
AC084219  University of Wisconsin, Madison, WI, USA
AC021092  University of Wisconsin, Madison, WI, USA
AC069278  University of Wisconsin, Madison, WI, USA
AC068948  University of Wisconsin, Madison, WI, USA
AC006213  Whitehead Institute/MIT Center for Genome Research
AC093456  Brookhaven National Laboratory, Upton, NY, USA
AF037338  University of Iowa, Iowa City, IA, USA

'Completeness' of the Chromosome 19 sequence

Locations of STS markers were determined using a combination of three methods. First, available complete marker sequences were aligned using blat version 24 [18] with parameter -ooc=11.ooc. These alignments were further filtered to include only the best alignments with at least 60% coverage. Second, fasta sequences created from primer sequence information were aligned using blat version 2 with parameters -tileSize=10 -ooc=10.ooc -minMatch=1 -minScore=1 -minIdentity=75. These alignments were then filtered on the number of mismatches and the deviation from the reported product size. With no mismatches, the size was allowed to deviate by up to 200 bases; similarly, deviations of up to 150 base pairs were allowed with one mismatch, 50 base pairs with two mismatches and 25 base pairs with three mismatches. Third, e-PCR [19] was run using primer information with parameters N=1 M=50 W=5, where N is the number of allowed mismatches, M is the allowed deviation from the reported product size and W is the word size. The results of the three methods were combined, with preference given to full sequence alignments. That is, where full sequence alignments were found, these were the only placements reported; otherwise, primer-based locations are reported.

Reported on chromosome 19 are 121 markers in the Genethon genetic map [20], 213 markers on the Marshfield genetic map [21] and 120 markers on the deCODE genetic map [22] with either full sequence or primer information available.
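The primer-based size filter and the preference for full sequence alignments described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline actually used: the function and variable names are hypothetical, the deviation limits are treated as inclusive, and the mismatch count is assumed to be the total over both primers, none of which is specified in the text.

```python
# Maximum allowed deviation (in bases) from the reported product size,
# keyed by the number of primer mismatches, as given in the text.
MAX_SIZE_DEVIATION = {0: 200, 1: 150, 2: 50, 3: 25}


def passes_primer_filter(mismatches, observed_size, reported_size):
    """Return True if a primer-based placement meets the tiered thresholds.

    Assumption: limits are inclusive and `mismatches` is the total over
    both primers. Placements with more than three mismatches are rejected.
    """
    limit = MAX_SIZE_DEVIATION.get(mismatches)
    if limit is None:
        return False
    return abs(observed_size - reported_size) <= limit


def choose_placements(full_sequence_hits, primer_hits):
    """Prefer full sequence alignments; fall back to primer-based hits."""
    if full_sequence_hits:
        return list(full_sequence_hits)
    return list(primer_hits)
```

For example, under these assumptions a placement with one mismatch and a product 120 bases longer than reported would be kept, whereas the same size deviation with two mismatches would be discarded.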
In total, there are 215 unique genetic markers from these maps. Of these, 213 are found in unique locations on chromosome 19. A single marker, D19S585, is found in two locations approximately 0.5 Mb apart. The last marker, D19S724, is found on chromosome 1 using both the full sequence and the primer information provided for the marker. This marker appears only on the Marshfield map, and it likely represents an error either in the Marshfield map or in the sequence and primer information associated with the marker.

The order of the markers in the sequence corresponds almost perfectly with that in each of the genetic maps, and corresponds perfectly with the deCODE map. Strictly speaking, there are 12 ordering inconsistencies when compared to the Genethon map and 20 when compared to the Marshfield map, each involving a single out-of-order marker. Of these 32 inconsistencies, 15 are between markers that differ by less than 1 cM, with the largest difference being 2.67 cM. The genetic length of chromosome 19 is 109.9 cM in the Genethon map, 105.02 cM in the Marshfield map and 109.73 cM in the deCODE map.

Locations of full-length mRNA sequences were determined using blat version 24 with parameters -q=rna -trimHardA -fine -ooc=11.ooc. Resulting alignments were filtered to report only the best alignment for each mRNA, requiring at least 98% base pair identity. Full-length mRNA sequences present in RefSeq [23] and the Mammalian Gene Collection [24] on May 16, 2003 were aligned. The sequences aligned were those available in GenBank on August 1, 2003. A total of 2,055 unique mRNA sequences representing 1,166 distinct loci could be aligned with at least 95% coverage and 98% base pair identity. A single mRNA, BC008405, could be aligned at 97.4% base pair identity.
This mRNA encodes the PSG4 (pregnancy specific beta-1-glycoprotein 4) gene, which is annotated as containing two immunoglobulin C-2 type regions; the reduced base pair identity is thus most likely due to haplotype differences. The RefSeq mRNA for this locus, NM_002780, aligns at nearly 100% identity.

Paired end sequences from BAC and fosmid clones were aligned using blat with parameter -ooc=11.ooc. End sequences were determined to be correctly located when they were oriented correctly with respect to each other and were a reasonable distance apart. For BAC clones, end sequences must be at least 50 kb but no more than 300 kb apart. For fosmid clones, end sequences must be at least 30 kb but no more than 50 kb apart. A total of 2,489 pairs of BAC end sequences and 10,060 fosmid end sequences could be aligned to chromosome 19. Of the 55,785,651 sequenced bases, 54,381,658 (97.5%) are covered by BAC clones and 55,015,173 (98.6%) are covered by fosmid clones, with 55,639,959 (99.7%) covered by the union of these two sets. There are only five instances of a break in both fosmid and BAC clone coverage that is not due to a gap in the sequence. In total, these breaks contain 80 kb of sequence. These breaks may simply reveal a lack of coverage in the fosmid and BAC end sequence collections, though it is possible that they point to a polymorphism or deletion in the underlying sequence.

Segmental Duplication Analysis

We performed a detailed analysis of duplicated sequence (≥90% sequence identity and ≥1 kb in length), comparing the finished chromosome 19 assembly against a recent build of the human genome (Supplementary Information S8 and S9). We then compared duplications detected by this method to a previous analysis of segmental duplications using a whole genome shotgun sequence detection (WSSD) strategy [25]. There was good correspondence between the chromosome 19 segmental duplications detected by the two methods (Supplementary Information S8).
Virtually all duplications found by WSSD were supported by the new analysis. Only one small region was identified (41 kb, from 7.244 to 7.286 Mb) where a highly homologous sequence (identity > 99.5%) was detected but was not predicted by WSSD. Some duplications detected within the pericentromeric region of the p-arm were not supported by WSSD because of their low sequence homology (identity < 95%) (Supplementary Information S9), which is below detection thresholds.

Gene content of duplicated regions was analyzed using a non-redundant/non-overlapping set of known genes (Supplementary Information S10). A gene feature (exon or CDS) was considered duplicated if >50 bp of the feature overlapped a duplication. Thus, exons shorter than 50 bp were lost in this analysis. Additionally, genes required evidence of splicing (2 or more exons). Overall, 5.74% of all coding regions of the non-redundant genes were duplicated.

References

1. Carrano, A. V. et al. A high-resolution fluorescence-based, semi-automated method for DNA fingerprinting. Genomics 4, 129-136 (1989).
2. Lamerdin, J. E. & Carrano, A. V. Automated fluorescence-based restriction fragment analysis. BioTechniques 15, 294-303 (1993).
3. Trask, B. J. et al. Fluorescence in situ hybridization mapping of human chromosome 19: Cytogenetic band location of 540 cosmids and 70 genes or DNA markers. Genomics 15, 133-145 (1993).
4. Trask, B. J. et al. Fluorescence in situ hybridization mapping in interphase chromatin. Genome Mapping and Sequencing, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, p. 261 (1993).
5. Brandriff, B. F. et al. Human chromosome 19p: A fluorescence in situ hybridization map with genomic distance estimates for 79 intervals spanning 20 Mbp. Genomics 23, 582-591 (1994).
6. Gordon, L. A. et al. A 30-Mb metric fluorescence in situ hybridization map of human chromosome 19q. Genomics 30, 187-192 (1995).
7. Mahadevan, M. et al. Myotonic dystrophy mutation: an unstable CTG repeat in the 3' untranslated region of the gene. Science 255, 1253-1255 (1992).
8. Lenkkeri, U. et al. Structure of the gene for congenital nephrotic syndrome of the Finnish type (NPHS1) and characterization of mutations. Am. J. Hum. Genet. 64, 51-61 (1999).
9. Lamerdin, J. E. et al. Genomic sequence comparison of the human and mouse XRCC1 DNA repair gene regions. Genomics 25, 547-554 (1994).
10. Lamerdin, J. E. et al. Sequence analysis of the ERCC2 gene regions in human, mouse, and hamster reveals three linked genes. Genomics 34, 399-409 (1996).
11. Olsen, A. et al. Gene organization of the pregnancy-specific glycoprotein region on human chromosome 19: Assembly and analysis of a 700-kb cosmid contig spanning the region. Genomics 23, 659-668 (1994).
12. Harvey, T. J. et al. Tissue-specific expression patterns and fine mapping of the human kallikrein (KLK) locus on proximal 19q13.4. J. Biol. Chem. 275, 37397-37406 (2000).
13. Hoffman, S. M. G., Fernandez-Salguero, P., Gonzalez, F. J. & Mohrenweiser, H. W. Organization and evolution of the cytochrome P450 CYP2A-2B-2F subfamily gene cluster on human chromosome 19. J. Mol. Evol. 41, 894-900 (1995).
14. Riethman, H. et al. Integration of telomere sequences with the draft human genome sequence. Nature 409, 948-951 (2001).
15. Ewing, B., Hillier, L., Wendl, M. C. & Green, P. Base-calling of automated sequencer traces using Phred. I. Accuracy assessment. Genome Res. 8, 175-185 (1998).
16. Ewing, B. & Green, P. Base-calling of automated sequencer traces using Phred. II. Error probabilities. Genome Res. 8, 186-194 (1998).
17. Gordon, D., Abajian, C. & Green, P. Consed: A graphical tool for sequence finishing. Genome Res. 8, 195-202 (1998).
18. Kent, W. J. BLAT - The BLAST-Like Alignment Tool. Genome Res. 12, 656-664 (2002).
19. Schuler, G. D. Electronic PCR: bridging the gap between genome mapping and genome sequencing. Trends Biotechnol. 16, 456-459 (1998).
20. Dib, C. et al. A comprehensive genetic map of the human genome based on 5,264 microsatellites. Nature 380, 152-154 (1996).
21. Broman, K. W., Murray, J. C., Sheffield, V. C., White, R. L. & Weber, J. L. Comprehensive human genetic maps: individual and sex-specific variation in recombination. Am. J. Hum. Genet. 63, 861-869 (1998).
22. Deloukas, P. et al. A physical map of 30,000 human genes. Science 282, 744-746 (1998).
23. Pruitt, K. D. & Maglott, D. R. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 29, 137-140 (2001).
24. Strausberg, R. L. et al. Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences. PNAS 99, 16899-16903 (2002).
25. Bailey, J. A. et al. Recent segmental duplications in the human genome. Science 297, 1003-1007 (2002).