Supplementary methods

The high-resolution metric FISH map of chromosome 19
A complete EcoRI restriction map spanning the entire length of the chromosome,
excluding the centromere, provided the foundation for sequencing human chromosome
19. Initially, over 14,000 chromosome 19-specific cosmids were randomly fingerprinted
using a high-resolution, fluorescence-based approach; a series of restriction digests using
two 6-cutters (EcoRI, BglII) followed by three 4-cutters (DdeI, HaeIII, HinfI) generated
50-400 bp fluorescently-labeled fragments that were sized on PE/ABD Model 370A or
373 DNA Sequencers to create unique fingerprints for each clone [1]. Likelihood ratios
representing the probability of overlap between fingerprints were generated for each
cosmid pair and used in an automated contig assembly algorithm to computationally
establish islands of clonal continuity. These cosmid contigs were validated and refined
by manual assembly of restriction maps using fluorescently-labeled complete digest
EcoRI restriction fragments (0.5 to >40 kb) sized with a Model 362 GeneScanner [2].
A unique component of the chromosome 19 mapping effort was the construction of a
high-resolution, metric FISH map. Individual clones representing cosmid contigs were
iteratively mapped by fluorescence in situ hybridization (FISH) through a series of
chromatin targets with increasing resolution. Chromosomal location for selected cosmids
was first established with one-color FISH to metaphase chromosomes [3]; binned clones
were then ordered relative to each other with two- and three-color FISH to metaphase and
interphase nuclei [4]. Finally, contigs were oriented and distances between clones were
estimated using three-color FISH against high-resolution pronuclear chromatin targets
derived from fusing human sperm with hamster eggs [5, 6]. The metrically resolved scaffold
framework of 216 cosmid reference points defined contig order and orientation, gap
locations and gap sizes to efficiently direct map closure.
Contig extension and gap closure were achieved by hybridization screening of end-labeled,
inter-Alu, oligo and overgo probes against high-density arrayed cosmid, fosmid, PAC and
BAC libraries, supplemented by limited STS screening of P1s. Clones identified by
hybridization were digested and incorporated into restriction maps to verify extensions
and gap closures. These maps were used to select sequencing tiling path clones, validate
the identity of clones queued for sequencing and target sequence closure efforts.
Extensive in-house and collaborative efforts yielded a number of studies characterizing
chromosome 19 content. These include, but are not limited to: characterization of
disease-related genes such as those for myotonic dystrophy [7] and congenital nephrotic
syndrome (CNF) [8]; characterization of the DNA repair enzymes XRCC1 [9], ERCC1 and
ERCC2 [10]; definition of the gene family regions for the carcino-embryonic antigen and
pregnancy-specific glycoproteins [11] and the kallikreins [12]; and the evolution of the
cytochrome P450 (CYP) gene family [13].
The mapping history reviewed here gave the chromosome 19 effort several distinctive
attributes. Because the restriction map was completed early relative to the decline in
sequencing costs, the tiling path was originally picked somewhat parsimoniously, such
that numerous small gaps (<2 kb) were purposely left to be closed by alternative methods
such as PCR walking. However, the underlying map defined order and orientation of all
sequencing contigs from the start and readily identified substrates for sequence gap
closure. Furthermore, because of the pronuclear-based scaffold FISH map, the metric for
all gaps, including recalcitrant gaps that required extensive hybridization in alternative
libraries, was predetermined. Finally, the monochromosomal source for the cosmid
library, which represents much of the tiling path, provided a single haplotype path in
highly duplicated or repeat-rich regions, thus bypassing many of the problems inherent to
closing these regions using clones from different haplotypes.
Additional tiling set information
Both gaps in the p arm of chromosome 19 are regions where genomic DNA appears to be
unstable in the cloning vectors currently available. BAC and fosmid clones that span the
gaps were identified, but all sequenced clones were inconsistent with one another and so
were not included in the tiling set. Size estimates for these gaps could not be obtained
from the mouse draft assembly because of breaks in synteny, and FISH sizing was
considered unreliable because of the unstable nature of the clones. The gaps were
arbitrarily sized at 5 kb.
To ensure specificity in both subtelomeric regions of the chromosome, a tiling path was
generated using chromosome 19-specific cosmids. The distance from the most distal
cosmid to the true telomere is estimated to be 11 kb for the p-telomere and 5 kb for the
q-telomere (Harold Riethman, pers. comm.). A 19p-telomere-containing ‘half-YAC’ [14] has
been identified but has been too unstable to sequence.
The boundary between euchromatin and heterochromatin at the centromere was identified by
the presence of centromere-specific alpha satellite repeats. On the p-ward side,
AC136499 (NT_078103), although not included in the current tiling set, probably extends
further towards the centromere, given its overlap with the chromosome-specific cosmid
AC020949 and its FISH signal. The placement is tenuous, however, because of the large
repeat structure of the clone and the failure to locate a clone that spans to AC073541.
Sequencing and finishing methods
BAC DNA was hydrodynamically sheared using a Hydroshear Instrument
(GeneMachines, San Carlos, CA), size selected (3-4kb) and subcloned into the plasmid
vector pUC18. Randomly selected plasmid subclones were sequenced in both directions
using universal primers and BigDye Terminator chemistry to an average sequence depth
of 8x. Sequences were then assembled and edited using the Phred/Phrap/Consed suite of
programs [15, 16, 17]. Following manual inspection of the assembled sequences, clones were
finished by resequencing plasmid subclones and by walking on plasmid subclones or the
large insert clone using custom primers. All finishing reactions were performed using
dGTP BigDye Terminator chemistry (Applied Biosystems, Foster City, CA).
Recalcitrant areas or hard gaps were closed with additional sequence data derived from
sequencing with additives, transposon sequencing, small insert shatter libraries or PCR.
Finished clones contain no gaps and are estimated to contain less than one error per
10,000 base pairs. For clones with a very high repeat content, or that showed considerable
bias when cloned into the pUC-derived vector, additional 10 kb libraries were constructed
in an alternate vector with a low copy number.
Supplementary Finishing Information
The tiling set of chromosome 19 consists of 860 finished clones. Of these, 550 clones
were drafted at the Joint Genome Institute and finished at the Stanford Human Genome
Center, while 310 clones were drafted and finished at Lawrence Livermore National
Laboratory. The following clones were drafted and/or finished elsewhere:
Accession   Institution
AC018725    University of Wisconsin, Madison, WI, USA
AC067968    University of Wisconsin, Madison, WI, USA
AC084219    University of Wisconsin, Madison, WI, USA
AC021092    University of Wisconsin, Madison, WI, USA
AC069278    University of Wisconsin, Madison, WI, USA
AC068948    University of Wisconsin, Madison, WI, USA
AC006213    Whitehead Institute/MIT Center for Genome Research
AC093456    Brookhaven National Laboratory, Upton, NY, USA
AF037338    University of Iowa, Iowa City, IA, USA
'Completeness' of the Chromosome 19 sequence
Locations of STS markers were determined using a combination of three methods. First,
available complete marker sequences were aligned using blat version 24 [18] with
parameter -ooc=11.ooc. These alignments were filtered to include only the best alignments
with at least 60% coverage. Second, FASTA sequences created from primer sequence
information were aligned using blat version 2 with parameters -tileSize=10 -ooc=10.ooc
-minMatch=1 -minScore=1 -minIdentity=75. These were then filtered based on the
number of mismatches and the deviation from the reported product size. For placements
with no mismatches, the size was allowed to deviate by up to 200 bases. Similarly,
combinations of one mismatch/150 base pairs, two mismatches/50 base pairs, and three
mismatches/25 base pairs were used. Third, e-PCR [19] was run using primer information
with parameters N=1 M=50 W=5, where N is the number of allowed mismatches, M is the
allowed deviation from the reported product size, and W is the word size. The results of
the three methods were combined, with preference given to full sequence alignments. That
is, where full sequence alignments were found, these were the only placements reported;
otherwise, primer-based locations were reported.
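The primer-based filter and the preference rule can be illustrated with a minimal Python sketch. It is not the pipeline used for the analysis: the function names are hypothetical, the thresholds are read as inclusive limits, and placements with more than three mismatches are assumed to be rejected.

```python
# Maximum allowed deviation (bp) from the reported product size for each
# mismatch count, taken from the text. Treating the limits as inclusive is
# an assumption of this sketch.
MAX_SIZE_DEVIATION = {0: 200, 1: 150, 2: 50, 3: 25}


def passes_primer_filter(mismatches, observed_size, reported_size):
    """Return True if a primer-based placement meets the stated thresholds."""
    limit = MAX_SIZE_DEVIATION.get(mismatches)
    if limit is None:
        # Assumption: anything with more than three mismatches is rejected.
        return False
    return abs(observed_size - reported_size) <= limit


def combine_placements(full_sequence_hits, primer_hits):
    """Report full-sequence placements when any exist, else primer-based ones."""
    if full_sequence_hits:
        return list(full_sequence_hits)
    return list(primer_hits)
```

For example, `passes_primer_filter(2, 260, 200)` rejects a placement whose product is 60 bp longer than reported, because two mismatches allow a deviation of only 50 bp.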
Reported on chromosome 19 are 121 markers in the Genethon genetic map [20], 213
markers on the Marshfield genetic map [21], and 120 markers on the deCODE genetic map [22]
with either full sequence or primer information available. In total, there are 215 unique
genetic markers from these maps. Of these, 213 are found in unique locations on
chromosome 19. A single marker, D19S585, is found in two locations approximately
0.5 Mb apart. The last marker, D19S724, is found on chromosome 1 using both the full
sequence and the primer information provided for the marker. This marker appears only on
the Marshfield map, and it likely represents an error either in the Marshfield map or in
the sequence and primer information associated with the marker.
The order of the markers in the sequence corresponds almost perfectly with that in each
of the genetic maps, and corresponds perfectly with the deCODE map.
Strictly speaking, there are 12 ordering inconsistencies when compared to the Genethon
map, and 20 compared to the Marshfield map, each being a case of a single out-of-order
marker. Of these 32 inconsistencies, 15 are between markers that differ by less than 1 cM,
with the largest difference being 2.67 cM. The genetic length of chromosome 19 is
109.9 cM in the Genethon map, 105.02 cM in the Marshfield map, and 109.73 cM in the
deCODE map.
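The text does not say how ordering inconsistencies were counted. One plausible way to count single out-of-order markers, sketched here purely as an assumption, is to list the sequence coordinates of the markers in genetic-map order and count how many must be dropped for the rest to be strictly increasing (the total minus the length of the longest increasing subsequence). The function name is hypothetical.

```python
from bisect import bisect_left


def count_out_of_order(seq_positions):
    """Count markers that break an otherwise increasing order.

    seq_positions: sequence coordinates of the markers, listed in the order
    given by the genetic map. The result is the number of markers outside a
    longest strictly increasing subsequence.
    """
    tails = []  # tails[k] = smallest tail of an increasing run of length k + 1
    for pos in seq_positions:
        i = bisect_left(tails, pos)
        if i == len(tails):
            tails.append(pos)
        else:
            tails[i] = pos
    return len(seq_positions) - len(tails)
```

With this definition, a map whose markers fall at sequence positions 100, 200, 150, 300 and 400 has one inconsistency, since removing either the second or the third marker restores a consistent order.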
Locations of full-length mRNA sequences were determined using blat version 24 with
parameters -q=rna -trimHardA -fine -ooc=11.ooc. Resulting alignments were filtered to
report only the best alignment for each mRNA, requiring at least 98% base pair identity.
Full-length mRNA sequences present in RefSeq [23] and the Mammalian Gene
Collection [24] on May 16th, 2003 were aligned. The sequence aligned for each was the
version available in GenBank on August 1, 2003.
A total of 2,055 unique mRNA sequences representing 1,166 distinct loci could be aligned
with at least 95% coverage and 98% base pair identity. A single mRNA, BC008405, could be
aligned at only 97.4% base pair identity. This mRNA encodes the PSG4 (pregnancy specific
beta-1-glycoprotein 4) gene, which is annotated as containing two immunoglobulin C-2 type
regions; the reduced base pair identity is therefore most likely due to haplotype
differences. The RefSeq mRNA for this locus, NM_002780, aligns at nearly 100% identity.
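The filtering step can be sketched in Python as follows. This is an illustration only: the record layout (dictionaries with `mrna`, `identity` and `coverage` keys) and the rule for ranking the "best" alignment (highest identity, then highest coverage) are assumptions, since the text gives only the thresholds.

```python
def best_alignments(alignments, min_identity=0.98, min_coverage=0.95):
    """Keep one passing alignment per mRNA.

    alignments: iterable of dicts with keys 'mrna', 'identity', 'coverage'
    (fractions between 0 and 1). The layout is hypothetical.
    """
    best = {}
    for aln in alignments:
        if aln["identity"] < min_identity or aln["coverage"] < min_coverage:
            continue  # fails the thresholds stated in the text
        current = best.get(aln["mrna"])
        rank = (aln["identity"], aln["coverage"])
        # Assumption: "best" means highest identity, ties broken by coverage.
        if current is None or rank > (current["identity"], current["coverage"]):
            best[aln["mrna"]] = aln
    return best
```

Under these thresholds an mRNA such as BC008405, aligned at 97.4% identity, would be excluded, which is consistent with its being reported separately.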
Paired end sequences from BAC and fosmid clones were aligned using blat with
parameter -ooc=11.ooc. End sequences were judged to be correctly located when
they were oriented correctly with respect to each other and were a reasonable
distance apart. For BAC clones, end sequences must be at least 50 kb but no more than
300 kb apart. For fosmid clones, end sequences must be at least 30 kb but no more than
50 kb apart.
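A minimal Python sketch of this placement check is given below. The names and the alignment representation are hypothetical, and two details are assumptions not stated in the text: "oriented correctly" is taken to mean that the two end reads point towards each other (left read on the plus strand, right read on the minus strand), and the distance is measured as the outer span of the pair.

```python
# Allowed distance between paired ends (bp), from the text.
SIZE_LIMITS = {"BAC": (50_000, 300_000), "fosmid": (30_000, 50_000)}


def is_correctly_located(clone_type, end_a, end_b):
    """Check orientation and spacing of one pair of end-sequence alignments.

    end_a, end_b: (start, end, strand) tuples on the same chromosome, with
    strand given as '+' or '-'.
    """
    left, right = sorted([end_a, end_b], key=lambda e: e[0])
    # Assumption: a correct pair has its reads facing each other.
    if left[2] != "+" or right[2] != "-":
        return False
    min_len, max_len = SIZE_LIMITS[clone_type]
    # Assumption: distance is the outer span from left start to right end.
    span = right[1] - left[0]
    return min_len <= span <= max_len
```

A pair spanning about 150 kb would therefore pass as a BAC but fail as a fosmid, and a pair with both reads on the same strand would fail regardless of spacing.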
A total of 2,489 pairs of BAC end sequences and 10,060 fosmid end sequences could be
aligned to chromosome 19. Of the 55,785,651 sequenced bases, 54,381,658 (97.5%) are
covered by BAC clones and 55,015,173 (98.6%) are covered by fosmid clones with
55,639,959 (99.7%) covered in the union of these two sets. There are only five instances
where there is a break in both fosmid and BAC clone coverage that is not due to a gap in
the sequence. In total, these breaks contain 80 kb of sequence. These breaks may simply
reveal a lack of coverage in the fosmid and BAC end sequence collections, though it is
possible that they point to a polymorphism or deletion in the underlying sequence.
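Coverage figures of this kind come from merging overlapping clone intervals and summing the merged lengths. The sketch below shows that calculation and re-derives the reported percentages from the base counts given in the text. The function name and the half-open interval convention are assumptions, not a description of the software actually used.

```python
def covered_bases(intervals):
    """Total number of bases covered by half-open (start, end) intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # A new, disjoint block begins: bank the previous one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


SEQUENCED = 55_785_651  # sequenced bases, from the text
bac_pct = percent(54_381_658, SEQUENCED)
fosmid_pct = percent(55_015_173, SEQUENCED)
union_pct = percent(55_639_959, SEQUENCED)
```

Coverage by the union of the two clone sets is obtained by passing the concatenated BAC and fosmid interval lists to `covered_bases`; breaks in both coverage sets are the positions left outside the merged blocks.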
Segmental Duplication Analysis
We performed a detailed analysis of duplicated sequence (≥90% sequence identity and ≥1
kb in length) comparing the finished chromosome 19 assembly against a recent build of
the human genome (Supplementary Information S8 and S9). We then compared the
duplications detected by this method to a previous analysis of segmental duplications
that used a whole genome shotgun sequence detection (WSSD) strategy [25].
There was good correspondence between the chromosome 19 segmental duplications detected
by the two methods (Supplementary Information S8). Virtually all duplications found by
WSSD were supported by the new analysis. Only one small region was identified (41 kb,
from 7.244 to 7.286 Mb) where a highly homologous sequence (identity > 99.5%) was
detected but was not predicted by WSSD. Some duplications detected within the
pericentromeric region of the p-arm were not supported by WSSD because their
sequence identity (< 95%) (Supplementary Information S9) is below the WSSD
detection threshold.
Gene content of duplicated regions was analyzed using a non-redundant/non-overlapping
set of known genes (Supplementary Information S10). A gene feature (exon or CDS)
was considered duplicated if >50 bp of the feature overlapped a duplication. Thus, exons
shorter than 50 bp were lost in this analysis. Additionally, genes were required to show
evidence of splicing (2 or more exons). Overall, 5.74% of all coding regions of the
non-redundant genes were duplicated.
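The overlap rule can be sketched in Python as shown below. The function names and the half-open coordinate convention are hypothetical, and the sketch assumes the >50 bp overlap must come from a single duplication interval rather than being summed over several, which the text does not specify.

```python
def overlap_bp(feature, duplication):
    """Number of bases shared by two half-open (start, end) intervals."""
    return max(0, min(feature[1], duplication[1]) - max(feature[0], duplication[0]))


def is_duplicated(feature, duplications, min_overlap=50):
    """True if more than min_overlap bp of the feature lies in a duplication."""
    return any(overlap_bp(feature, dup) > min_overlap for dup in duplications)


def gene_is_eligible(exons):
    """Genes need evidence of splicing, i.e. two or more exons."""
    return len(exons) >= 2
```

Because the test is a strict inequality on overlap length, a short exon lying entirely within a duplication is still not counted, which is why short exons drop out of the analysis.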
References
1. Carrano, A. V. et al. A high-resolution, fluorescence-based, semi-automated
method for DNA fingerprinting. Genomics 4, 129-136 (1989).
2. Lamerdin, J. E. & Carrano, A. V. Automated fluorescence-based restriction
fragment analysis. BioTechniques 15, 294-303 (1993).
3. Trask, B. J. et al. Fluorescence in situ hybridization mapping of human
chromosome 19: Cytogenetic band location of 540 cosmids and 70 genes or DNA
markers. Genomics 15, 133-145 (1993).
4. Trask, B. J. et al. Fluorescence in situ hybridization mapping in interphase
chromatin. Genome Mapping and Sequencing, Cold Spring Harbor Laboratory, Cold
Spring Harbor, NY, p. 261 (1993).
5. Brandriff, B. F. et al. Human chromosome 19p: A fluorescence in situ
hybridization map with genomic distance estimates for 79 intervals spanning 20 Mbp.
Genomics 23, 582-591 (1994).
6. Gordon, L. A. et al. A 30-Mb metric fluorescence in situ hybridization map of
human chromosome 19q. Genomics 30, 187-192 (1995).
7. Mahadevan, M. et al. Myotonic dystrophy mutation: an unstable CTG repeat in
the 3' untranslated region of the gene. Science 255, 1253-1255 (1992).
8. Lenkkeri, U. et al. Structure of the gene for congenital nephrotic syndrome of the
Finnish type (NPHS1) and characterization of mutations. Am. J. Hum. Genet. 64, 51-61
(1999).
9. Lamerdin, J. E. et al. Genomic sequence comparison of the human and mouse
XRCC1 DNA repair gene regions. Genomics 25, 547-554 (1994).
10. Lamerdin, J. E. et al. Sequence analysis of the ERCC2 gene regions in human,
mouse, and hamster reveals three linked genes. Genomics 34, 399-409 (1996).
11. Olsen, A. et al. Gene organization of the pregnancy-specific glycoprotein region
on human chromosome 19: Assembly and analysis of a 700-kb cosmid contig spanning the
region. Genomics 23, 659-668 (1994).
12. Harvey, T. J. et al. Tissue-specific expression patterns and fine mapping of the
human kallikrein (KLK) locus on proximal 19q13.4. J. Biol. Chem. 275, 37397-37406
(2000).
13. Hoffman, S. M. G., Fernandez-Salguero, P., Gonzalez, F. J. & Mohrenweiser,
H. W. Organization and evolution of the cytochrome P450 CYP2A-2B-2F subfamily
gene cluster on human chromosome 19. J. Mol. Evol. 41, 894-900 (1995).
14. Riethman, H. et al. Integration of telomere sequences with the draft human
genome sequence. Nature 409, 948-951 (2001).
15. Ewing, B., Hillier, L., Wendl, M. C. & Green, P. Base-calling of automated
sequencer traces using Phred. I. Accuracy assessment. Genome Res. 8, 175-185 (1998).
16. Ewing, B. & Green, P. Base-calling of automated sequencer traces using Phred.
II. Error probabilities. Genome Res. 8, 186-194 (1998).
17. Gordon, D., Abajian, C. & Green, P. Consed: A graphical tool for sequence
finishing. Genome Res. 8, 195-202 (1998).
18. Kent, W. J. BLAT - The BLAST-Like Alignment Tool. Genome Res. 12, 656-664
(2002).
19. Schuler, G. D. Electronic PCR: bridging the gap between genome mapping and
genome sequencing. Trends Biotechnol. 16, 456-459 (1998).
20. Dib, C. et al. A comprehensive genetic map of the human genome based on
5,264 microsatellites. Nature 380, 152-154 (1996).
21. Broman, K. W., Murray, J. C., Sheffield, V. C., White, R. L. & Weber, J. L.
Comprehensive human genetic maps: individual and sex-specific variation in
recombination. Am. J. Hum. Genet. 63, 861-869 (1998).
22. Deloukas, P. et al. A physical map of 30,000 human genes. Science 282, 744-746
(1998).
23. Pruitt, K. D. & Maglott, D. R. RefSeq and LocusLink: NCBI gene-centered
resources. Nucleic Acids Res. 29, 137-140 (2001).
24. Strausberg, R. L. et al. Generation and initial analysis of more than 15,000
full-length human and mouse cDNA sequences. PNAS 99, 16899-16903 (2002).
25. Bailey, J. A. et al. Recent segmental duplications in the human genome. Science
297, 1003-1007 (2002).