Optical Mapping - NYU Computer Science Department

advertisement
Encyclopedia of the Human Genome
August 2001
©Nature Publishing Group, Macmillan Publishers Limited
Optical Mapping
Article Type: Technical
Level: 3
Text Length: 1500
(10 References, cited in the text)
Physical Mapping # Single Molecule Method # Bayesian Inference
Mishra, Bud
Bud Mishra
Courant Institute (NYU) & Cold Spring Harbor Laboratory, New York, USA
Optical Mapping is a physical mapping approach that generates an ordered restriction
map of a DNA molecule (e.g., a genome or a clone). The resulting restriction map is
represented as an ordered enumeration of the restriction sites along with the estimated
lengths of the restriction fragments between consecutive restriction sites and various
related statistics. These statistics accurately model the errors in estimating the restriction
fragment lengths as well as errors due to unrepresented and misrepresented restriction
sites in the map. These physical maps have found applications in improving the accuracy
and algorithmic efficiency of sequence assembly, validating assembled sequences,
characterizing gaps in the assembly and identifying candidates for finishing steps in a
sequencing project, etc. Also, because of its inherent simplicity and scalability, optical
mapping also provides a fast method for moderate resolution karyotyping and
haplotyping.
1
Introduction
The study of genetics relies on complete nucleotide sequences of the organism
together with a description of their structural organization. While this information at its
finest level is not always available, even a sparse collection of fragmentary data with
unrecognized assembly and sequencing errors can be of enormous practical utility when
combined with genomic maps, as these maps provide a natural structure for organizing,
verifying and deciphering the genetic data about chromosomes. Such genomic maps can
be categorized into three classes: genetic maps, physical maps and transcription maps.
Since these maps are naturally related to each other and can be derived, in principle, from
the chromosomal nucleotide sequences, construction of these maps, validation of their
mutual consistency and data integration occupy a central role in genomics and
bioinformatics.
Physical maps are defined as a collection “physical genomic objects” derived
from the genome together with their chromosomal positions. Examples of such “physical
genomic objects” include restriction fragments, cloned fragments, oligonucleotide
probes, etc, or even small unordered subsets of these objects. Frequently physical maps
only provide chromosomal positions in an approximate sense – e.g., relative distances of
each specific object from the others (measured in base pairs – abbreviated, bp’s) or in the
crudest form, just a relative ordering of these objects. For instance, in a BAC map, the
“objects” may take the form of a set of restriction fragments of a BAC (Bacterial
Artificial Chromosome) and the “positional information” may be provided in terms of the
inferred overlaps among these BACs. The quality of a physical map can be provided in
terms of its resolution (measured as density of objects per a unit length, e.g., one
megabase) and its positional accuracy (measured as the deviation of the inferred
positions of objects from their true positions).
Optical mapping is a physical mapping approach that provides an ordered
enumeration of the restriction fragments along the genome. Equivalently, an optical map
can also be described as an ordered enumeration of the restriction sites along with the
estimated lengths of the restriction fragments between consecutive restriction sites. A
restriction site is the location of a short specific nucleotide sequence (4-8 bp long) where
a particular restriction enzyme cleaves the DNA by breaking a phosphodiester bond. The
fragment of DNA generated by cleaving at two consecutive restriction sites is a
restriction fragment. The length of a restriction fragments can be modeled as an
exponentially distributed random variable with an expected length of 4k, if a restriction
enzyme recognizing k-bp long nucleotide sequence is employed. For instance, if an
optical map is created with the restriction enzyme, BamHI – cleaving at every site
containing the 6bp long sequence (GGATCC) – then the resulting map will be composed
of restriction fragments each of expected length 4 Kb. The resolution of such a map can
be estimated to be of the order of 250 restriction sites per a megabase region of the
genome, and can be improved by digesting with several suitably chosen restriction
enzymes out of hundreds of naturally occurring such enzymes.
The physico-chemical approach underlying optical mapping is based on
immobilizing long single DNA molecules on an open glass surface, digesting the
2
molecules on the surface and visualizing the gaps created by restriction activities using
fluorescence microscopy. Thus the resulting image, in the absence of any error, would
produce an ordered sequence of restriction fragments, whose masses can be measured via
relative fluorescence intensity and interpreted as fragment lengths in bps. The corrupting
effects of many independent sources of errors affect the accuracy of an optical map
created from one single DNA molecule, but can be tamed by combining the optical maps
of many single molecules covering completely or partially the same genomic region and
by exploiting accurate statistical models of the error sources. To a rough approximation,
the resolution and accuracy of an optical map can be arbitrarily improved by simply
increasing the number of enzymes and number of molecules, respectively.
In a pioneering experiment in 1993 (Schwartz et al., 1993 and Aston et al., 1999),
Schwartz and his team demonstrated the feasibility of optical mapping by constructing
the maps for several yeast chromosomes. This experiment has been followed by many
subsequent improvements in the physical and chemical processes, accurate mathematical
modeling, efficient algorithmic development and high throughput automation. In
particular, optical mapping has been successfully used to create high-resolution, highaccuracy maps of clones (Cai et al., 1998 and Giacalone et al., 2000) and several wholegenome maps of microorganisms (Jing et al., 1999, Lai et al., 1999 and Lin et al., 1999).
Statistical Models and Bayesian Algorithms for Optical Mapping
The main error sources limiting the accuracy of an optical map are either due to
incorrect identification of restriction sites or due to incorrect estimation of the restriction
fragment lengths. Since these error sources interact in a complex manner and involve
resolution of the microscopy, imaging and illumination systems, surface conditions,
image processing algorithm, digestion rate of the restriction enzyme and intensity
distribution along the DNA molecule, statistical Bayesian approaches are used to
construct an ensemble consensus map from large number of imperfect maps of single
molecules. In the Bayesian approach, the main ingredients are as follows: (1) A model of
the map of restriction sites (Hypothesis, H) and (2) A conditional probability distribution
function for the single molecule map data given the hypothesis (conditional pdf, f(D|H)).
The conditional pdf models the restriction fragment sizing error in terms of a Gaussian
distribution, the missing restriction site event (due to partial digestion) as a Bernoulli trial
and the appearance of false restriction sites as a Poisson process. Using the Bayes’
formula, the posterior conditional pdf f(H|D) = f(D|H) f(H)/f(D) is computed and
provides the means for searching for the best hypothetical model given the set of single
molecule experimental data. Since the underlying hypothesis space is high dimensional
and the distributions are multi-modal, a naïve computational search must be avoided. An
efficient implementation involves approximating the modes of the posterior distribution
of the parameters and accurate local search implemented using dynamic programming
(Anantharaman et al., 1997 and Anantharaman et al., 1999). The correctness of the
constructed map depends crucially on the choice of the experimental parameters (e.g.,
sizing error, digestion rate, number of molecules). Thus, the feasibility of the entire
method can be ensured only by a proper experimental design.
3
The algorithms developed to create optical maps of clones (cosmids, BACs and
YACs) as well as genome-wide maps of microorganisms (Plasmodium falciparum,
Escherichia coli, Deinococcus radiodurans, etc.) involve essentially the same Bayesian
approach and same error models, but vary in how they employ various local
approximations in controlling the algorithmic complexity. The Gentig (GENomic
conTIG) algorithm (Anantharaman et al., 1999) assembles map contigs of genomic maps
from optical maps of long fragments of genomic DNA. Gentig is a greedy algorithm that
in any step considers two map contigs and postulates the best possible way these two
maps can be aligned and estimates the probability that the proposed alignment may be a
“false positive overlap” while accounting for all possible errors. If enough evidence
favors the false positive hypothesis, Gentig rejects the postulated overlap. In the absence
of such evidence, the overlap is accepted and the contigs are fused together. Gentig, thus,
ensures that the map contigs are almost surely correct, and computed efficiently, but
occasionally sacrifices the opportunities to join certain contigs. But since optical mapping
is used with a very high-coverage data, Gentig’s shortcomings have no perceptible
practical effect.
Combining Optical Maps with Sequence Data
Since there is a close relation between the genome-wide sequence data and the
physical map of the genome given by an optical map, ideally, in the absence of any error,
they should align with each other and hence can be used to check their mutual
consistency and for validation. Even in the presence of errors, if the optical map has high
enough resolution and accuracy, then one can infer the best alignment of a sequence to an
optical map, and measure the quality of the local alignment to highlight the regions of
inconsistency. The alignment algorithm is based on a dynamic programming formulation
where the cost function is the log-likelihood of an alignment of a sequence under the
assumption that the optical map only suffers from known sizing error, error due to
missing and false restriction sites.
Thus, optical maps can be useful for validation and annotation (even at an early stage
of sequence assembly), gap detection in sequence assembly and targeted gap closing,
sequence contig phasing and map assisted sequence assembly.
4
TATCCATCGTATCCATTCGTTGCAAAATCCGGGTACTTGCAAATCGCCGAGTCGATTTGCAACCGAGT
Genetic Sequence Data
Single Molecule Optical Map –
without error
Single Molecule Optical Map – with
sizing error
Single Molecule Optical Map – with
missing and false sites
Many imperfect Single Molecule
Optical Maps
TATGTATCCATTCGTTGCAAAATCCGGGTACTTGCAAATCGCNNNNNGATTTGCAACCGAGT
Computed Optical Map – aligned to
available sequence contigs, checking
their mutual consistency
Figure 1 illustrates the key ideas for computing an accurate consensus optical map with a statistical
algorithm. The input to the algorithm is a set of imperfect single molecule optical maps of the same
genomic region.
5
Figure 2 shows a consensus optical map computed by Gentig from a large set of imperfect single
molecule optical maps. The single molecule maps are generated in silico. The maps are visualized
with Courant Bioinformatics Group’s Genscape.
6
Figure 3 shows an optical map aligned to sequence contigs – also visualized with Genscape.
7
References
Anantharaman TS, Mishra B and Schwartz DC (1997) Genomics via Optical Mapping II:
Ordered Restriction Maps. Journal of Computational Biology 4(2): 91-118.
Anantharaman TS, Mishra B and Schwartz DC (1999) Genomics via Optical Mapping
III: Contiging Genomic DNA. Proc. 7th International Conference on Intelligent Systems
for Molecular Biology (ISMB 99) 7:18-27.
Aston C, Mishra B and Schwartz DC (1999) Optical Mapping and its potential for largescale sequencing projects. Trends in Biotechnology 17:297-302.
Cai W, Jing J, Irvin B, Ohler L, Rose E, Shizuya H, Kim UJ, Simon M, Ananthraman T,
Mishra B and Schwartz DC (1998) High-resolution restriction maps of bacterial artificial
chromosomes constructed by optical mapping. Proc. Natl. Acad. Sci. 95:3390-3395.
Giacalone J, Delobette S, Gibaja V, Ni L, Skiadas Y, Qi R, Edington J, Lai Z, Gebauer D,
Zhao H, Anantharaman T, Mishra B, Brown LG, Saxena R, Page DC and Schwartz DC
(2000) Optical Mapping of BAC Clones from the Human Y Chromosome DAZ Locus.
Genome Research 10:1421-1429.
Jing J, Lai Z, Aston C, Lin J, Carucci DJ, Gardner MJ, Mishra B, Anantharaman TS,
Tettelin H, Cummings LM, Hoffman SL, Venter JC, and Schwartz DC (1999) Optical
Mapping of Plasmodium falciparum Chromosome 2. Genome Research 9:175-181.
Lai Z, Jing J, Aston C, Clarke V, Apodaca J, Dimalanta ET, Carucci DJ, Gardner MJ,
Mishra B, Anantharaman TS, Paxia S, Hoffman SL, Venter JC, Huff EJ and Schwartz
DC (1999) A shotgun optical map of the entire Plasmodium falciparum genome. Nature
Genetics 23:309-313.
Lin J, Qi R, Aston C, Jing J, Anantharaman TS, Mishra B, White O, Daly MJ, Minton
KW, Venter JC and Schwartz DC (1999) Whole-Genome Shotgun Optical Mapping of
Deinococcus radiodurans. Science 285:1558-1562.
Samad A, Cai WW, Hu X, Irvin B, Jing J, Reed J, Meng X, Huang J, Huff E, Porter B,
Shenker A, Anantharaman T, Mishra B, Clarke V, Dimalanta E, Eddington J, Hiort C,
Rabbah R, Skiadas J and Schwartz DC (1995) Mapping the genome one molecule at a
time-Optical Mapping. Nature 378: 516-517.
Schwartz DC, Li X, Hernandez LI, Ramnarain SP, Huff EJ and Wang YK (1993)
Ordered restriction maps of Saccharomyces cerevisiae chromosomes constructed by
optical mapping. Science 262:110-114.
8
Download