Encyclopedia of the Human Genome August 2001 ©Nature Publishing Group, Macmillan Publishers Limited Optical Mapping Article Type: Technical Level: 3 Text Length: 1500 (10 References, cited in the text) Physical Mapping # Single Molecule Method # Bayesian Inference Mishra, Bud Bud Mishra Courant Institute (NYU) & Cold Spring Harbor Laboratory, New York, USA Optical Mapping is a physical mapping approach that generates an ordered restriction map of a DNA molecule (e.g., a genome or a clone). The resulting restriction map is represented as an ordered enumeration of the restriction sites along with the estimated lengths of the restriction fragments between consecutive restriction sites and various related statistics. These statistics accurately model the errors in estimating the restriction fragment lengths as well as errors due to unrepresented and misrepresented restriction sites in the map. These physical maps have found applications in improving the accuracy and algorithmic efficiency of sequence assembly, validating assembled sequences, characterizing gaps in the assembly and identifying candidates for finishing steps in a sequencing project, etc. Also, because of its inherent simplicity and scalability, optical mapping also provides a fast method for moderate resolution karyotyping and haplotyping. 1 Introduction The study of genetics relies on complete nucleotide sequences of the organism together with a description of their structural organization. While this information at its finest level is not always available, even a sparse collection of fragmentary data with unrecognized assembly and sequencing errors can be of enormous practical utility when combined with genomic maps, as these maps provide a natural structure for organizing, verifying and deciphering the genetic data about chromosomes. Such genomic maps can be categorized into three classes: genetic maps, physical maps and transcription maps. Since these maps are naturally related to each other and can be derived, in principle, from the chromosomal nucleotide sequences, construction of these maps, validation of their mutual consistency and data integration occupy a central role in genomics and bioinformatics. Physical maps are defined as a collection “physical genomic objects” derived from the genome together with their chromosomal positions. Examples of such “physical genomic objects” include restriction fragments, cloned fragments, oligonucleotide probes, etc, or even small unordered subsets of these objects. Frequently physical maps only provide chromosomal positions in an approximate sense – e.g., relative distances of each specific object from the others (measured in base pairs – abbreviated, bp’s) or in the crudest form, just a relative ordering of these objects. For instance, in a BAC map, the “objects” may take the form of a set of restriction fragments of a BAC (Bacterial Artificial Chromosome) and the “positional information” may be provided in terms of the inferred overlaps among these BACs. The quality of a physical map can be provided in terms of its resolution (measured as density of objects per a unit length, e.g., one megabase) and its positional accuracy (measured as the deviation of the inferred positions of objects from their true positions). Optical mapping is a physical mapping approach that provides an ordered enumeration of the restriction fragments along the genome. Equivalently, an optical map can also be described as an ordered enumeration of the restriction sites along with the estimated lengths of the restriction fragments between consecutive restriction sites. A restriction site is the location of a short specific nucleotide sequence (4-8 bp long) where a particular restriction enzyme cleaves the DNA by breaking a phosphodiester bond. The fragment of DNA generated by cleaving at two consecutive restriction sites is a restriction fragment. The length of a restriction fragments can be modeled as an exponentially distributed random variable with an expected length of 4k, if a restriction enzyme recognizing k-bp long nucleotide sequence is employed. For instance, if an optical map is created with the restriction enzyme, BamHI – cleaving at every site containing the 6bp long sequence (GGATCC) – then the resulting map will be composed of restriction fragments each of expected length 4 Kb. The resolution of such a map can be estimated to be of the order of 250 restriction sites per a megabase region of the genome, and can be improved by digesting with several suitably chosen restriction enzymes out of hundreds of naturally occurring such enzymes. The physico-chemical approach underlying optical mapping is based on immobilizing long single DNA molecules on an open glass surface, digesting the 2 molecules on the surface and visualizing the gaps created by restriction activities using fluorescence microscopy. Thus the resulting image, in the absence of any error, would produce an ordered sequence of restriction fragments, whose masses can be measured via relative fluorescence intensity and interpreted as fragment lengths in bps. The corrupting effects of many independent sources of errors affect the accuracy of an optical map created from one single DNA molecule, but can be tamed by combining the optical maps of many single molecules covering completely or partially the same genomic region and by exploiting accurate statistical models of the error sources. To a rough approximation, the resolution and accuracy of an optical map can be arbitrarily improved by simply increasing the number of enzymes and number of molecules, respectively. In a pioneering experiment in 1993 (Schwartz et al., 1993 and Aston et al., 1999), Schwartz and his team demonstrated the feasibility of optical mapping by constructing the maps for several yeast chromosomes. This experiment has been followed by many subsequent improvements in the physical and chemical processes, accurate mathematical modeling, efficient algorithmic development and high throughput automation. In particular, optical mapping has been successfully used to create high-resolution, highaccuracy maps of clones (Cai et al., 1998 and Giacalone et al., 2000) and several wholegenome maps of microorganisms (Jing et al., 1999, Lai et al., 1999 and Lin et al., 1999). Statistical Models and Bayesian Algorithms for Optical Mapping The main error sources limiting the accuracy of an optical map are either due to incorrect identification of restriction sites or due to incorrect estimation of the restriction fragment lengths. Since these error sources interact in a complex manner and involve resolution of the microscopy, imaging and illumination systems, surface conditions, image processing algorithm, digestion rate of the restriction enzyme and intensity distribution along the DNA molecule, statistical Bayesian approaches are used to construct an ensemble consensus map from large number of imperfect maps of single molecules. In the Bayesian approach, the main ingredients are as follows: (1) A model of the map of restriction sites (Hypothesis, H) and (2) A conditional probability distribution function for the single molecule map data given the hypothesis (conditional pdf, f(D|H)). The conditional pdf models the restriction fragment sizing error in terms of a Gaussian distribution, the missing restriction site event (due to partial digestion) as a Bernoulli trial and the appearance of false restriction sites as a Poisson process. Using the Bayes’ formula, the posterior conditional pdf f(H|D) = f(D|H) f(H)/f(D) is computed and provides the means for searching for the best hypothetical model given the set of single molecule experimental data. Since the underlying hypothesis space is high dimensional and the distributions are multi-modal, a naïve computational search must be avoided. An efficient implementation involves approximating the modes of the posterior distribution of the parameters and accurate local search implemented using dynamic programming (Anantharaman et al., 1997 and Anantharaman et al., 1999). The correctness of the constructed map depends crucially on the choice of the experimental parameters (e.g., sizing error, digestion rate, number of molecules). Thus, the feasibility of the entire method can be ensured only by a proper experimental design. 3 The algorithms developed to create optical maps of clones (cosmids, BACs and YACs) as well as genome-wide maps of microorganisms (Plasmodium falciparum, Escherichia coli, Deinococcus radiodurans, etc.) involve essentially the same Bayesian approach and same error models, but vary in how they employ various local approximations in controlling the algorithmic complexity. The Gentig (GENomic conTIG) algorithm (Anantharaman et al., 1999) assembles map contigs of genomic maps from optical maps of long fragments of genomic DNA. Gentig is a greedy algorithm that in any step considers two map contigs and postulates the best possible way these two maps can be aligned and estimates the probability that the proposed alignment may be a “false positive overlap” while accounting for all possible errors. If enough evidence favors the false positive hypothesis, Gentig rejects the postulated overlap. In the absence of such evidence, the overlap is accepted and the contigs are fused together. Gentig, thus, ensures that the map contigs are almost surely correct, and computed efficiently, but occasionally sacrifices the opportunities to join certain contigs. But since optical mapping is used with a very high-coverage data, Gentig’s shortcomings have no perceptible practical effect. Combining Optical Maps with Sequence Data Since there is a close relation between the genome-wide sequence data and the physical map of the genome given by an optical map, ideally, in the absence of any error, they should align with each other and hence can be used to check their mutual consistency and for validation. Even in the presence of errors, if the optical map has high enough resolution and accuracy, then one can infer the best alignment of a sequence to an optical map, and measure the quality of the local alignment to highlight the regions of inconsistency. The alignment algorithm is based on a dynamic programming formulation where the cost function is the log-likelihood of an alignment of a sequence under the assumption that the optical map only suffers from known sizing error, error due to missing and false restriction sites. Thus, optical maps can be useful for validation and annotation (even at an early stage of sequence assembly), gap detection in sequence assembly and targeted gap closing, sequence contig phasing and map assisted sequence assembly. 4 TATCCATCGTATCCATTCGTTGCAAAATCCGGGTACTTGCAAATCGCCGAGTCGATTTGCAACCGAGT Genetic Sequence Data Single Molecule Optical Map – without error Single Molecule Optical Map – with sizing error Single Molecule Optical Map – with missing and false sites Many imperfect Single Molecule Optical Maps TATGTATCCATTCGTTGCAAAATCCGGGTACTTGCAAATCGCNNNNNGATTTGCAACCGAGT Computed Optical Map – aligned to available sequence contigs, checking their mutual consistency Figure 1 illustrates the key ideas for computing an accurate consensus optical map with a statistical algorithm. The input to the algorithm is a set of imperfect single molecule optical maps of the same genomic region. 5 Figure 2 shows a consensus optical map computed by Gentig from a large set of imperfect single molecule optical maps. The single molecule maps are generated in silico. The maps are visualized with Courant Bioinformatics Group’s Genscape. 6 Figure 3 shows an optical map aligned to sequence contigs – also visualized with Genscape. 7 References Anantharaman TS, Mishra B and Schwartz DC (1997) Genomics via Optical Mapping II: Ordered Restriction Maps. Journal of Computational Biology 4(2): 91-118. Anantharaman TS, Mishra B and Schwartz DC (1999) Genomics via Optical Mapping III: Contiging Genomic DNA. Proc. 7th International Conference on Intelligent Systems for Molecular Biology (ISMB 99) 7:18-27. Aston C, Mishra B and Schwartz DC (1999) Optical Mapping and its potential for largescale sequencing projects. Trends in Biotechnology 17:297-302. Cai W, Jing J, Irvin B, Ohler L, Rose E, Shizuya H, Kim UJ, Simon M, Ananthraman T, Mishra B and Schwartz DC (1998) High-resolution restriction maps of bacterial artificial chromosomes constructed by optical mapping. Proc. Natl. Acad. Sci. 95:3390-3395. Giacalone J, Delobette S, Gibaja V, Ni L, Skiadas Y, Qi R, Edington J, Lai Z, Gebauer D, Zhao H, Anantharaman T, Mishra B, Brown LG, Saxena R, Page DC and Schwartz DC (2000) Optical Mapping of BAC Clones from the Human Y Chromosome DAZ Locus. Genome Research 10:1421-1429. Jing J, Lai Z, Aston C, Lin J, Carucci DJ, Gardner MJ, Mishra B, Anantharaman TS, Tettelin H, Cummings LM, Hoffman SL, Venter JC, and Schwartz DC (1999) Optical Mapping of Plasmodium falciparum Chromosome 2. Genome Research 9:175-181. Lai Z, Jing J, Aston C, Clarke V, Apodaca J, Dimalanta ET, Carucci DJ, Gardner MJ, Mishra B, Anantharaman TS, Paxia S, Hoffman SL, Venter JC, Huff EJ and Schwartz DC (1999) A shotgun optical map of the entire Plasmodium falciparum genome. Nature Genetics 23:309-313. Lin J, Qi R, Aston C, Jing J, Anantharaman TS, Mishra B, White O, Daly MJ, Minton KW, Venter JC and Schwartz DC (1999) Whole-Genome Shotgun Optical Mapping of Deinococcus radiodurans. Science 285:1558-1562. Samad A, Cai WW, Hu X, Irvin B, Jing J, Reed J, Meng X, Huang J, Huff E, Porter B, Shenker A, Anantharaman T, Mishra B, Clarke V, Dimalanta E, Eddington J, Hiort C, Rabbah R, Skiadas J and Schwartz DC (1995) Mapping the genome one molecule at a time-Optical Mapping. Nature 378: 516-517. Schwartz DC, Li X, Hernandez LI, Ramnarain SP, Huff EJ and Wang YK (1993) Ordered restriction maps of Saccharomyces cerevisiae chromosomes constructed by optical mapping. Science 262:110-114. 8