Title: Polony haplotyping of individual human chromosome molecules.

One sentence summary: Multiplex polony amplifications and parallel single-base identification in single molecules of human chromosomes permit up to 150-megabase-range sequence analyses for a wide variety of genomic applications.

Kun Zhang1, Jun Zhu1,2, Jay Shendure1, Gregory J. Porreca1, John D. Aach1, Robi D. Mitra3, George M. Church1

1. Department of Genetics, Harvard Medical School, Boston, MA 02115
2. Institute of Genome Science and Policy, Duke University, Durham, NC 27708
3. Department of Genetics, Washington University, St. Louis, MO 63108

Correspondence should be addressed to: G.M.C (g1m1c1@receptor.med.harvard.edu)

ABSTRACT

We report a long-range multi-locus haplotyping method based on DNA polony technology. By immobilizing thousands of intact chromosomes within a polyacrylamide gel on a microscope slide and performing multiple amplifications from single molecules (MASMO), we determined 5-locus haplotypes spanning a 153 Mb region of human chromosome 7, and found evidence of rare mitotic recombination events in human lymphocytes. Furthermore, the parallel nature of DNA polony technology allows efficient haplotyping of pooled DNAs from a population on one slide. The patterns of linkage disequilibrium (LD) established by haplotyping pooled DNA are more accurate than those derived from statistically inferred haplotypes.

TEXT

Haplotypes are important for mapping human disease genes, diagnosing loss of heterozygosity in cancer, investigating cis-effects in gene expression, and determining long-range order in the face of DNA repeats, yet current molecular haplotyping technologies are inefficient and limited to short distances, usually less than 100 kb(1-4). Therefore, most association studies rely on haplotypes reconstructed from genotypes using statistical methods(5-8), without taking account of the uncertainty in inference.
The variance of inferred haplotype frequencies often compromises the power of haplotype analyses. For long-range (>100 kb) haplotypes, the accuracy of statistical methods has not been rigorously evaluated because of the difficulty in experimentally determining long-range haplotypes. Construction of somatic cell hybrids(9) is the only experimental approach available, but it is practical only for small sample sizes(10).

In this work, we developed an efficient method for multi-locus long-range haplotyping (Figure 1). DNA polony (polymerase colony) technology(11) has been used to generate parallel and independent PCR amplifications of thousands or tens of thousands of template molecules in a thin layer (~40 µm) of polyacrylamide attached to a standard microscope slide. The localized clonal amplicons (polonies) can subsequently be queried in parallel by in situ sequencing(12) or probing(13). The polony approach is particularly well suited to haplotyping because the template molecules initially immobilized in the gel can be very long, possibly entire chromosomes, and multiple regions of the molecule can be amplified to form overlapping polonies that are spatially segregated from those of other templates. SNPs in these regions can then be interrogated by single base extension (SBE) from adjacent primers with labeled nucleotides, allowing the template's haplotype to be read directly over its entire length. The throughput of polony haplotyping is at least three orders of magnitude higher than that of other molecular haplotyping technologies such as M1-PCR(2).

In a previous pilot experiment, we demonstrated polony haplotyping on two SNPs separated by 45 kb(3). Here we significantly extend the range of the technique, demonstrate its application to pooled DNA samples, and compare the polony haplotyping method to statistical haplotype inference methods.
By requiring only single-base differences over vast distances, this method secures continuity across large repeats and invariant sequences, and hence extends the methods in a companion paper (Shendure et al., submitted), which links paired-end sequence tags at known, yet shorter, distances (i.e., kbp).

Two major technical challenges must be overcome to efficiently perform chromosome-wide haplotyping. First, haplotyping on multiple loci across entire chromosomes requires multiplexed PCR on single DNA molecules. We previously demonstrated amplification of two loci from a single DNA molecule(3), but amplifying more than two amplicons from one template molecule has not been demonstrated. We wanted the multiplexed amplification to be highly efficient because, for a given number of template molecules, the number of overlapping polonies is proportional to the co-amplification efficiency. In addition, the amplification needs to be highly specific to the target regions in the presence of a complex genome, because non-specific amplification can compete with target-specific amplification for the limited amount of reagents within the polyacrylamide gel and lead to poor signal. We developed a computational procedure to simulate multiplexed amplification in order to quantitatively evaluate non-specific amplification across the entire genome. This procedure allowed us to select primers that amplified target regions at least 10,000-fold more efficiently than the sum of all non-specific amplicons at each PCR cycle.

The second technical challenge is to efficiently amplify large DNA molecules in a manner that results in overlapping polonies. This is difficult because, on the one hand, large DNA molecules must be packed in an extremely compact form and immobilized within the polyacrylamide gel prior to amplification in order to generate overlapping polonies. On the other hand, condensed long DNA molecules need to be sufficiently “loosened” after immobilization to allow for efficient amplification.
Towards this end, we prepared human metaphase chromosomes in which long DNA molecules 40-300 Mb in length were highly condensed into a form less than 1 µm in size. Initially, polony amplification proved difficult using standard protocols(11). We developed a method to make the DNA molecules accessible to DNA polymerase and PCR primers by first rupturing chromosome structure with the expansion force of rapidly freezing water and then removing chromosome binding proteins with proteinase digestion.

These improvements enabled us to perform haplotyping on 5 SNPs spanning most of human chromosome 7 (153 of 158 Mb). We selected these SNPs so that the haplotypes could be unambiguously determined based on their genotypes and associated pedigree information (Figure 2a). By stacking polony images of different SNP assays on top of each other, we obtained polony clusters, each representing a multi-locus haplotype of a single DNA molecule (chromosome). We observed correct haplotypes on single chromosomes in both the two-locus analysis (Figure 2b) and the multi-locus analysis (Figure 2c). In one polony slide, we found polony clusters from 414 chromosomes that gave haplotype reads of at least three SNPs. Of these, 404 haplotypes were compatible with the expected haplotypes based on the pedigree, including 11 complete haplotypes of all 5 SNPs.

Ten observed haplotypes (2.5%) were incompatible with the true haplotypes. We manually analyzed the corresponding polony images. Three (0.7%) were clearly due to errors in the polony identification algorithm arising from segmentation failures. The other 7 polony clusters showed good morphology and strong signals. One hypothesis to explain these unexpected haplotypes is that they are artifacts due to random overlaps of two chromosome fragments or uncondensed DNAs. If this were the case, one would not expect to observe pairs of complementary haplotypes. In fact, we observed two pairs of complementary haplotypes (Figure 2d).
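
The compatibility test and the search for complementary haplotypes described above can be sketched in Python as follows. This is a minimal illustration, not the analysis software used in this work. The function names, the use of '-' for a SNP without a call, and the two example haplotypes are assumptions introduced here (they are not the GM10835 haplotypes). A partial read is treated as compatible with a haplotype if every called SNP matches, and two reads are treated as complementary if they match the two reciprocal products of a single crossover between the expected haplotypes.

```python
def is_compatible(read, haplotype, missing="-"):
    """True if every called SNP in the (possibly partial) read matches the haplotype."""
    return all(r == missing or r == h for r, h in zip(read, haplotype))


def crossover_products(hap_a, hap_b):
    """List (breakpoint, product_1, product_2) for a single crossover between
    each pair of adjacent SNPs; the two products are reciprocal."""
    return [
        (k, hap_a[:k] + hap_b[k:], hap_b[:k] + hap_a[k:])
        for k in range(1, len(hap_a))
    ]


def classify_read(read, hap_a, hap_b):
    """Label a read as 'expected', 'recombinant' (single crossover) or 'other'."""
    if is_compatible(read, hap_a) or is_compatible(read, hap_b):
        return "expected"
    for _, prod_1, prod_2 in crossover_products(hap_a, hap_b):
        if is_compatible(read, prod_1) or is_compatible(read, prod_2):
            return "recombinant"
    return "other"


def are_complementary(read_1, read_2, hap_a, hap_b):
    """True if the two reads match the reciprocal products of one crossover."""
    for _, prod_1, prod_2 in crossover_products(hap_a, hap_b):
        if is_compatible(read_1, prod_1) and is_compatible(read_2, prod_2):
            return True
        if is_compatible(read_1, prod_2) and is_compatible(read_2, prod_1):
            return True
    return False


# Hypothetical 5-SNP haplotypes of one individual, heterozygous at every SNP.
HAP_A = "CTAGC"
HAP_B = "TCGAT"

print(classify_read("CT-GC", HAP_A, HAP_B))   # partial read of HAP_A
print(classify_read("CTAAT", HAP_A, HAP_B))   # crossover between SNP 3 and SNP 4
print(are_complementary("CTAAT", "TCGGC", HAP_A, HAP_B))
```

Because partial reads can match more than one crossover product, a real analysis would need to restrict this test to reads with calls on both sides of the putative breakpoint; the sketch leaves that choice to the caller.
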
The most plausible explanation for these unexpected haplotypes is therefore rare mitotic recombination.

We have shown that multi-locus haplotypes can be determined accurately when one sample is assayed per polony slide. To take full advantage of the highly parallel nature of polony amplification, we evaluated a more efficient strategy: haplotyping pooled samples. Genomic DNAs (or intact chromosomes) from numerous individuals were mixed in equal ratios, and haplotypes were read from thousands of single molecules on a single slide in one experiment. We focused on the CD36 (14 SNPs in 26 kb) and NOS3 (9 SNPs in 19 kb) genes, which were completely re-sequenced in the SeattleSNP project. We performed haplotyping on the pooled genomic DNA of the same 24 individuals in the HD50AA panel (Supp. Table 1). Compared to haplotyping of individual DNA, where (except for rare recombination events) the two haplotypes of an individual are clearly revealed by hundreds to thousands of 99% accurate polony reads, haplotyping of pooled DNA does not always allow all haplotypes on a slide to be determined unambiguously. Therefore, for purposes of evaluation, we also experimentally determined all 48 haplotypes among the 24 individuals for the CD36 gene by performing polony haplotyping with one sample per slide (Supp. Table 2). We used these 48 haplotypes as reference haplotypes to characterize the accuracy of the pooled method, and to compare this method to statistical haplotype inference methods.

We performed multi-locus haplotype inference from genotypes using two algorithms: PHASE(6), which is based on a coalescent framework, and SNPHAP, which is an implementation of the EM algorithm. Since all pair-wise LD statistics are defined on two-locus haplotypes, we also evaluated a third method, reconstructing two-locus haplotypes for each pair of SNPs from genotypes using the EM algorithm.

From the pooled samples, we obtained on average 19,556 polony reads at CD36 and 16,313 at NOS3 in one polony slide.
For each pair of SNPs, we observed on average 275 two-locus haplotype reads at CD36 and 97 reads at NOS3. We often observed partial haplotypes with one or more SNPs missing (Supp. Figure 1a and 1b), because amplification from single molecules is not 100% efficient. We focused on polony clusters (haplotypes) with good calls for at least 4 SNPs. Of a total of 3900 such haplotype reads in the CD36 gene, only 81 (2.07%) were incompatible with true haplotypes. In contrast, 4 of 48 (8.3%) haplotypes inferred by PHASE were incorrect; all of them were rare haplotypes occurring only once in the population. Similarly, all 3 incorrect haplotypes inferred by SNPHAP occur only once in the population. However, the haplotype with the lowest frequency (0.2%) is actually a true haplotype. These results indicate that current multi-locus haplotype inference methods are not reliable in predicting rare haplotypes, which could be due to the effect of small sample size.

We next performed two-locus haplotype analysis and established the LD patterns in CD36 and NOS3. Two commonly used LD statistics, |D’| and r², determined by the pooled polony haplotyping method are highly consistent with the values calculated from reference haplotypes (Pearson R²: 0.892 for |D’|, and 0.974 for r²). In contrast, all three haplotype inference methods give good estimates of r² values, but very poor estimates of |D’| values (Table 1). These observations show that the pooled polony haplotyping method is superior to haplotype inference methods in establishing LD patterns. Interestingly, the two-locus EM method actually gives slightly more accurate estimates of LD statistics than PHASE and SNPHAP at both CD36 (Table 1) and NOS3 (Supp. Table 3). To take advantage of this finding, we developed a computer program (Genotype2LDBlock) for LD and haplotype block analyses using unphased genotypes. The complexity of this algorithm is approximately O(n) when dealing with thousands or tens of thousands of SNPs.
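
For reference, the two-locus quantities compared above can be computed as in the following Python sketch. It is a textbook illustration under stated assumptions, not the Genotype2LDBlock program and not necessarily the exact procedure used in this work. Genotypes are assumed to be coded as 0, 1 or 2 copies of an arbitrarily chosen allele at each SNP, and the function names are hypothetical. The EM step resolves only double heterozygotes, the one genotype class whose phase is ambiguous; |D’| and r² then follow from the four estimated haplotype frequencies using their standard definitions.

```python
def two_locus_em(genotypes, max_iter=200, tol=1e-12):
    """Estimate two-locus haplotype frequencies [p00, p01, p10, p11] from
    unphased genotypes given as (g1, g2) pairs with g1, g2 in {0, 1, 2}."""
    if not genotypes:
        raise ValueError("at least one genotype is required")
    fixed = [0.0, 0.0, 0.0, 0.0]   # haplotype counts with unambiguous phase
    n_double_het = 0               # individuals heterozygous at both SNPs
    for g1, g2 in genotypes:
        if g1 == 1 and g2 == 1:
            n_double_het += 1
            continue
        a1 = (0, 1) if g1 == 1 else (g1 // 2, g1 // 2)
        a2 = (0, 1) if g2 == 1 else (g2 // 2, g2 // 2)
        for c in (0, 1):           # the two chromosomes of the individual
            fixed[2 * a1[c] + a2[c]] += 1.0
    total = 2.0 * len(genotypes)
    p = [0.25, 0.25, 0.25, 0.25]
    for _ in range(max_iter):
        # E-step: probability that a double heterozygote carries 00 and 11
        cis, trans = p[0] * p[3], p[1] * p[2]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: re-estimate frequencies from expected counts
        new_p = [
            (fixed[0] + n_double_het * w) / total,
            (fixed[1] + n_double_het * (1.0 - w)) / total,
            (fixed[2] + n_double_het * (1.0 - w)) / total,
            (fixed[3] + n_double_het * w) / total,
        ]
        converged = max(abs(a - b) for a, b in zip(p, new_p)) < tol
        p = new_p
        if converged:
            break
    return p


def ld_stats(p00, p01, p10, p11):
    """Return (|D'|, r2) for two biallelic SNPs, or None if either is monomorphic."""
    p_a = p00 + p01                # frequency of allele 0 at the first SNP
    p_b = p00 + p10                # frequency of allele 0 at the second SNP
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom == 0:
        return None
    d = p00 - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))
    return abs(d) / d_max, d * d / denom


# Toy data: two homozygotes and one double heterozygote (hypothetical).
freqs = two_locus_em([(0, 0), (2, 2), (1, 1)])
print(freqs, ld_stats(*freqs))
```

The same `ld_stats` function applies unchanged to haplotype frequencies obtained by direct counting of polony reads, in which case no EM step is needed.
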
The efficiency of the Genotype2LDBlock algorithm is very important, given that recent advances in genome-wide genotyping technologies have permitted large-scale genotyping at a reasonably low cost(14, 15).

Besides LD statistics, we also compared the mean absolute errors (MAEs) of the frequencies of all two-locus haplotypes. The pooled polony haplotyping method gives the lowest MAE, although the differences among all four methods are within 2-fold (Table 2).

Finally, as the polony technology allows haplotyping of a large DNA pool in one experiment, we sought to study the effect of population size on LD analyses by typing the complete set of 50 individuals in the African American human variation panel (HD50AA). We observed weaker LD at both CD36 and NOS3 in this larger population, although the overall patterns are similar (Supp. Figure 2). This suggests that the level of LD, as well as the size of haplotype blocks, can be inflated when the population size is small, because rare haplotypes are unlikely to be observed in a small sample. This finding is consistent with a previous simulation study based on haplotypes generated with a coalescent model(16).

In this report, we have implemented a polony haplotyping technology that can determine multi-locus haplotypes on templates ranging from intact chromosomes to genomic DNA from single individuals or pooled samples. We were able to obtain correct 5-locus haplotypes spanning a 153 megabase region of human chromosome 7. This is a dramatic improvement over current molecular haplotyping methods, which are limited to only two SNPs separated by physical distances three orders of magnitude smaller. We have demonstrated that the polony haplotyping technology can be implemented in a highly efficient manner.
The efficiency comes from two aspects: (i) parallel haplotyping on thousands of single molecules within a single slide allows us to use pooled DNA from a population; (ii) multi-locus haplotyping on a single slide is more efficient than two-locus haplotyping methods such as AS-PCR. In the case of the CD36 gene, we were able to obtain 275 two-locus haplotype reads for each pair of SNPs. This means that, for a population of 50 individuals, a 3-fold redundancy can be achieved with a single polony slide. A 14-locus haplotyping experiment on 50 individuals is equivalent to 2600 AS-PCR assays, an improvement of more than three orders of magnitude.

Polony haplotyping is also very cost efficient. The cost of a pooled experiment on 25 individuals is ~$3 per SNP per individual, but it drops rapidly to ~$0.25 when the sample size increases to 1000 individuals, because acrydite-modified amplification primers account for the majority of the cost in a small-scale experiment. As was shown above, LD levels can be inflated when the analyses are based on a small sample size. Reliable estimates of LD require large sample sizes. As one can easily handle 10 to 20 polony slides in an experiment, the polony haplotyping technology is especially efficient in dealing with large sample sizes.

One limitation of polony haplotyping is that most haplotype reads are partial haplotypes because the amplification efficiencies from single molecules never reach 100%. This is not an issue when typing a single individual on a slide: the two complete haplotypes of an individual can be unambiguously determined from all combinations of two-locus haplotypes. When typing a pooled sample, one solution is to combine polony haplotyping with haplotype inference methods(17), especially ones based on a Bayesian model. The accuracy of haplotype inference will be improved by constructing prior distributions based on all partial haplotype reads obtained in polony haplotyping experiments.
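
For the single-individual case mentioned above, the following Python sketch illustrates one way that partial reads could be combined into the two complete haplotypes. It is an assumption-laden illustration rather than the software used in this work: the function name, the '-' symbol for a missing call, the majority vote over pairwise phase, and the example data are all introduced here. Each read that covers two heterozygous SNPs votes on whether the first-listed alleles of those SNPs lie on the same chromosome, and the phase is then propagated from the first SNP through all linked SNPs.

```python
from collections import defaultdict, deque


def phase_from_partial_reads(alleles, reads):
    """Assemble the two haplotypes of one individual from partial reads.

    alleles: list of (x, y) allele pairs, one per heterozygous SNP.
    reads:   strings with one character per SNP, '-' where no call was made.
    Returns (hap_1, hap_2), or None if some SNP cannot be linked to the first.
    """
    n = len(alleles)
    votes = defaultdict(int)  # (i, j) -> net vote; > 0 means first alleles are in cis
    for read in reads:
        # keep only calls that match one of the two known alleles at that SNP
        called = [(i, c) for i, c in enumerate(read) if c in alleles[i]]
        for x in range(len(called)):
            for y in range(x + 1, len(called)):
                i, ci = called[x]
                j, cj = called[y]
                si = 1 if ci == alleles[i][0] else -1
                sj = 1 if cj == alleles[j][0] else -1
                votes[(i, j)] += si * sj
    # Breadth-first propagation of phase relative to the first SNP.
    sign = {0: 1}
    queue = deque([0])
    while queue:
        i = queue.popleft()
        for j in range(n):
            if j in sign:
                continue
            v = votes.get((min(i, j), max(i, j)), 0)
            if v != 0:
                sign[j] = sign[i] * (1 if v > 0 else -1)
                queue.append(j)
    if len(sign) < n:
        return None
    hap_1 = "".join(alleles[i][0] if sign[i] == 1 else alleles[i][1] for i in range(n))
    hap_2 = "".join(alleles[i][1] if sign[i] == 1 else alleles[i][0] for i in range(n))
    return hap_1, hap_2


# Hypothetical individual, heterozygous at three SNPs, with four partial reads.
snp_alleles = [("C", "T"), ("C", "T"), ("A", "G")]
partial_reads = ["CT-", "-TG", "TC-", "-CA"]
print(phase_from_partial_reads(snp_alleles, partial_reads))
```

A majority vote tolerates a small fraction of erroneous reads for a given SNP pair, but it does not model them; ties and unlinked SNPs are reported as unresolved rather than guessed.
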
In addition, haplotypes can be assigned to each individual in the population when genotypes are available.

Another limitation is that the pooled polony haplotyping method is not as accurate as haplotyping each individual separately. Potential sources of error include: (i) errors in DNA quantification, so that the pooled sample does not consist of equal amounts of DNA from each individual, (ii) false-positive or false-negative polonies due to the polony identification algorithm, (iii) weak signals in SBE assays on some alleles, and (iv) random sampling errors. The error is likely to be reduced by further optimization of the experimental protocols.

The third limitation is that polony haplotyping scales less efficiently with the number of SNPs than with the number of individuals in one assay, because typing each SNP requires one SBE assay and polyacrylamide gels are likely to break after ~25 SBE assays. Gel life can be improved by further automation of the SBE assay. At the current stage of technology development, this technology is most appropriate for candidate-gene-based studies. It was recently reported that, in the two-stage study design, the error rate of haplotype reconstruction for tagging SNPs is significantly higher than that for the full set of SNPs from which these tagging SNPs were selected(18). The polony haplotyping technology is particularly useful for establishing the patterns of LD between tagging SNPs experimentally in the case and control populations in the second stage of association studies.

One unexpected finding in our chromosome-wide haplotyping experiment is that the method could be a good assay for detecting rare mitotic recombination events. Mitotic recombination changes the haplotypes of heterozygous loci. Because of the parallel nature of the polony haplotyping technology, it is well suited for detecting a small number of rare haplotypes within a population of common haplotypes.
Although occurring less frequently than meiotic recombination, mitotic recombination has been observed in many eukaryotes, including yeast(19), mouse(20) and human(21), and has implications in many human diseases, such as neurofibromatosis type 1(22) and ataxia-telangiectasia(21). Polony haplotyping allows detection of mitotic recombination events across the whole genome in unmodified cells, and subsequent mapping of crossovers to a resolution that is limited only by the availability of SNPs. This technology makes it possible to study the intensity and distribution of mitotic recombination events in different stages of tumor development, through which the mechanisms of loss-of-heterozygosity (LOH) during tumor initiation and progression could be unveiled.

Furthermore, studies on meiotic recombination hotspots have attracted considerable interest recently(23, 24). Most of these studies rely on single sperm genotyping, which is difficult to scale up. Polony haplotyping can overcome the limitations associated with single sperm genotyping(25).

Because the uncertainty of statistically inferred haplotypes may result in significant power loss in association studies(26), and because current haplotyping methods are inefficient and expensive, recent efforts have been made to develop statistical strategies that account for the variance in the frequencies of inferred haplotypes(27), to improve the accuracy of haplotype inference with pedigree information(28), or to conduct association studies with genotypes instead of haplotypes(26). However, if determined accurately and efficiently, haplotypes still provide several advantages over genotypes(29, 30). The polony haplotyping method makes haplotyping significantly more efficient and less costly than conventional methods.
This technology could potentially change the cost structure and catalyze more theoretical research on efficient study designs that achieve maximal power in association studies, with fewer concerns about the uncertainty of haplotype inference or the monetary and labor costs of obtaining accurate haplotypes.

References

1. S. Michalatos-Beloin, S. A. Tishkoff, K. L. Bentley, K. K. Kidd, G. Ruano, Nucleic Acids Res 24, 4841 (Dec 1, 1996).
2. C. Ding, C. R. Cantor, Proc Natl Acad Sci U S A 100, 7449 (Jun 24, 2003).
3. R. D. Mitra et al., Proc Natl Acad Sci U S A 100, 5926 (May 13, 2003).
4. A. T. Woolley, C. Guillemette, C. Li Cheung, D. E. Housman, C. M. Lieber, Nat Biotechnol 18, 760 (Jul, 2000).
5. A. G. Clark, Mol Biol Evol 7, 111 (Mar, 1990).
6. M. Stephens, N. J. Smith, P. Donnelly, Am J Hum Genet 68, 978 (Apr, 2001).
7. T. Niu, Z. S. Qin, X. Xu, J. S. Liu, Am J Hum Genet 70, 157 (Jan, 2002).
8. Z. S. Qin, T. Niu, J. S. Liu, Am J Hum Genet 71, 1242 (Nov, 2002).
9. J. A. Douglas, M. Boehnke, E. Gillanders, J. M. Trent, S. B. Gruber, Nat Genet 28, 361 (Aug, 2001).
10. N. Patil et al., Science 294, 1719 (Nov 23, 2001).
11. R. D. Mitra, G. M. Church, Nucleic Acids Res 27, e34 (Dec 15, 1999).
12. R. D. Mitra, J. Shendure, J. Olejnik, O. Edyta Krzymanska, G. M. Church, Anal Biochem 320, 55 (Sep 1, 2003).
13. J. Zhu, J. Shendure, R. D. Mitra, G. M. Church, Science 301, 836 (Aug 8, 2003).
14. D. A. Hinds et al., Science 307, 1072 (Feb 18, 2005).
15. K. L. Gunderson, F. J. Steemers, G. Lee, L. G. Mendoza, M. S. Chee, Nat Genet 37, 549 (May, 2005).
16. N. Wang, J. M. Akey, K. Zhang, R. Chakraborty, L. Jin, Am J Hum Genet 71, 1227 (Nov, 2002).
17. A. G. Clark et al., Am J Hum Genet 63, 595 (Aug, 1998).
18. J. Forton et al., Am J Hum Genet 76, 438 (Mar, 2005).
19. F. Prado, F. Cortes-Ledesma, P. Huertas, A. Aguilera, Curr Genet 42, 185 (Jan, 2003).
20. C. A. Hendricks et al., Proc Natl Acad Sci U S A 100, 6325 (May 27, 2003).
21. M. S. Meyn, Science 260, 1327 (May 28, 1993).
22. H. Kehrer-Sawatzki et al., Am J Hum Genet 75, 410 (Sep, 2004).
23. W. Winckler et al., Science 308, 107 (Apr 1, 2005).
24. A. J. Jeffreys, R. Neumann, M. Panayi, S. Myers, P. Donnelly, Nat Genet 37, 601 (Jun, 2005).
25. L. Kauppi, A. J. Jeffreys, S. Keeney, Nat Rev Genet 5, 413 (Jun, 2004).
26. A. P. Morris, J. C. Whittaker, D. J. Balding, Am J Hum Genet 74, 945 (May, 2004).
27. P. Kraft, D. G. Cox, R. A. Paynter, D. Hunter, I. De Vivo, Genet Epidemiol 28, 261 (Apr, 2005).
28. M. T. Schouten, C. K. Williams, C. S. Haley, Genetics (Jun 8, 2005).
29. J. Akey, L. Jin, M. Xiong, Eur J Hum Genet 9, 291 (Apr, 2001).
30. A. G. Clark, Genet Epidemiol 27, 321 (Dec, 2004).

Acknowledgments. We thank Nick Reppas, Joshua Akey and Li Jin for critical review of the manuscript. This work was supported by an NHGRI-CEGS grant.

Figures

Figure 1
Figure 2a
Figure 2b
Figure 2c
Figure 2d

Tables

Table 1. Correlation (Pearson R²) of the |D’| (bottom-left triangle) and r² (top-right triangle) values in the CD36 gene determined by different methods. *True haplotypes were determined experimentally by polony haplotyping on genomic DNA from each individual.

                     True haplotypes*   Pooled polony (24)   PHASE   SNPHAP   Two-locus EM
True haplotypes*     -                  0.974                0.963   0.959    0.971
Pooled polony (24)   0.892              -                    0.958   0.967    0.954
PHASE                0.306              0.368                -       0.980    0.982
SNPHAP               0.388              0.417                0.808   -        0.972
Two-locus EM         0.396              0.419                0.648   0.681    -

Table 2. Comparison of MAEs of two-locus haplotype frequencies among different methods.

       Pooled polony (24)   PHASE    SNPHAP   Two-locus EM
MAE    0.0158               0.0232   0.0191   0.0213

Figure legends

Figure 1. Multi-locus long-range haplotyping. Limited amounts of genomic DNA or intact chromosomes from individuals or pools of individuals were trapped within polyacrylamide gel on the surface of a standard microscope slide. Multiplexed PCR amplifications were performed in parallel on each template molecule in situ using standard polony procedures.
Genotypes/haplotypes were then queried by multiple rounds of single base extension assays.

Figure 2. Haplotyping on intact chromosomes. (a) The two 5-locus haplotypes in GM10835 can be unambiguously determined with pedigree information. (b) Haplotypes were determined based on overlapping polonies. Polonies representing the four alleles of the two SNPs are color-coded (rs3778973: C=blue, T=green; rs4747028: C=yellow, T=red). Overlapping polonies represent haplotypes (left panel: TT=yellow; mid panel: CT=purple; right panel: CC=white). We observed 137 yellow polonies and 131 white polonies in one polony slide. These are correct haplotype reads. In contrast, there are only 3 instances of alternative haplotypes, which may be due to mitotic recombination. (c) Polony clusters corresponding to the correct 5-locus haplotypes. For each haplotype, the top image stripe is the composite intensity polony image, and the bottom stripe shows the polony objects identified by our software. The five frames in each image stripe represent the exact same region of the polony images for the five SNPs. (d) Two pairs of recombinant haplotypes, each likely resulting from a single mitotic recombination event.