FZ4201 Assignment I Part 1

Good against Evil

The essay question has two parts: describe the methodology and discuss it. I will therefore start with a description of the methods and strategies of both parties involved and end with a discussion.

Description

The International Human Genome Sequencing Consortium

The Consortium was a collaboration of 20 groups from different countries, brought together to produce a draft human genome sequence. To produce such a draft, a technique called basic shotgun sequencing was considered first. This technique could not be used on a repeat-rich genome such as the human genome, because misalignment and misassembly would occur all too frequently. Two solutions were available:

Whole-genome shotgun analysis: this technique had been used in the past for the repeat-poor genomes of viruses, bacteria and flies, using linking information and computational analysis to avoid misassemblies.

Hierarchical shotgun sequencing: this technique works from large-insert clones (100-200 kb). Some of these clones may suffer rearrangement, but this problem can be reduced by using 'clone fingerprints'.

The Consortium decided to use hierarchical shotgun sequencing for several reasons [1]:

1. Once the draft sequence was complete, the ultimate frequency of misassembly would probably be lower than with the whole-genome approach, in which it would be more difficult to identify regions where the assembly was incorrect.
2. Heterozygosity and SNPs can make assembly more difficult with the whole-genome approach, whereas in hierarchical shotgun sequencing each large clone is derived from a single haplotype and does not suffer from these problems.
3. Hierarchical shotgun sequencing would be better able to deal with cloning biases, because it would be easier to sequence under-represented sequences afterwards.
4. Hierarchical shotgun sequencing allowed work and responsibility to be distributed internationally.

Figure 1. Hierarchical shotgun sequencing technique used by the HGP [2].
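The repeat problem that ruled out basic shotgun sequencing can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the sequences, the read set and the function names `overlap` and `successors` are invented for illustration and do not come from either project's software. A read that ends inside a repeat overlaps equally well with the reads that follow each copy of that repeat, so the assembler cannot tell which one is the true neighbour. When the reads come from a single large clone that contains only one copy, the ambiguity disappears.

```python
# Toy illustration (hypothetical names and sequences, not project code).

def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def successors(read, reads, min_len=3):
    """Reads that could follow `read`, judged by suffix-prefix overlap alone."""
    return [r for r in reads if r != read and overlap(read, r, min_len) > 0]


REPEAT = "GATTACA"

# Toy genome with two copies of the same repeat between unique stretches.
genome = "CCTG" + REPEAT + "TTGC" + REPEAT + "AGGT"

r1 = "CCTG" + REPEAT   # ends inside repeat copy 1
r2 = REPEAT + "TTGC"   # true neighbour of r1
r3 = "TTGC" + REPEAT   # ends inside repeat copy 2
r4 = REPEAT + "AGGT"   # follows repeat copy 2, not copy 1

# Whole-genome read pool: reads from both repeat copies are mixed together.
whole_genome_reads = [r1, r2, r3, r4]

# Reads from one large clone that happens to span only repeat copy 1.
clone_reads = [r1, r2]

print(len(successors(r1, whole_genome_reads)))  # 2 candidates: ambiguous
print(len(successors(r1, clone_reads)))         # 1 candidate: unambiguous
```

In the mixed pool both `r2` and `r4` look like valid continuations of `r1`. Restricting assembly to the reads of one clone leaves a single candidate, which is the intuition behind sequencing clone by clone.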
The Human Genome Project (HGP) adopted a 'map first, sequence later' approach [1] (figure 1). Fragments of DNA of roughly 100-200 kb are produced by restriction enzymes and inserted into synthetic chromosomes known as bacterial artificial chromosomes (BACs). These BACs are then grown in batches and subsequently mapped onto the genome's chromosomes by looking for distinctive marker sequences, called sequence tagged sites (STSs), whose locations had already been pinpointed.

Figure 2. Working backwards from the gel to the genome [3].

Working backwards (figure 2) from the physical map created with the STSs, a fingerprint clone contig was assembled using a computer program called FPC, which analyses the restriction enzyme digestion patterns of the large-insert clones. Clones were then selected for sequencing so as to minimise overlap between adjacent clones; a clone was chosen only if all of its restriction enzyme fragments were shared with at least one of its neighbours on each side in the contig. Once these overlapping clones had been sequenced, the set was called a 'sequence-clone contig' (figure 3).

Figure 3. Assembly done by FPC [1].

When all the selected clones from a fingerprint clone contig had been sequenced on slab-gel or capillary-based devices, the sequence-clone contig was the same as the fingerprint clone contig. After all contigs were sequenced, GigAssembler merged them and tried to order and orient them, thereby creating the draft genome (figure 4).

Figure 4. Assembly of the contigs into scaffolds [1].

Celera Genomics

Whole-genome strategy

Celera chose a mixed strategy [4]: a whole-genome assembly and a regional chromosome assembly, each combining sequence data from Celera with the publicly available data of the Human Genome Project.
The first strategy combined data from both parties, with the HGP data entering as additional synthetic shotgun data. The second strategy was a compartmentalized assembly process that first divided the Celera and HGP data into scaffolds localized to larger chromosomal segments, which were assembled afterwards. The final step in assembling the draft genome was to order and orient the scaffolds on the chromosomes, which was done using the physical mapping markers.

Celera's approach left out the mapping stage with the BACs and went straight to subcloning the DNA fragments in plasmids (figure 5). This saved time and effort, but it made the assembly more dependent on algorithms and computers.

Figure 5. Whole-genome shotgun sequencing method [2].

The reason for leaving out the mapping step was explained by Gene Myers, vice-president of informatics research at Celera, and James Weber of the Marshfield Medical Research Foundation in Wisconsin. They argued that the algorithms used to reassemble cloned fragments could equally be applied to random cloned fragments taken from the genome as a whole [5]. The correct position of the resulting scaffolds on the genome was then worked out using STSs.

Discussion

Two researchers, Michael Olivier (Stanford) and George Church (Harvard), found that the draft assemblies were similar in size, contain comparable numbers of unique sequences…and exhibit similar statistics [6]. Because the assembly process differed between the two projects, however, the gap distributions and contig sizes were different, and this must have had an impact on the quality of the sequence. Because the HGP provides four phases of sequence data, and so exposes more stages of assembly than Celera, its sequence should be the more accurate one.

Two articles, published at the same time as the Celera and HGP papers, discuss several features of the strategies. Coincidentally or not, the article in Science [7] favours the Celera strategy, while the article in Nature [3] favours the HGP strategy.
Galas [7] explains that because Celera used paired-end sequences to link contigs together, they could put the contigs in the right order and orientation. Because they sequenced plasmid clones of several known sizes and always generated sequence pairs at known distances from each other, they could also place the contigs at the right distance from each other, even when there were gaps in between.

One weakness of the HGP approach, explained by Olson [3], was that there was not always a BAC clone to cover every part of the genome, and that overlaps between clones could have been obscured by data errors or by the presence of large-scale repeats in the genome. So why did the HGP researchers choose BACs? Because the finishing phase is easier: the internal gaps and discrepancies can be resolved by subcloning each BAC.

Conclusion

We could conclude that Celera used more or less the same strategy as the HGP, but in reverse order. Celera first sequenced the contigs and mapped them afterwards, while the HGP first mapped the contigs and then assembled them into a genome. Both strategies had advantages and disadvantages, and it is difficult to determine which one produced the better genome sequence. One thing remains clear though: two sequences are better than one.

References

1. International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature 409, 860-921.
2. Campbell & Heyer (2002) Sequencing whole genomes. Available from: Http://occawlonline.pearsoned.com/bookbind/pubbooks/bc_mcampbell_genomics_1/medialib/method/shotgun.html [Accessed 20th October 2004].
3. Olson, M. V. (2001) The maps: Clone by clone by clone. Nature 409, 816-818.
4. Venter, J. C. et al. (2001) The Sequence of the Human Genome. Science 291, 1304-1351.
5. Weber, J. L. & Myers, E. W. (1997) Human whole-genome shotgun sequencing. Genome Research 7, 401-409.
6. Aach, J. et al. (2001) Computational comparison of two draft sequences of the human genome. Nature 409, 856-859.
7. Galas, D. J. (2001) Making sense of the sequence. Science 291, 1257.