Biotechnology Homework 4, Fall 2010: Answers

1. (i) One method is to consider the groups of STSs found on any one BAC and realize that these must be clustered (with the remaining markers to one side or the other). Thus, the clusters are:

abde, abd, aefh, acefh, abde, abdg, adef, aef

Pick any pair and start making progress:

abde vs abd means e is to one side of a, b and d
aefh vs abd means e, f and h are to one side of a, and b and d are on the other side
abdg vs abde means g and e are on either side of abd
adef vs abd means d is closer to a than b is

From the above, the order must be g b d a e [fh]. adef implies f is before h, and acefh vs aefh implies c is after h. Hence, g b d a e f h c must be the order. The minimal extent of each BAC is defined by the STSs it includes and the maximal extent by the STSs that are absent, giving the map below (left vs right has no meaning here, so the mirror-image drawing is the same).

[Figure: STS map with marker order g b d a e f h c; the BAC labels appear in the order 6, 2, 1, 5, 7, 8, 3, 4.]

(ii) In this example there was no ambiguity about STS order (i.e. the mapping process orders the BACs and the STS markers at the same time without requiring any prior knowledge). We have no explicit information about the distance between any pair of adjacent STS markers. If we knew the total number of markers and the total genome size we could calculate the average spacing of markers. Since our BACs are around 200kb the above map implies the spacing of the markers is roughly 50kb. In a real mapping project the density of markers might be similar or perhaps twice as high (i.e. at least 50,000-100,000 STS markers for mapping the human genome, and probably at least 300,000 BACs [3-5 times the density shown here]). Some students were worried about orientation but perhaps did not consider what that means.
The only relevant orientation is which end points towards a particular telomere. That only becomes of consequence when really long contigs have been assembled. In the interim it is not an issue, as merging of BACs into those really long contigs proceeds independently of knowledge of the orientation of each small contig.

(iii) The simplest approach would be to march down the chromosome, as in chromosome walking. So, to move to the left we could collect all BACs that have STS g and cluster them together as above, extending the assembly to the left. We could do the same for STS c on the right and continue to do this until we find a problem (a gap). There are many other possibilities but there must be some sort of logical progression (a plan), even if that is eventually converted to a computer program. Of course, we must have many arbitrary starting points (as above, by looking first at STS a) in order to seed multiple long assemblies (because there are several chromosomes and because we will not straight away be able to link all BACs together on any one chromosome). Some answers confused this type of large contig building with putting DNA sequences together into a contig. This project is not about DNA sequencing and hence “primer walking” is not appropriate. The differences are really in scale and practicality. If primer walking were possible it would be extremely slow (traveling less than 1000bp at each step). However, you need a suitable template for primer walking and cannot nominate one when you cannot identify a cloned piece of DNA covering the required region. The question is essentially about identifying that cloned DNA.

(iv)(a) There are many possible ways. One is informational (because the DNA sequence of an STS is always known). Even before the complete genome sequence was known there were many studies examining regions of the genome and exploring repetitive sequences.
Gradually this built up catalogs of transposable element sequences and of other known repeated sequences. Hence, computers can compare any new sequence to that growing database in order to exclude known repeated sequences. Experimentally (the expected answers), hybridization to genomic DNA could give you the required information. A Southern blot is likely to be very clear (multiple bands with several enzyme digests or stronger bands if tandem repeats not separated by enzyme sites). FISH may show more than one site of hybridization (but not for clustered or tandem repeats). Each of the above is quite a lot of work, so you might consider alternatives. One could use the STS probe on a dot blot and take care to have controls allowing you to measure the strength of the hybridization signal. It is possible also to try a “reverse Southern” where you have a large number of PCR-amplified STSs on a filter and hybridize to labeled genomic DNA. Repeated sequences will give a notably strong signal. You could just take STS primers and amplify from genomic DNA. If you see more than one band you do not have a unique sequence. However, a single band still allows the possibility that you are amplifying several identical sequences from different locations (and that is quite a likely outcome). That could be detected by quantitative PCR but that is a significant extra amount of work when testing many STSs. Each of the above approaches has some validity but the work required to be certain that an STS is unique means that is not always done (and is not a big problem if use of an STS can make clear that it is unsuitable). It was suggested that one could sequence the PCR product from using STS primers on genomic DNA to reveal whether diverse sequences have been amplified even though a single band size is produced. That could work if the repeats were imperfect but if they are perfect then the amplified sequences from different parts of the genome will be identical. 
Other strategies of sequencing or otherwise testing DNA adjacent to STS sites are good ideas in principle and might have elegant solutions, but in a simple rendering are pretty cumbersome (and not fully worked out in the answers given).

(b) If you used an STS that in fact contained repeated sequences you would find that a very large number of BACs scored positive for that STS. Thus, simply looking at the total number of hits for an STS can alert you to a problem. If you then compared this number of hits to those of other markers found on the same set of BACs (picked out by that STS) you would quickly see if only the one STS is over-represented (in which case you should throw out that STS data) or if a unique genomic locus happened to be over-represented. Also, if you tried to assemble all of the BACs picked out by an STS probe you would find that they can be assembled into more than one cluster (from different regions of the genome).

(v) Looking at only BACs 1-4 and using the same method as before, the clusters are abde, abd, aefh and acefh. So e falls outside the abd block, and bd and fh lie on either side of ae. The order of markers is therefore (bd) a e (fh), with the order within bd and within fh unknown.

[Figure: map of BACs 2, 1, 3 and 4 against the marker order (bd) a e (fh).]

The left-hand end-points of BACs 1 and 2 are not known relative to each other, and the same is true of the right-hand end-points of BACs 3 and 4.

By similar arguments, using all eight BACs but only markers g, d, e and h:

[Figure: map of BACs 6, 2, 1, 5, 7, 8, 3 and 4 against the marker order g d e h.]

Relative end-points are not clear on the left for (2, 1, 5, 7) and (8, 3, 4), or on the right for (6, 2), (1, 5, 7, 8) and (3, 4). In all cases BACs can be arranged in some sort of correct order and overlaps can certainly be found.
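The clustering argument used in (i) and (v) can also be checked by brute force. The sketch below is an illustration only, not the method expected in the answers: it assumes the eight clusters listed in (i) belong to BACs 1-8 in the order given, and the function names are made up for this example. It tries every marker order and keeps those in which the markers of each BAC sit next to one another.

```python
from itertools import permutations

# STS content of each BAC, assuming the clusters in 1(i) are BACs 1-8 in order.
BAC_MARKERS = {
    1: "abde", 2: "abd", 3: "aefh", 4: "acefh",
    5: "abde", 6: "abdg", 7: "adef", 8: "aef",
}


def is_contiguous(order, markers):
    """True if all markers of one BAC occupy adjacent positions in the order."""
    position = {marker: i for i, marker in enumerate(order)}
    hits = sorted(position[m] for m in markers)
    return hits[-1] - hits[0] == len(hits) - 1


def consistent_orders(bacs):
    """Return every marker order in which each BAC's markers are contiguous."""
    all_markers = sorted(set("".join(bacs.values())))
    return [
        "".join(order)
        for order in permutations(all_markers)
        if all(is_contiguous(order, markers) for markers in bacs.values())
    ]


if __name__ == "__main__":
    # Each solution appears twice, once per orientation (mirror image).
    print(consistent_orders(BAC_MARKERS))
    # Restricting to BACs 1-4 leaves more than one order, as argued in (v).
    subset = {bac: BAC_MARKERS[bac] for bac in (1, 2, 3, 4)}
    print(len(consistent_orders(subset)))
```

With all eight BACs the only surviving orders are g b d a e f h c and its mirror image, which matches the reasoning in (i). Exhaustive search is only practical for a handful of markers; a real project would need a smarter algorithm.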
The important points are therefore (a) that STS mapping is possible with relatively few BACs (there must be a sufficient number that they each have at least one overlap at each end) and markers, and (b) that greater marker and BAC density will give better resolution of end-points (and, of course, more end-points). It is important to note the positive (a) and the limitations (b), and to be very careful with language. None of the maps with fewer markers or BACs are “incorrect” and when describing them as uncertain or having less confidence there is some ambiguity. It is best to be explicit about exactly what information is lost: precision of knowledge about the extent of overlaps. It is also important to remember that the objective of mapping is to order the BAC inserts relative to each other, not to map STS markers.

2. (i) It is obvious that BAC-1 and 3 share almost all fragments, whereas very few fragments beyond the five used to identify this set of BACs are common to any other pair. It is therefore clear that BACs 1 and 3 overlap extensively and come from the same genomic region. For the others you might wonder if they only overlap a little or if they share a set of 6-8 fragments just by coincidence (because size resolution is far from perfect). You could certainly dismiss the idea that all five BACs overlap by trying to merge a third BAC with any pair of merged BACs (they simply do not fit). If (other than 1 and 3) they do not all overlap then it is likely that no pair overlaps since they all have similar numbers of shared fragments. So, the best guess might be that 1 and 3 is the only overlapping pair. You might be more certain of that conclusion if you did some sophisticated calculations or if you had direct experience of this kind of experiment. However, that is not critical here. What is important is to conclude that there is no convincing evidence of BAC1 overlap with any BAC other than BAC3.
Hence, in a real mapping experiment you should not try and merge BACs unless they have very extensive overlap of fragments. It is critical for this whole question to appreciate the idea that two fragments of apparently the same size are not necessarily of exactly the same size and therefore not necessarily from the same region of the genome. Non-equivalence of that kind will be extremely common because the number of fragments examined is enormous (maybe about 50 x 200,000) and official resolution (50bp here) is nowhere near single bp resolution. Many answers discussed only BAC1 and 3. Discussion of the other BACs is slightly trickier & so some explicit arguments must be made (beyond the idea that the evidence is not so clear as for BAC-1 & 3) to arrive at a decision. (ii) If the library is dense enough you would hope to find many pairs like BAC1 and 3, which overlap extensively. So, an appropriately conservative strategy would be to take BAC1 and find the BACs with the closest matches in restriction enzyme fingerprint. BAC-3 might be the closest on one side and another BAC might be very close to BAC-1 on the other side (displaying one or two unique fragments not present in BAC-3). You could then (proceeding in each direction outwards) take this new BAC and, separately BAC-3 and find their closest matches and add them to the growing merge. In this way you can keep taking small steps outwards without ever having to question whether overlaps are by co-incidence. For this to be true you really need almost every fragment in a pair of BACs to overlap in size. Some answers mentioned a method that is close to this but, not in my opinion, identical- namely a statistic for the degree of overlap required to call something genuine (depends on genome size and insert sizes). That was used historically in genome projects. 
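To make the idea of "apparently the same size" concrete, here is a minimal sketch of how two fingerprints might be compared. It is an illustration under stated assumptions, not the scoring used in any real mapping project: fragments are paired greedily if their sizes agree within a tolerance (50bp here, taken from the resolution quoted above), and the fragment sizes and function names are invented for the example.

```python
def shared_fragments(sizes_a, sizes_b, tolerance_bp=50):
    """Count fragments from two BACs whose sizes agree within the tolerance.

    Both lists are sorted and walked together, so each fragment is paired
    with at most one fragment from the other BAC.
    """
    a, b = sorted(sizes_a), sorted(sizes_b)
    i = j = shared = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance_bp:
            shared += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return shared


def shared_fraction(sizes_a, sizes_b, tolerance_bp=50):
    """Shared fragments as a fraction of the smaller fingerprint."""
    return shared_fragments(sizes_a, sizes_b, tolerance_bp) / min(len(sizes_a), len(sizes_b))


if __name__ == "__main__":
    # Hypothetical fragment sizes in bp, not the homework data.
    bac_x = [1000, 2000, 3000, 4000]
    bac_y = [1020, 2500, 3040, 4100]
    print(shared_fragments(bac_x, bac_y), shared_fraction(bac_x, bac_y))
```

A conservative strategy like the one described in (ii) would only merge pairs whose shared fraction is close to 1. A score like this says nothing about whether matching sizes come from the same genomic region, which is exactly the caution raised in (i).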
Such a statistic may allow you to proceed with slightly smaller libraries than required by my argument above, where very extensive overlap is required but not precisely defined.

(iii) If you had too few BACs you would not have pairs that overlap over most of their length and you would not be able to make the easy decisions (about genuine overlap) described above. Instead, in most cases, even with the closest pair of BACs (in terms of the number of apparently common restriction fragments) you would be uncertain of whether the BACs genuinely shared overlap. Exactly what the cutoff should be to determine if a match should be accepted is not trivial to determine. It will depend on the accuracy of measuring fragment lengths, the size of the genome and the number of BAC clones (and is determined on those bases for real mapping projects, where it is termed the Sulston cutoff). From the data here (which are not authentic) we argued in (i) that the five BACs cannot all overlap, yet roughly 25% of fragments were in common between pairs of BACs. From that you would say that 25% of similarly sized fragments is not good enough to infer an overlap. Hence, you would require greater overlap of BACs to be able to call an overlap with this method. Notice that this is very different to the analogous answer for STS mapping. STS mapping can work with quite a small library but fingerprinting requires very extensive overlaps, and hence a very big library. You might think that the requirement for a big library means that unnecessary work will be done. However, (a) that type of repetitive mapping work is fairly easy to expand, (b) very large libraries can be made because electroporation of E. coli is so efficient, and (c) the multiple overlapping BACs can be useful in clearly identifying outliers (artifacts) and assuring you that the rest of the BACs are fine and properly aligned. To approximate how big the library would need to be: if you allowed only 3-4 restriction fragments to differ, that would mean requiring overlaps for all but about 20kb, or 10% of BAC size. If the BACs were evenly spaced that means you would need about 10-fold coverage of the genome. They won’t be so perfectly spaced, so perhaps 20-fold would be about right.

(iv)(a) Fingerprint analysis is usually done with a 6-cutter enzyme such as EcoRI or HindIII. Such enzymes cut on average once every 4^6 bp (4096 bp), or roughly every 4kb. Of course, you will occasionally get a 35kb fragment and a 25bp fragment, but on average you would expect to see about 50 bands from each BAC and the majority of bands in the 1-10kb range. You would expect to be able to position almost all of the restriction sites (small fragments are ignored) on a map of merged BACs. As discussed in Q1, STS markers may well be 20-50kb apart. Hence, you could easily have a deletion in a BAC between two STS markers, which does not delete either marker and would be unnoticed. However, any such deletion would either remove one or more restriction fragments or alter their size in a way that is easily detected. The affected BAC could be merged on the basis of other fragments. The potential artifact described is important to detect. If you happened to pick one such BAC clone as part of a minimal tiling path for sequencing you would end up with the wrong sequence for that portion of the genome.

(b) The restriction map constructed from overlapping BACs ought to be the same as in the genome from where the BAC fragments were cloned. However, cloning artifacts (changes to the BAC fragments) and mistakes in aligning them can occur. To check for this you could make probes from various segments of BACs (even whole BACs if they contain no repetitive DNA sequences) and use them on genomic Southern blots to see if the genomic restriction fragment sizes match those in your map.
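As a side calculation, the rough numbers quoted in (iii) and (iv)(a) can be reproduced in a few lines. This is a back-of-the-envelope sketch, not part of the original answers. It assumes random sequence with equal base frequencies, a 200kb BAC insert (the size used in Question 1) and evenly spaced BAC start points; the function names are invented for the example.

```python
BAC_SIZE_BP = 200_000   # BAC insert size assumed in Question 1
SITE_LENGTH = 6         # recognition site length of a 6-cutter such as EcoRI


def mean_fragment_length(site_length):
    """Average spacing of a recognition site in random sequence (4 equally likely bases)."""
    return 4 ** site_length


def expected_fragments(insert_bp, site_length):
    """Approximate number of restriction fragments expected from one insert."""
    return insert_bp / mean_fragment_length(site_length)


def coverage_for_offset(insert_bp, max_offset_bp):
    """Fold coverage needed if neighbouring inserts may start at most max_offset_bp apart."""
    return insert_bp / max_offset_bp


if __name__ == "__main__":
    print(mean_fragment_length(SITE_LENGTH))                    # spacing of sites in bp
    print(round(expected_fragments(BAC_SIZE_BP, SITE_LENGTH)))  # bands per BAC
    print(coverage_for_offset(BAC_SIZE_BP, 20_000))             # fold coverage
```

The spacing comes out at 4096bp and the band count at about 49, in line with the "roughly every 4kb" and "about 50 bands" figures above. The coverage figure is the idealized 10-fold value, before allowing for uneven spacing of clones.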
A key point, missed by most, is that you can actually make such restriction maps for uncloned genomic DNA, whereas STS maps from uncloned DNA would be hard (perhaps FISH hybridization positions, but these would have very low resolution).

(c) Once you have a complete genome sequence you can use that information to construct a map of restriction enzyme sites and compare this with your clone map. This allows you to compare the two maps very accurately (errors in measuring fragment size and ignoring small fragments in fingerprinting account for acceptable differences, as would RFLPs). For STSs the genome sequence could verify the order but that would provide no verification for the space between STSs (95% of the genome).

3. (i) It is very important to generate a diversity of starting-points for sequencing. As an absolute minimum you would require starting points to be separated by no more than the length of a single sequence (say 700nt) in order to be sure you get full coverage of any strand. Sau3A sites, which occasionally will be separated by as much as 2-3kb, would make that impossible. In reality you require even more extensive staggering of start-points so that you can cover any region of sequence several times. For BAC alignments the end-points of consecutive BAC inserts can be staggered by many kb without compromising your ability to detect the overlaps (even by restriction enzyme fingerprinting) and so the spacing of Sau3A sites is adequate.

(ii) In the first, bulk phase of a shotgun sequencing project you simply take each sequencing read as a separate piece of information. So long as it represents correct contiguous sequence it does not matter where the sequence read came from. In a 2 x 3kb composite clone a sequencing read (from each end) would not be long enough to cross the junction between the two inserts, so the entire sequence being read should be correct.
Although mate-pair information can be crucial to finish a sequence, that information is never used for the majority of sequenced clones. A composite would indeed give misleading mate-pair information, perhaps leading you to test an incorrect scaffolding arrangement and to attempt to link sequences that cannot be linked. Note, however, that you would not actually link the two sequences incorrectly unless you performed primer walking on the composite clone, which would actually reveal that it is longer than expected.

(iii) As explained above, we only use sequence information. We do not actually order plasmid subclones and we do not need to know the entire sequence of any of those plasmid clones.

(iv)(a) You should see overlaps among Seq 1-4 and between Seq 5-6, as illustrated below:

[Figure: staggered alignment of Seq1, Seq2, Seq4 and Seq3 (in that order) as one contig, drawn against a scale marked 0, 100, 300 and 550, and a separate alignment of Seq5 with Seq6.]

(b) Once you test an alignment you can decide whether to accept it and how to designate a merge. Normally, as indicated in the question, the merge will reflect uncertainties by introducing N where two sequences do not match properly (where N might be a choice of two different nucleotides or a nucleotide vs a space). If you use an N-ridden merge it will not match well further and your final product will in any case contain many Ns, the nature of which will eventually have to be resolved. All this simply re-states the idea in the question that producing too many Ns in a merge is very undesirable. So the simplest idea is to require that any alignment is perfect or near-perfect over its entire length.
That way “Poor Seq2” would never be included in any merge. Alternatively, you could require a perfect match over only part of the sequence and you could trim Poor Seq2 so that only the good portion became incorporated. The first idea of simply ignoring a poor sequence read (or actively throwing it out once recognized as giving multiple partial alignments) is almost certainly the most effective. You deliberately create a large excess of sequencing reads in these projects and can afford to reject many sequences if the quality is not high enough.

(c) In a real project you would simply keep looking for merges among many more sequences. If you have enough sequences, merges will become longer and longer without any special steps or intervention.

(d) Connecting requires obtaining new DNA sequence that overlaps the two contigs. An obvious strategy is primer walking, where you design a primer based on sequences near the end of each contig (and pointing into unknown territory). You also need a template. Here, we have a single BAC which is certain to include the required sequences so that can be used directly. In whole genome shotgun sequencing it would be necessary to go back to the plasmid (or BAC) clone templates used to generate sequence reads that are near the ends of the contigs we wish to join. If you looked for mate-pairs (two sequences in different contigs that were read from the two ends of the same plasmid) you could get an idea of how much missing sequence there is and be able to identify a template that could bridge the entire gap.

(e) Both ends of plasmid subclones and BAC clones are sequenced and labeled appropriately. Hence, you look for sequences near the end of a contig and ask if the mate pair for that sequence is found on another contig. If it is, then the two contigs represent nearby sequences (separated by less than the size of the insert in the plasmid in question).
You could then use that plasmid as template for primer walking, eventually generating sequences that spanned the two contigs.

(f) You should see that the new BACs have overlaps with each of the previously assembled contigs and that they overlap each other in the repeated sequences. Potentially therefore these two new BACs can complete one contig and if the repeats are tandem (rather than dispersed) that linkage is reasonable to make. The key point, however, is that we have no idea how many tandem repeats are really present in the parent (BAC) sequence. If there were additional clones with more repeat units that could set a minimum. However, we can only establish a maximum (and the exact number) if we have a sequence that spans the entire set.

(g) If one sequence read proceeds from unique sequence through repeats to unique sequence we can define the exact correct sequence. This can only happen if the total length of repeats is less than about 700bp. We might still worry that during generation of that clone there had been some recombination event that eliminated some repeats compared to the original source DNA (so seeing the same result in more than one clone would be reassuring).
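The mate-pair check described in (e) can be sketched in a few lines. This is a minimal illustration, not part of the original answers: the clone names, read names and contig labels are hypothetical, and the sketch ignores how close each read lies to the end of its contig, which a real analysis would also need to check.

```python
from collections import defaultdict


def contig_links(read_to_contig, mate_pairs):
    """Group clones by the pair of different contigs that hold their two end reads.

    read_to_contig maps a read name to the contig it was assembled into.
    mate_pairs maps a clone name to the names of its two end reads.
    Clones with an unplaced read, or with both reads in one contig, are skipped.
    """
    links = defaultdict(list)
    for clone, (read_a, read_b) in mate_pairs.items():
        contig_a = read_to_contig.get(read_a)
        contig_b = read_to_contig.get(read_b)
        if contig_a is None or contig_b is None or contig_a == contig_b:
            continue
        links[tuple(sorted((contig_a, contig_b)))].append(clone)
    return dict(links)


# Hypothetical example: three plasmid clones, two contigs.
READ_TO_CONTIG = {
    "p1.fwd": "contigA", "p1.rev": "contigA",
    "p2.fwd": "contigA", "p2.rev": "contigB",
    "p3.fwd": "contigB",  # p3.rev was never placed in a contig
}
MATE_PAIRS = {
    "p1": ("p1.fwd", "p1.rev"),
    "p2": ("p2.fwd", "p2.rev"),
    "p3": ("p3.fwd", "p3.rev"),
}

if __name__ == "__main__":
    print(contig_links(READ_TO_CONTIG, MATE_PAIRS))
```

In this example only clone p2 links the two contigs, so it would be the candidate template for primer walking across the gap.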