Answers #4

Biotechnology
Homework 4
Fall 2010
Answers
1. (i) One method is to consider groups of STSs on any one BAC and realize that these must be
clustered (with the remaining markers to one side or the other).
Thus, the clusters are
abde
abd
aefh
acefh
abde
abdg
adef
aef
Pick any pair & start making progress…
abde vs abd means e is to one side of a, b and d
aefh vs abd means e,f, h are to one side of a, and b, d on the other side
abdg vs abde means g and e are either side of abd
adef vs abd means d is closer to a than b is
From the above, the order must be g b d a e [fh]
adef implies f is before h, and acefh vs aefh implies c is after h
Hence, g b d a e f h c must be the order. Each BAC's minimal extent is defined by the STSs it
includes and its maximal extent by the nearest STSs that are absent, giving the map below (left vs
right has no meaning here, so the mirror-image drawing is equivalent).
[Map of the BACs against the STS order; X marks an STS present in the BAC.]
STS      g  b  d  a  e  f  h  c
BAC 6    X  X  X  X
BAC 2       X  X  X
BAC 1       X  X  X  X
BAC 5       X  X  X  X
BAC 7          X  X  X  X
BAC 8             X  X  X
BAC 3             X  X  X  X
BAC 4             X  X  X  X  X
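The clustering argument in (i) can also be checked by brute force. The sketch below assumes the eight clusters listed above are BACs 1-8 in that order (consistent with part (v)) and simply tests every marker order, keeping those in which each BAC's STSs sit next to each other. It is an illustration of the logic, not the method a real mapping program would use, since trying all orders is only feasible for a handful of markers.

```python
from itertools import permutations

# STS content of BACs 1-8 (assumed numbering: the order in which the
# clusters are listed in the answer).
bacs = {
    1: "abde", 2: "abd", 3: "aefh", 4: "acefh",
    5: "abde", 6: "abdg", 7: "adef", 8: "aef",
}

def is_contiguous(order, markers):
    """True if the markers occupy consecutive positions in the order."""
    positions = [order.index(m) for m in markers]
    return max(positions) - min(positions) == len(markers) - 1

def consistent_orders(bacs):
    """Try every marker order; keep those where each BAC is one block."""
    markers = sorted(set("".join(bacs.values())))
    return [
        "".join(p)
        for p in permutations(markers)
        if all(is_contiguous(p, content) for content in bacs.values())
    ]

orders = consistent_orders(bacs)
print(orders)
```

Only one order and its mirror image should survive, which is the point made above about left vs right having no meaning.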
(ii) In this example there was no ambiguity about STS order (i.e. the mapping process orders the
BACs and the STS markers at the same time without requiring any prior knowledge). We have no
explicit information about the distance between any pair of adjacent STS markers. If we knew the
total number of markers and the total genome size we could calculate the average spacing of
markers. Since our BACs are around 200kb the above map implies the spacing of the markers is
roughly 50kb. In a real mapping project the density of markers might be similar or perhaps twice as
high (i.e. at least 50,000-100,000 STS markers for mapping the human genome, and probably at least
300,000 BACs [3-5 times the density shown here]).
Some students were worried about orientation but perhaps did not consider what that means. The
only relevant orientation is which end points towards a particular telomere. That only becomes of
consequence when really long contigs have been assembled. In the interim it is not an issue as
merging of BACs into those really long contigs proceeds independently of knowledge of orientation
of each small contig.
(iii) The simplest approach would be to march down the chromosome, as in chromosome walking.
So, to move to the left we could collect all BACs that have STS g and cluster them together as
above, extending the assembly to the left. We could do the same for STS c on the right and
continue to do this until we find a problem (a gap). There are many other possibilities but there
must be some sort of logical progression (a plan), even if that is eventually converted to a computer
program. Of course, we must have many arbitrary starting points (as above, by looking first at STS
a) in order to seed multiple long assemblies (because there are several chromosomes and
because we will not straight away be able to link all BACs together on any one chromosome).
Some answers confused this type of large contig building with putting DNA sequences together into
a contig. This project is not about DNA sequencing & hence “primer walking” is not appropriate.
The differences are really in scale and practicality. If primer walking were possible it would be
extremely slow (traveling less than 1000bp at each step). However, you need a suitable template
for primer walking and cannot nominate one when you cannot identify a cloned piece of DNA
covering the required region. The question is essentially about identifying that cloned DNA.
(iv)(a) There are many possible ways. One is informational (because the DNA sequence of an STS
is always known). Even before the complete genome sequence was known, many studies had
examined regions of the genome and explored repetitive sequences. Gradually this built up
catalogs of transposable element sequences and of other known repeated sequences. Hence,
computers can compare any new sequence to that growing database in order to exclude known
repeated sequences.
Experimentally (the expected answers), hybridization to genomic DNA could give you the required
information. A Southern blot is likely to be very clear (multiple bands with several enzyme digests or
stronger bands if tandem repeats not separated by enzyme sites). FISH may show more than one
site of hybridization (but not for clustered or tandem repeats). Each of the above is quite a lot of
work, so you might consider alternatives. One could use the STS probe on a dot blot and take care
to have controls allowing you to measure the strength of the hybridization signal. It is possible also
to try a “reverse Southern” where you have a large number of PCR-amplified STSs on a filter and
hybridize to labeled genomic DNA. Repeated sequences will give a notably strong signal.
You could just take STS primers and amplify from genomic DNA. If you see more than one band
you do not have a unique sequence. However, a single band still allows the possibility that you are
amplifying several identical sequences from different locations (and that is quite a likely outcome).
That could be detected by quantitative PCR but that is a significant extra amount of work when
testing many STSs. Each of the above approaches has some validity, but the work required to be
certain that an STS is unique means that this is not always done (and that is not a big problem if
use of an STS can make clear that it is unsuitable).
It was suggested that one could sequence the PCR product from using STS primers on genomic
DNA to reveal whether diverse sequences have been amplified even though a single band size is
produced. That could work if the repeats were imperfect but if they are perfect then the amplified
sequences from different parts of the genome will be identical. Other strategies of sequencing or
otherwise testing DNA adjacent to STS sites are good ideas in principle and might have elegant
solutions, but in a simple rendering are pretty cumbersome (& not fully worked out in the answers
given).
(b) If you used an STS that in fact contained repeated sequences you would find that a very large
number of BACs scored positive for that STS. Thus, simply looking at the total number of hits for
an STS can alert you to a problem. If you then compared this number of hits to those of other
markers found on the same set of BACs (picked out by that STS) you would quickly see if only the
one STS is over-represented (in which case you should throw out that STS data) or if a unique
genomic locus happened to be over-represented.
Also, if you tried to assemble all of the BACs picked out by an STS probe you would find that they
can be assembled into more than one cluster (from different regions of the genome).
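The first check in (b), looking at the total number of hits per STS, is easy to sketch. The hit counts and the threshold of three times the median below are hypothetical values chosen only for illustration; a real project would choose its cutoff from the library coverage.

```python
from statistics import median

def flag_overrepresented(hits_per_sts, factor=3.0):
    """Return STSs whose BAC hit count exceeds `factor` times the median.

    `factor` is an arbitrary illustrative cutoff, not a published value.
    """
    typical = median(hits_per_sts.values())
    return sorted(s for s, n in hits_per_sts.items() if n > factor * typical)

# Hypothetical data: number of BACs scoring positive for each STS.
hits = {"sts1": 8, "sts2": 7, "sts3": 9, "sts4": 64, "sts5": 6}
print(flag_overrepresented(hits))
```

An STS flagged in this way would then be compared with the other markers on the same BACs, as described above, before its data are thrown out.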
(v) Looking at only BAC 1-4 & using the same method as before
abde
abd
aefh
acefh
so e falls outside abd block, bd & fh either side of ae
so (bd) a e (fh) is order of markers with bd and fh order unknown
[Map of BACs 1-4 against the partially resolved STS order; X marks an STS present in the BAC.]
STS      (bd)  a  e  (fh)
BAC 2     X    X
BAC 1     X    X  X
BAC 3          X  X   X
BAC 4          X  X   X
The left-hand end-points of 1 & 2 are not known relative to each other, nor are the right-hand end-points of 3 & 4.
By similar arguments
[Map of all eight BACs against the reduced marker set; X marks an STS present in the BAC.]
STS      g  d  e  h
BAC 6    X  X
BAC 2       X
BAC 1       X  X
BAC 5       X  X
BAC 7       X  X
BAC 8          X
BAC 3          X  X
BAC 4          X  X
Relative end-points not clear on left for (2,1,5, 7) and (8,3,4); for right (6,2), (1,5,7,8) & (3,4)
In all cases BACs can be arranged in some sort of correct order & overlaps can certainly be found.
Thus, the important points are (a) that STS mapping is possible with relatively few BACs (must be
sufficient number that they each have at least one overlap at each end) and markers, and (b)
greater marker and BAC density will give better resolution of end-points (& of course, more endpoints).
It is important to note the positive (a) and the limitations (b), and to be very careful with language.
None of the maps with fewer markers or BACs are “incorrect” and when describing them as
uncertain or having less confidence there is some ambiguity. It is best to be explicit about exactly
what information is lost: precision of knowledge about the extent of overlaps. It is also important to
remember that the objective of mapping is to order the BAC inserts relative to each other, not to
map STS markers.
2. (i) It is obvious that BAC-1 and 3 share almost all fragments, whereas very few fragments
beyond the five used to identify this set of BACs are common to any other pair. It is therefore clear
that BACs 1 and 3 overlap extensively and come from the same genomic region. For the others
you might wonder if they only overlap a little or if they share a set of 6-8 fragments just by coincidence (because size resolution is far from perfect). You could certainly dismiss the idea that all
five BACs overlap by trying to merge a third BAC with any pair of merged BACs (they simply do not
fit). If (other than 1 and 3) they do not all overlap then it is likely that no pair overlaps since they all
have similar numbers of shared fragments. So, the best guess might be that 1 and 3 is the only
overlapping pair. You might be more certain of that conclusion if you did some sophisticated
calculations or if you had direct experience of this kind of experiment. However, that is not critical
here. What is important is to conclude that there is no convincing evidence of BAC1 overlap with
any BAC other than BAC3. Hence, in a real mapping experiment you should not try to merge
BACs unless they have very extensive overlap of fragments.
It is critical for this whole question to appreciate the idea that two fragments of apparently the same
size are not necessarily of exactly the same size and therefore not necessarily from the same
region of the genome. Non-equivalence of that kind will be extremely common because the
number of fragments examined is enormous (maybe about 50 x 200,000) and official resolution
(50bp here) is nowhere near single bp resolution.
Many answers discussed only BAC1 and 3. Discussion of the other BACs is slightly trickier & so
some explicit arguments must be made (beyond the idea that the evidence is not so clear as for
BAC-1 & 3) to arrive at a decision.
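The idea that two fragments of "the same size" only match within the measuring resolution can be made concrete. The sketch below counts fragments shared by two fingerprints when sizes within a tolerance (50bp, the resolution quoted above) are treated as equal. The fragment sizes are hypothetical, and the greedy pairing of sorted sizes is a simplification, not the scoring used by real fingerprinting software.

```python
def shared_fragments(sizes_a, sizes_b, tolerance=50):
    """Count fragments of two fingerprints that match within `tolerance` bp.

    Both lists are sorted and walked together; each fragment is used at
    most once.
    """
    a, b = sorted(sizes_a), sorted(sizes_b)
    i = j = shared = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance:
            shared += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return shared

# Hypothetical fragment sizes (bp) for two BAC fingerprints.
bac_x = [1200, 2500, 4100, 6300, 9800]
bac_y = [1230, 2500, 4400, 6280, 12000]
print(shared_fragments(bac_x, bac_y))
```

With many BACs and about 50 fragments each, some matches of this kind will arise by chance, which is why a handful of shared fragments is not convincing evidence of overlap.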
(ii) If the library is dense enough you would hope to find many pairs like BAC1 and 3, which overlap
extensively. So, an appropriately conservative strategy would be to take BAC1 and find the BACs
with the closest matches in restriction enzyme fingerprint. BAC-3 might be the closest on one side
and another BAC might be very close to BAC-1 on the other side (displaying one or two unique
fragments not present in BAC-3). You could then (proceeding in each direction outwards) take this
new BAC and, separately, BAC-3 and find their closest matches and add them to the growing
merge. In this way you can keep taking small steps outwards without ever having to question
whether overlaps are coincidental. For this to be true you really need almost every fragment in
a pair of BACs to match in size.
Some answers mentioned a method that is close to this but, in my opinion, not identical: namely a
statistic for the degree of overlap required to call something genuine (it depends on genome size and
insert sizes). That was used historically in genome projects. It may allow you to proceed with
slightly smaller libraries than required by my argument above, where very extensive overlap is
required but not precisely defined.
(iii) If you had too few BACs you would not have pairs that overlap over most of their length and
you would not be able to make the easy decisions (about genuine overlap) described above.
Instead, in most cases, even with the closest pair of BACs (in terms of the number of apparently
common restriction fragments) you would be uncertain of whether the BACs genuinely shared
overlap. Exactly what the cutoff should be to determine if a match should be accepted is not trivial
to determine. It will depend on the accuracy of measuring fragment lengths, the size of the
genome and the number of BAC clones (and is determined on those bases for real mapping
projects- termed the Sulston cutoff). From the data here (which are not authentic) we argued in (i)
that the five BACs cannot all overlap, yet roughly 25% of fragments were in common between pairs
of BACs. From that you would say that 25% of similarly sized fragments is not good enough to infer
an overlap. Hence, you would require greater overlap of BACs to be able to call an overlap with this
method. Notice that this is very different to the analogous answer for STS mapping. STS mapping
can work with quite a small library but fingerprinting requires very extensive overlaps, and hence a
very big library. You might think that the requirement for a big library means that unnecessary work
will be done. However, (a) that type of repetitive mapping work is fairly easy to expand, (b) very
large libraries can be made because electroporation of E. coli is so efficient, and (c) the multiple
overlapping BACs can be useful in clearly identifying outliers (artifacts) and assuring you that the
rest of the BACs are fine and properly aligned.
If you wanted to approximate how big the library would be:
if you allowed only 3-4 restriction fragments to differ, that would mean requiring overlaps for all but
about 20kb or 10% of BAC size. If the BACs were evenly spaced that means you would need about 10-fold coverage of the genome. They won't be so perfectly spaced so perhaps 20-fold would be
about right.
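The library-size estimate above is simple arithmetic and can be written out. In the sketch below the 200kb insert and 20kb offset come from the text; the genome size of 3,000,000kb is an assumed round figure for the human genome and is not given in the answer.

```python
def fold_coverage(insert_kb, max_offset_kb):
    """Coverage needed if consecutive BACs may be offset by at most max_offset_kb."""
    return insert_kb / max_offset_kb

def clones_needed(genome_kb, insert_kb, coverage):
    """Number of clones giving the requested fold coverage of the genome."""
    return round(coverage * genome_kb / insert_kb)

ideal = fold_coverage(200, 20)  # evenly spaced BACs
print(ideal)

# Assumed genome size (3,000,000 kb); 20-fold allows for uneven spacing.
print(clones_needed(3_000_000, 200, 20))
```

Under those assumptions the 20-fold figure corresponds to about 300,000 BACs, the same number quoted in Q1.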
(iv) (a) Fingerprint analysis is usually done with a 6-cutter enzyme such as EcoRI or HindIII. Such
enzymes cut on average once every 4^6 bp (4096 bp), or roughly every 4kb. Of course, you will occasionally
get a 35kb fragment and a 25bp fragment, but on average you would expect to see about 50 bands
from each BAC and the majority of bands in the 1-10kb range. You would expect to be able to
position almost all of the restriction sites (small fragments are ignored) on a map of merged BACs.
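The expected band count follows from the length of the recognition site. The short calculation below assumes the four bases are equally frequent and independent, which is only an approximation for real genomic DNA.

```python
def mean_site_spacing(site_length):
    """Average distance (bp) between sites, assuming random sequence."""
    return 4 ** site_length

def expected_fragments(insert_bp, site_length):
    """Rough number of fragments from digesting an insert of insert_bp."""
    return insert_bp / mean_site_spacing(site_length)

spacing = mean_site_spacing(6)           # 6-cutter such as EcoRI or HindIII
bands = expected_fragments(200_000, 6)   # 200kb BAC insert
print(spacing, round(bands))
```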
As discussed in Q1 STS markers may well be 20-50kb apart. Hence, you could easily have a
deletion in a BAC between two STS markers, which does not delete either marker and would be
unnoticed. However, any such deletion would either remove one or more restriction fragments or
alter their size in a way that is easily detected. The affected BAC could be merged on the basis of
other fragments.
The potential artifact described is important to detect. If you happened to pick one such BAC clone
as part of a minimal tiling path for sequencing you would end up with the wrong sequence for that
portion of the genome.
(b) The restriction map constructed from overlapping BACs ought to be the same as in the genome
from where the BAC fragments were cloned. However, cloning artifacts (changes to the BAC
fragments) and mistakes in aligning them can occur. To check for this you could make probes
from various segments of BACs (even whole BACs if they contain no repetitive DNA sequences)
and use them on genomic Southern blots to see if the genomic restriction fragment sizes match
those in your map.
A key point, missed by most, is that you can actually make such maps for uncloned genomic DNA,
whereas STS maps from uncloned DNA would be hard (perhaps FISH hybridization positions- but
these would have very low resolution).
(c) Once you have a complete genome sequence you can use that information to construct a map
of restriction enzyme sites & compare this with your clone map.
This allows you to compare the two maps very accurately (errors in measuring fragment size and
ignoring small fragments in fingerprinting account for acceptable differences, as would RFLPs).
For STSs the genome sequence could verify the order but that would provide no verification for the
space between STSs (95% of the genome).
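Constructing a restriction map from known sequence, as in (c), amounts to finding every recognition site and measuring the distances between cuts. The sketch below does this for EcoRI (G^AATTC) on a short made-up sequence treated as linear; the function name and the example sequence are illustrative only.

```python
def digest(sequence, site="GAATTC", cut_offset=1):
    """Return fragment lengths from digesting a linear sequence.

    `cut_offset` is the cut position within the site; EcoRI cuts after
    the first G of GAATTC, so the offset is 1.
    """
    seq = sequence.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    edges = [0] + cuts + [len(seq)]
    return [right - left for left, right in zip(edges, edges[1:])]

# Made-up 26-nt sequence containing two EcoRI sites.
example = "AAAA" + "GAATTC" + "CCCCCCCC" + "GAATTC" + "TT"
print(digest(example))
```

The predicted fragment sizes can then be compared with the fingerprint of each clone, allowing for measurement error and for the small fragments that fingerprinting ignores.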
3. (i) It is very important to generate a diversity of starting-points for sequencing. As an absolute
minimum you would require starting points to be separated by no more than the length of a single
sequence (say 700nt) in order to be sure you get full coverage of any strand. Sau3A sites, which
occasionally will be separated by as much as 2-3kb, would make that impossible. In reality you
require even more extensive staggering of start-points so that you can cover any region of
sequence several times. For BAC alignments the end-points of consecutive BAC inserts can be
staggered by many kb without compromising your ability to detect the overlaps (even by restriction
enzyme fingerprinting) and so the spacing of Sau3A sites is adequate.
(ii) In the first, bulk phase of a shotgun sequencing project you simply take each sequencing read
as a separate piece of information. So long as it represents correct contiguous sequence it does
not matter where the sequence read came from. In a 2x 3kb composite clone a sequencing read
(from each end) would not be long enough to cross the junction between the two inserts so the
entire sequence being read should be correct. Although mate-pair information can be crucial to
finish a sequence, that information is never used for the majority of sequenced clones. A
composite would indeed give misleading mate-pair information, perhaps leading you to test an incorrect
scaffolding arrangement and to attempt to link sequences that cannot be linked. Note, however, that
you would not actually link the two sequences incorrectly unless you performed primer walking on
the composite clone, which would actually reveal that it is longer than expected.
(iii) As explained above, we only use sequence information. We do not actually order plasmid subclones and we do not need to know the entire sequence of any of those plasmid clones.
(iv)
(a) You should see overlaps among Seq 1-4 and between Seq 5-6 as illustrated below:
[Figure: staggered alignment of Seq1, Seq2, Seq4 and Seq3 (in that order) against a scale marked 0, 100, 300 and 550; and, separately, an alignment of Seq5 with Seq6.]
(b) Once you test an alignment you can decide whether to accept it and how to designate a merge.
Normally, as indicated in the question, the merge will reflect uncertainties by introducing N where
two sequences do not match properly (where N might be a choice of two different nucleotides or a
nucleotide vs a space). If you use an N-ridden merge it will not match well further and your final
product will in any case contain many Ns, the nature of which will eventually have to be resolved.
All this simply re-states the idea in the question that producing too many Ns in a merge is very
undesirable. So the simplest idea is to require that any alignment is perfect or near-perfect over its
entire length. That way “Poor Seq2” would never be included in any merge. Alternatively, you could
require a perfect match over only part of the sequence and you could trim Poor Seq2 so that only
the good portion became incorporated. The first idea of simply ignoring a poor sequence read (or
actively throwing it out once recognized as giving multiple partial alignments) is almost certainly the
most effective. You deliberately create a large excess of sequencing reads in these projects and
can afford to reject many sequences if the quality is not high enough.
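The strict policy in (b), accepting a merge only when the overlap is perfect, can be sketched as follows. The reads and the minimum overlap of 5 nt are toy values chosen for illustration; real assemblers use much longer overlaps and quality scores, and this is not meant to represent any particular program.

```python
def overlap_length(left, right, min_overlap=20):
    """Longest suffix of `left` that exactly equals a prefix of `right`."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0

def merge(left, right, min_overlap=20):
    """Merge two reads on a perfect overlap, or return None to reject."""
    k = overlap_length(left, right, min_overlap)
    return left + right[k:] if k else None

good_read = "ATGCGTACCTGA"
next_read = "ACCTGAGGTTC"   # first 6 nt match the end of good_read
poor_read = "ACCTGTGGTTC"   # one mismatch inside the overlap

print(merge(good_read, next_read, min_overlap=5))
print(merge(good_read, poor_read, min_overlap=5))
```

A read that fails in this way is simply left out, which is affordable because of the large excess of reads.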
(c) In a real project you would simply keep looking for merges among many more sequences. If
you have enough sequences merges will become longer and longer without any special steps or
intervention.
(d) Connecting requires obtaining new DNA sequence that overlaps the two contigs. An obvious
strategy is primer walking, where you design a primer based on sequences near the end of each
contig (& pointing into unknown territory). You also need a template. Here, we have a single BAC
which is certain to include the required sequences so that can be used directly. In whole genome
shotgun sequencing it would be necessary to go back to the plasmid (or BAC) clone templates
used to generate sequence reads that are near the ends of the contigs we wish to join. If you
looked for mate-pairs (two sequences in different contigs that were read from the two ends of the
same plasmid) you could get an idea of how much missing sequence there is and be able to
identify a template that could bridge the entire gap.
(e) Both ends of plasmid subclones and BAC clones are sequenced and labeled appropriately.
Hence, you look for sequences near the end of a contig and ask if the mate pair for that sequence
is found on another contig. If it is, then the two contigs represent nearby sequences (separated by
less than the size of the insert in the plasmid in question). You could then use that plasmid as
template for primer walking, eventually generating sequences that spanned the two contigs.
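The mate-pair search in (e) is a lookup: for each clone, ask which contig each of its two end reads landed in. The sketch below uses hypothetical clone and contig names and reports clones whose ends fall in different contigs, since those are the candidate templates for primer walking.

```python
def linked_contigs(read_to_contig, mate_pairs):
    """Map each pair of contigs to the clones whose two end reads link them.

    read_to_contig: {read name: contig name} for reads placed in contigs.
    mate_pairs: {clone name: (forward read, reverse read)}.
    """
    links = {}
    for clone, (fwd, rev) in mate_pairs.items():
        c1, c2 = read_to_contig.get(fwd), read_to_contig.get(rev)
        if c1 is None or c2 is None or c1 == c2:
            continue  # unplaced read, or both ends in the same contig
        links.setdefault(tuple(sorted((c1, c2))), []).append(clone)
    return links

# Hypothetical assembly: p2 has one end in each contig.
read_to_contig = {
    "p1.f": "contigA", "p1.r": "contigA",
    "p2.f": "contigA", "p2.r": "contigB",
    "p3.f": "contigB",
}
mate_pairs = {
    "p1": ("p1.f", "p1.r"),
    "p2": ("p2.f", "p2.r"),
    "p3": ("p3.f", "p3.r"),
}
print(linked_contigs(read_to_contig, mate_pairs))
```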
(f) You should see that the new BACs have overlaps with each of the previously assembled contigs
and that they overlap each other in the repeated sequences. Potentially therefore these two new
BACs can complete one contig and if the repeats are tandem (rather than dispersed) that linkage is
reasonable to make. The key point, however, is that we have no idea how many tandem repeats
are really present in the parent (BAC) sequence. If there were additional clones with more repeat
units that could set a minimum. However, we only can establish a maximum (& the exact number)
if we have a sequence that spans the entire set.
(g) If one sequence read proceeds from unique sequence through repeats to unique sequence we
can define the exact correct sequence. This can only happen if the total length of repeats is less
than about 700bp. We might still worry that during generation of that clone there had been some
recombination event that eliminated some repeats compared to the original source DNA (so seeing
the same result in more than one clone would be reassuring).