DNAfragmentAssembly_handouts

Name: _____________________
Fragment Assembly Problem:
Sequencing the Human Genome
In this activity we will solve puzzles. The theme of the day will be reconstructing
sequences of letters and numbers that are part of a message, password, or code that has
been chopped into pieces. We’ll call this the Fragment Assembly Problem. This activity
will help improve your problem-solving skills, and you’ll learn about how the code of the
Human Genome was discovered!
Part 1: Song lyrics
The lines of a popular song have been chopped up and put back together in the wrong
order. Can you figure out how to put them back together the right way?
Fragmented message: Reconstructed message:
he world so high, like
ou are.
Twin
a diamond in the sky.
ttle star, how I won
kle, twinkle, li
Up above t
der what y
Q. Was it easy to put it together? Why?
Part 2: Long password
You’re working on a class project with four other classmates. You will be using a
web-based tool that lets each of you access the project with the right password. You
printed out copies of the password for your four classmates. Unfortunately, your little
brother thought it would be fun to cut up the passwords, leaving you with a dozen or so
little pieces of paper. If that wasn’t bad enough, you can’t find where you originally
wrote the password. You have already put in a full day of work on the project, and it will
be lost if you cannot figure out the password.
Q. Work with a partner to put the fragments back in the correct order. Remember, there
are 4 copies of the same password. What is the password?
Q. Was it more difficult to put these pieces back together than it was for the song? Why?
Q. Briefly describe/discuss your strategy in reconstructing the password. How confident
are you that the sequence you found is the true one? Could you have put the pieces
together in a different way?
Q. What would happen if, instead of 4 copies of the password, you had 20 copies, all cut
into fragments? What if you only had 2 copies of the password, cut into fragments? What
if you had only 1 copy of the password in fragments (you can TRY this one using one
fragmented copy of the sequence)? For each case, would you be able to reassemble the
password? Would it be easier, or more difficult? How sure could you be that you had the
right password?
Part 3: Sequencing the Human Genome
Knowledge of our genome can help us to learn about and fight genetic diseases. By
studying the genetic makeup of bacteria and viruses, we can get closer to developing
vaccinations against them. Furthermore, we can learn how we are evolutionarily related
to other organisms by seeing how our DNA compares to theirs. The Human Genome
Project sought to figure out the sequence of bases that makes up the Human Genome. The
project was completed in 2003. You will discover how scientists were able to do it.
Let’s review some things we already know about DNA:
• DNA is composed of sequences of bases (A’s, C’s, G’s, and T’s)
• Groups of three bases make up codons (words), each of which codes for a particular
amino acid
• A gene is a sequence of codons, which contains the instructions for
how a protein should form
• The human genome is the complete set of human DNA, including all of our genes; it’s
composed of billions of bases
The Challenge
Imagine you are a biologist in the 1990s and you want to determine the genetic code that
makes up the human species, as well as that of other species. Biologists have developed
very clever laboratory methods for sequencing DNA. Unfortunately these methods only
work on short DNA fragments, on the order of hundreds of bases. A complete strand of
DNA, however, is made of millions of bases. Suppose you have several copies of a single
piece of DNA from an organism - let’s call it our target strand. How will you figure out
the entire sequence of millions of A’s, C’s, G’s, and T’s in this long strand?
Biologists do know how to chop DNA strands up into short fragments, of the size that
their sequencing machines can handle. (Strands which are too short or too long are not
sequenced at all.) Great, you say - let’s just chop our strand up, feed the smaller
pieces into the sequencing machine, and then we will have the whole strand read.
There are some major problems with this. First, when scientists chop up a strand of DNA,
they are not able to control exactly where the cuts are made; instead, the DNA is chopped
into pieces at random locations in the string. Luckily, though, the fragments are usually
small enough to be read by the sequencing machine. The second difficulty is that there is
no way to keep track of how the resulting fragments are ordered in the target strand. So,
you’ll end up with the sequences of hundreds of thousands of DNA fragments, but no
way to piece them together. It’s as if someone chopped up several copies of the great
works of Confucius into tiny snippets, and asked you to recreate the original works. And
you don’t even know how to read ancient Chinese script. Sounds hopeless.
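To make the chopping step concrete, here is a minimal Python sketch of it. Everything in it is an assumption made for illustration: the target strand is made up, the names `chop` and `shotgun_reads` are hypothetical, and the cut points are picked uniformly at random, which is a simplification of what happens in the lab.

```python
import random

def chop(strand, num_cuts, rng):
    """Cut one copy of `strand` at `num_cuts` random positions."""
    # Pick distinct cut positions strictly inside the strand.
    cuts = sorted(rng.sample(range(1, len(strand)), num_cuts))
    bounds = [0] + cuts + [len(strand)]
    return [strand[a:b] for a, b in zip(bounds, bounds[1:])]

def shotgun_reads(strand, copies, num_cuts, min_len, max_len, seed=0):
    """Chop several copies and keep only fragments the machine can read."""
    rng = random.Random(seed)
    reads = []
    for _ in range(copies):
        for piece in chop(strand, num_cuts, rng):
            # Fragments that are too short or too long are never sequenced.
            if min_len <= len(piece) <= max_len:
                reads.append(piece)
    rng.shuffle(reads)  # the original order of the fragments is lost
    return reads

target = "ATGGCGTACGTTAGCCTAGGATCCA"  # made-up example strand
reads = shotgun_reads(target, copies=4, num_cuts=3, min_len=6, max_len=10)
print(reads)
```

Changing `seed` gives a different set of cuts each time, which is exactly why fragments from different copies can end up overlapping one another.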
Summarize the main science objectives and challenges in your own words:
Let’s represent the problem in a simpler way by setting aside some of the details:
The Fragment Assembly Problem
input (what we have to start with):
input (what we have to start with):
• a collection of DNA fragments
output (what we hope to get in the end):
• an assembly of the strands, i.e. a DNA strand (assembly) that includes all input
strands as sub-strands in the correct sequence
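One part of this definition can be checked mechanically: whether a proposed assembly really contains every input fragment. The short Python sketch below does that. The function name `is_valid_assembly` and the fragments are made up for illustration, and the check cannot tell you whether the assembly is the true original strand, only that no fragment is missing.

```python
def is_valid_assembly(assembly, fragments):
    """True if every fragment appears somewhere inside the assembly."""
    return all(fragment in assembly for fragment in fragments)

fragments = ["GATTA", "TTACA", "ACAGG"]  # made-up fragments

print(is_valid_assembly("GATTACAGG", fragments))   # every fragment is present
print(is_valid_assembly("GATTAACAGG", fragments))  # "TTACA" is missing
```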
Q. If you remove the word ‘DNA’ from the above description of the problem, is it any
different from the password reconstruction problem? Can you apply a similar technique
to that which you used to determine the password? Why is the DNA sequencing problem
so much more challenging? Do you think a computer might be able to help? Why or why
not?
Simplify the problem
When faced with a challenging problem, it sometimes helps to think concretely about a
simple example, and explore solutions for this example. Suppose that the target strand is
now much, much shorter, so we can easily write it down - say between 20 and 30 units
long, and your sequencing machine can sequence strands between 6 and 10 units long.
You have at hand two copies of the target strand.
You chop them into fragments and sequence the fragments that are of suitable length.
You get back three fragments, say CAAGACCAA, TTACCGGGCC, and
CAACAAATTAC.
Q. Try to assemble the three fragments. Are there multiple ways to assemble these
fragments? How can you determine which of the possible assemblies is more likely to be
the correct one, and why?
Think about your method:
• Were there any “rules of thumb” you used when assembling the fragments?
• Were there steps or processes that you repeated often (possibly on different
strands)?
• Do you prefer one assembly of the fragments over another? Why? **This is very
important**
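One step that tends to be repeated is comparing the end of one fragment with the start of another. As a rough illustration, the Python sketch below measures that overlap for every ordered pair of the three example fragments. The helper name `overlap` is hypothetical, and this is only one possible way to carry out the comparison.

```python
def overlap(left, right):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0

fragments = ["CAAGACCAA", "TTACCGGGCC", "CAACAAATTAC"]

# Try every ordered pair: which fragment could follow which?
for left in fragments:
    for right in fragments:
        if left != right:
            print(left, "->", right, "overlap:", overlap(left, right))
```

Running it prints the overlap for each ordered pair. The pairs with the largest overlaps are natural candidates to join first.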
Q. Assembling DNA fragments - Algorithm
• Write out an algorithm for solving this example of the fragment assembly
problem
• Aim for simplicity
• Your description can be informal, in your own words, but should be clear enough
that someone else in the class could execute it without confusion
Swap with a partner and read through their algorithm. Imagine doing exactly what the
algorithm instructs you to do. Is each step clear? Will these directions lead you to
construct the true sequence? Is any step missing or ambiguous?
Q. Describe in words how you would modify your algorithm to handle the “full” DNA
sequencing problem
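For comparison with your own algorithm, here is one possible approach written as a short Python sketch: repeatedly join the two pieces with the largest overlap until only one piece is left. This "greedy" strategy is an assumption chosen for illustration, not necessarily the method real sequencing software uses, and it is not guaranteed to find the best assembly. The names `overlap`, `merge`, and `greedy_assemble` are hypothetical, and the sketch assumes at least one fragment is given.

```python
def overlap(left, right):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def merge(left, right):
    """Join two pieces, writing the overlapping part only once."""
    return left + right[overlap(left, right):]

def greedy_assemble(fragments):
    """Repeatedly merge the pair of pieces with the largest overlap."""
    pieces = list(fragments)
    while len(pieces) > 1:
        best = None  # (overlap size, index of left piece, index of right piece)
        for i, left in enumerate(pieces):
            for j, right in enumerate(pieces):
                if i != j:
                    size = overlap(left, right)
                    if best is None or size > best[0]:
                        best = (size, i, j)
        _, i, j = best
        joined = merge(pieces[i], pieces[j])
        # Replace the two chosen pieces with their merged version.
        pieces = [p for k, p in enumerate(pieces) if k not in (i, j)] + [joined]
    return pieces[0]

fragments = ["CAAGACCAA", "TTACCGGGCC", "CAACAAATTAC"]
assembly = greedy_assemble(fragments)
print(assembly, len(assembly))
```

With hundreds of thousands of fragments, comparing every pair over and over becomes very slow, which is one reason the full problem needs cleverer algorithms and fast computers.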
The Shorter the Assembly, the Better
In the real DNA sequencing problem, many assemblies of a collection of fragments are
possible (remember, the entire sequence is composed of only four letters; there are bound
to be repeated sequences). How can we tell which one is most likely to be the true target
sequence? We can’t be sure, but we might be able to make a good guess.
One criterion to use in selecting the “winner” is the length of the resulting strand: the
shorter, the better. Roughly, the rationale for using this criterion is that the more overlap
there is between two strands, the more evidence we have that they really are from the
same part of the original DNA strand. After all, it is from the overlap that we are able to
reassemble the fragments.
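For a handful of fragments, the "shorter is better" criterion can be tried out directly by building an assembly for every possible ordering and comparing the lengths. The Python sketch below does this for the three example fragments. It is only a sketch under stated assumptions: the function names are hypothetical, and trying every ordering is only practical for very small inputs because the number of orderings grows extremely fast.

```python
from itertools import permutations

def overlap(left, right):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def merge(left, right):
    """Join two pieces, writing the overlapping part only once."""
    return left + right[overlap(left, right):]

def assemble_in_order(ordering):
    """Join the fragments left to right in the given order."""
    result = ordering[0]
    for fragment in ordering[1:]:
        result = merge(result, fragment)
    return result

def shortest_assembly(fragments):
    """Try every ordering and keep the shortest resulting strand."""
    candidates = [assemble_in_order(order) for order in permutations(fragments)]
    return min(candidates, key=len)

fragments = ["CAAGACCAA", "TTACCGGGCC", "CAACAAATTAC"]
for order in permutations(fragments):
    candidate = assemble_in_order(order)
    print(len(candidate), candidate)
print("shortest:", shortest_assembly(fragments))
```

Each printed line is one possible assembly with its length, so you can see how much the result changes with the order in which the fragments are joined.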
Q. Explain in your own words why overlap is important, and why the shortest sequence
might be thought to be the “winner.”
A Computational Thinking approach was key to the sequencing of the human genome
The Human Genome Project began in 1990 with the goal of identifying all the
approximately 20,000-25,000 genes in human DNA and determining the sequences of the
3 billion chemical base pairs that make up human DNA. It was a huge task, but one that
scientists knew was extremely important. Biologists realized at this time that a
computational thinking approach might be useful, so with the help of computer scientists,
the so-called “Shotgun Sequencing” algorithm was developed and used to tackle the
sequencing of the human genome. This algorithm was implemented on computers,
allowing for the automation of the DNA sequencing procedure. The project was
completed in 2003.
The “Shotgun Sequencing” algorithm gets its name from the analogy of the rapidly expanding, quasi-random firing pattern of a shotgun.
Q. Can you see how the shotgun analogy makes sense, based on what you’ve learned
about DNA sequencing? Explain. Why is it important for the DNA sequencing problem
that the DNA strands are cut up randomly (that is, at random positions in the strand,
which makes the lengths random, too)?
Q. Why do you think computers were so essential to accomplishing the task of
sequencing the human genome?
Q. Come up with and explain 1-2 other examples in which computers have played (or
currently play) an essential role in helping scientific progress. If you do not know of one,
try doing some research to learn about one.