Name: _____________________

Fragment Assembly Problem: Sequencing the Human Genome

In this activity we will solve puzzles. The theme of the day is reconstructing sequences of letters and numbers that are part of a message, password, or code that has been chopped into pieces. We’ll call this the Fragment Assembly Problem. This activity will help improve your problem-solving skills, and you’ll learn how the code of the Human Genome was discovered!

Part 1: Song lyrics

The lines of a popular song have been chopped up and put back together in the wrong order. Can you figure out how to put them back together the right way?

Fragmented message:

he world so high, like
ou are. Twin
a diamond in the sky.
ttle star, how I won
kle, twinkle, li
Up above t
der what y

Reconstructed message:

Q. Was it easy to put it together? Why?

Part 2: Long password

You’re working on a class project with four other classmates. You will be using a web-based tool that lets each of you access the project with the right password. You printed out copies of the password for your four classmates. Unfortunately, your little brother thought it would be fun to cut up the passwords, leaving you with a dozen or so little pieces of paper. As if that weren’t bad enough, you can’t find where you originally wrote the password. You have already put in a full day of work on the project, and it will be lost if you cannot figure out the password.

Q. Work with a partner to put the fragments back in the correct order. Remember, there are 4 copies of the same password. What is the password?

Q. Was it more difficult to put these pieces back together than it was for the song? Why?

Q. Briefly describe your strategy for reconstructing the password. How confident are you that the sequence you found is the true one? Could you have put the pieces together in a different way?

Q. What would happen if, instead of 4 copies of the password, you had 20 copies, all cut into fragments?
What if you had only 2 copies of the password, cut into fragments? What if you had only 1 copy of the password in fragments (you can TRY this one using one fragmented copy of the sequence)? For each case, would you be able to reassemble the password? Would it be easier, or more difficult? How sure could you be that you had the right password?

Part 3: Sequencing the Human Genome

Knowledge of our genome can help us learn about and fight genetic diseases. By studying the genetic makeup of bacteria and viruses, we can get closer to developing vaccines against them. Furthermore, we can learn how we are evolutionarily related to other organisms by seeing how our DNA compares to theirs.

The Human Genome Project sought to figure out the sequence of bases that makes up the Human Genome. The project was completed in 2003. You will discover how scientists were able to do it.

Let’s review some things we already know about DNA:

- DNA is composed of sequences of bases (A’s, C’s, G’s, and T’s)
- Bases are grouped into codons (words), each of which codes for a particular amino acid
- A gene is a sequence of codons that contains the instructions for how a protein should form
- The human genome is the complete set of human DNA, including all of our genes; it’s composed of billions of bases

The Challenge

Imagine you are a biologist in the 1990s and you want to determine the genetic code that makes up the human species, as well as that of other species. Biologists have developed very clever laboratory methods for sequencing DNA. Unfortunately, these methods only work on short DNA fragments, on the order of hundreds of units. A complete strand of DNA, however, is made of millions of bases. Suppose you have several copies of a single piece of DNA from an organism - let’s call it our target strand. How will you figure out the entire sequence of one million A’s, C’s, G’s, and T’s in these long strands?
Biologists do know how to chop DNA strands up into short fragments of the size that their sequencing machines can handle. (Strands that are too short or too long are not sequenced at all.) Great, you say - let’s just chop our strand up, feed the smaller pieces into the sequencing machine, and then we will have the whole strand read.

There are some major problems with this. First, when scientists chop up a strand of DNA, they are not able to control exactly where the cuts are made; instead, the DNA is chopped into pieces at random locations in the string. Luckily, though, the fragments are usually small enough to be read by the sequencing machine. The second difficulty is that there is no way to keep track of how the resulting fragments are ordered in the target strand. So you’ll end up with the sequences of hundreds of thousands of DNA fragments, but no way to piece them together. It’s as if someone chopped up several copies of the great works of Confucius into tiny snippets and asked you to recreate the original works. And you don’t even know how to read ancient Chinese script. Sounds hopeless.

Summarize the main science objectives and challenges in your own words:

Let’s represent the problem in a simpler way by setting aside some of the details:

The Fragment Assembly Problem

input (what we have to start with): a collection of DNA fragments

output (what we hope to get in the end): an assembly of the fragments, i.e. a single DNA strand that includes all input fragments as sub-strands, in the correct sequence

Q. If you remove the word ‘DNA’ from the above description of the problem, is it any different from the password reconstruction problem? Can you apply a technique similar to the one you used to determine the password? Why is the DNA sequencing problem so much more challenging? Do you think a computer might be able to help? Why or why not?
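
To make the input/output description of the Fragment Assembly Problem concrete, here is a minimal Python sketch of one part of it that a computer can check: whether a candidate assembly contains every input fragment as a sub-strand. The function name `is_valid_assembly` and the toy fragments are made up for illustration. Note that passing this check does not prove the candidate is the true target strand, only that it is consistent with the fragments.

```python
def is_valid_assembly(assembly, fragments):
    """Return True if every fragment appears somewhere inside the assembly."""
    return all(fragment in assembly for fragment in fragments)


# Toy fragments cut from the word "ASSEMBLY" (illustrative only, not real data).
fragments = ["ASSE", "SEMB", "MBLY"]

print(is_valid_assembly("ASSEMBLY", fragments))  # True: all three fragments fit
print(is_valid_assembly("ASSEMB", fragments))    # False: "MBLY" is missing
```

A check like this works the same way for a password, a song lyric, or a DNA strand, which is one hint that the problems share the same structure.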
Simplify the problem

When faced with a challenging problem, it sometimes helps to think concretely about a simple example and explore solutions for that example. Suppose that the target strand is now much, much shorter, so we can easily write it down - say between 20 and 30 units long - and your sequencing machine can sequence strands between 6 and 11 units long. You have at hand two copies of the target strand. You chop them into fragments and sequence the fragments that are of suitable length. You get back three fragments, say CAAGACCAA, TTACCGGGCC, and CAACAAATTAC.

Q. Try to assemble the three fragments. Are there multiple ways to assemble these fragments? How can you determine which of the possible assemblies is more likely to be the correct one, and why?

Think about your method: Were there any “rules of thumb” you used when assembling the fragments? Were there steps or processes that you repeated often (possibly on different strands)? Do you prefer one assembly of the fragments over another? Why? **This is very important**

Q. Assembling DNA fragments - Algorithm

- Write out an algorithm for solving this example of the fragment assembly problem
- Aim for simplicity
- Your description can be informal, in your own words, but should be clear enough that someone else in the class could execute it without confusion
- Swap with a partner and read through their algorithm. Imagine doing exactly what the algorithm instructs you to do. Is each step clear? Will these directions lead you to construct the true sequence? Is any step missing or ambiguous?

Q. Describe in words how you would modify your algorithm to handle the “full” DNA sequencing problem.

The Shorter the Assembly, the Better

In the real DNA sequencing problem, many assemblies of a collection of fragments are possible (remember, the entire sequence is composed of only four letters; there are bound to be repeated sequences). How can we tell which one is most likely to be the true target sequence?
We can’t be sure, but we might be able to make a good guess. One criterion to use in selecting the “winner” is the length of the resulting strand: the shorter, the better. Roughly, the rationale for this criterion is that the more overlap there is between two strands, the more evidence we have that they really are from the same part of the original DNA strand. After all, it is from the overlap that we are able to reassemble the fragments.

Q. Explain in your own words why overlap is important, and why the shortest sequence might be thought to be the “winner.”

A Computational Thinking approach was key to sequencing the human genome

The Human Genome Project began in 1990 with the goal of identifying all of the approximately 20,000-25,000 genes in human DNA and determining the sequence of the 3 billion chemical base pairs that make up human DNA. It was a huge task, but one that scientists knew was extremely important. Biologists realized at this time that a computational thinking approach might be useful, so with the help of computer scientists, the so-called “Shotgun Sequencing” algorithm was developed and used to tackle the sequencing of the human genome. This algorithm was implemented on computers, allowing the DNA sequencing procedure to be automated. The project was completed in 2003.

The “Shotgun Sequencing” algorithm gets its name from an analogy with the rapidly expanding, quasi-random firing pattern of a shotgun.

Q. Can you see how the shotgun analogy makes sense, based on what you’ve learned about DNA sequencing? Explain. Why is it important for the DNA sequencing problem that the DNA strands are cut up randomly (that is, at random positions in the strand, which makes the lengths random, too)?

Q. Why do you think computers were so essential to accomplishing the task of sequencing the human genome?

Q.
Come up with and explain one or two other examples in which computers have played (or currently play) an essential role in scientific progress. If you do not know of one, try doing some research to learn about one.
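
For comparison with the algorithm you wrote earlier, here is a minimal Python sketch of one possible strategy based on the “shorter the better” criterion: repeatedly merge the two pieces that overlap the most. The function names `overlap` and `greedy_assemble` are made up for this illustration. This is a simplified sketch, not the software used by the Human Genome Project; it assumes the fragments contain no reading errors, and a greedy strategy like this is not guaranteed to find the shortest possible assembly.

```python
def overlap(left, right):
    """Length of the longest end of `left` that matches the start of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(fragments):
    """Merge the pair of pieces with the largest overlap until one piece remains."""
    # Drop exact duplicates, then drop pieces already contained in another piece.
    pieces = list(dict.fromkeys(fragments))
    pieces = [p for p in pieces if not any(p != q and p in q for q in pieces)]

    while len(pieces) > 1:
        best_size, best_i, best_j = -1, 0, 1
        for i, left in enumerate(pieces):
            for j, right in enumerate(pieces):
                if i != j:
                    size = overlap(left, right)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        # Glue the best pair together, writing the shared letters only once.
        merged = pieces[best_i] + pieces[best_j][best_size:]
        pieces = [p for k, p in enumerate(pieces) if k not in (best_i, best_j)]
        pieces.append(merged)

    return pieces[0] if pieces else ""


# The three fragments from the simplified example in this worksheet.
fragments = ["CAAGACCAA", "TTACCGGGCC", "CAACAAATTAC"]
result = greedy_assemble(fragments)
print(result)       # CAAGACCAACAAATTACCGGGCC
print(len(result))  # 23
```

With only three short fragments this runs instantly, but the sketch compares every pair of pieces in every round, so the work grows quickly as the number of fragments increases. That is one reason careful algorithm design mattered so much for the full genome.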