Nifty Bioinformatics Assignments By Dean Zeller Kent State University Summer, 2007 dzeller@cs.kent.edu http://www.cs.kent.edu/~dzeller Table of Contents Assignment 1 – DNA Modeling ..................................................................1 Assignment 2 – Evolution Models I (Binary Evolution Trees) ...................3 Assignment 3 – Evolution Models II (Incremental K-Leaf Root) ...............6 Assignment 4 – DNA Pattern Matching Statistics .......................................8 Assignment 5 – DNA Visualization ..........................................................13 Assignment 6 – Longest Common Subsequence .......................................17 Introduction The following is a collection of assignments relating to bioinformatics. All assignments have been written in the past two years and were tested in an actual classroom environment. Bioinformatics itself is an involved and complex field, combining concepts of biology, mathematics, and computer science. However, the introductory topics of bioinformatics are not difficult, and need not require a college level mentality. The prerequisite knowledge for these assignments are purposely low, and are intended to “spark” potential interest within students. Copyright notice These assignments and accompanying materials are copyright ©2007 Dean Zeller are all available at the author’s web site. Permission of these assignments is granted for the copy and use for personal or educational purposes. Please contact the author with questions, comments, or feedback. Assignment 1 – DNA Modeling Dean Zeller C&I 47330 Fall, 2006 Due: _________________ 10 points Objective The student will create structurally accurate models of DNA out of pipe cleaners. Materials pipe cleaners, scissors, 3x5 card (optional), hole punch (optional) Level 0 – no experience required Topics This assignment introduces students to concepts of modeling using DNA as the modeled system. It is appropriate for any age over 5, is challenging, and contains of breadth of useful skills in a wide variety of areas. This assignment is suitable for any mathematics, science, or art class. No prior experience or knowledge is required to complete the assignment. Background: Modeling Chemists, mathematicians, and many other scientific professions use models as visual representations of structures. One important aspect of modeling is structural accuracy, i.e. the extent to which it represents key elements of the system it is modeling. In this assignment, the main element to represent about DNA is the bonding of AT and CG nucleotide pairs. As you increase in skills, there are other elements that can be added to increase accuracy. When crafting a model, structural integrity (or stability) is important for the model to keep its shape. It cannot fall apart while being analyzed or modified. There are many materials used in modeling of this nature. For this assignment, the modeling material is pipe cleaners because of their ability to bend and keep their shape. The instability of a pipe cleaner model is in how the pieces are put together. Pieces can be connected through a variety of twisting, wrapping, and coiling. Doing this correctly takes skill and practice. Instructions Follow the steps below to create a single DNA strand of 10 nucleotides. Create additional strands as time permits. See the poster for picture diagrams of the instructions. Step 0: Setup A. Select colors for DNA stalks and nucleotides (A, T, C, G). B. Cut the nucleotide pipe cleaners in fourths. C. Put the nucleotides into AT and CG pairs. Step 1: Create nucleotides Use these steps to create ten (10) nucleotide pairs. A. Select two pipe cleaner pieces of the appropriate colors to form a nucleotide pair (AT or CG). B. Put in a V-shape with a small overlap (1/4”). Choose a long end and the corresponding short end on the same side of the V. C. Coil the long end three or four revolutions around the other short end, covering it completely. There should be some extra left over to attach to a stalk. D. Bend the remaining long and short ends 90° to form a T-shape. E. Repeat step 1C with the remaining long and short ends. Nifty Bioinformatics Assignments – Page 1 Step 2: Connect Nucleotides to Stalks Attach nucleotides to stalks to form a “ladder” A. Attach one side of nucleotide to stalk by wrapping extra length around stalk twice (once on the left, and once on the right). Wrap any remaining length around itself. B. Repeat for other nucleotides, spreading evenly across the stalk. C. Attach opposite stalk in similar way. Leave enough room at ends to connect to other DNA strands. Step 3: …and do the Twist In “ladder” form, the DNA strand may look inconsistent in size. When twisted into a double-helix form, the inconsistencies tend to work themselves out. Further Activities Create a model of an actual DNA sequence. Longer strands: strands of length 100 can be completed in about two hours, once the procedure is learned and practiced. In groups, use an “assembly line” approach to create a very long DNA strand. Class contests: longest DNA strand, strongest DNA strand Grading You will be graded on the following criteria: Accuracy Correct representation of AT/CG pairs in the model Stability Structural integrity of the model Effort Quantity, length, and quality of models created Extra credit will be given for the following: Use a 3x5 card to label the model with the appropriate model nucleotide characters (both sides of the DNA strand). Connect models together to create longer models Nifty Bioinformatics Assignments – Page 2 Assignment 2 – Evolution Models I (Binary Evolution Trees) Dean Zeller CS10051 Spring, 2006 Due: _________________ 10 points Objective The student will use a graphics package to create diagrams of binary evolution trees. Level 1 – use of a graphics package and/or word processor required Readings Read, R.C., and R.J. Wilson, An Atlas of Graphs, Oxford Science Publications, 1999. Background: Evolution Trees This assignment deals with the cutting-edge topic of bioinformatics. It is a complex field of graph theory with applications to mathematics, computer science, and genetics. An evolutionary tree (or phylogeny) is a tree-structure that demonstrates evolution of species over time. A tree consists of vertices (nodes) connected by edges (links). In evolution, a node represents a point in which a species population “splits” into two genetically different species. Once a species splits, the two species created are genetically unable to produce offspring. The leaves of the tree represent the extant (non-extinct) species. Background: Isomorphism Two trees are isomorphic if they contain the same structure and are different only through symmetry of a node. In order to simplify the problem, isomorphic trees are considered the same for purposes of this assignment. y y y y A B C D x z x z x z x z (1242) F E (1244a) (1242) (1242) (1244b) (1242) Figure 1 – Isomorphic trees A and B are isomorphic at node x. B and C are isomorphic at node y. C and D are isomorphic at node z. A and D are isomorphic at node y. As such, A, B, C, and D are all isomorphic, and thus considered the same for purposes of this assignment. E and F are not isomorphic because they are structurally different. Background: Classification labels In order to easily name trees, they are given a text description called a classification label. This allows a text label instead of a visual picture to represent a tree structure. A good classification system has a unique name for each tree. For this assignment, the classification system is simply a listing of the number of nodes at each level. Trees A, B, C, and D above have the label (1242), indicating the first level has one node, the second has two nodes, the third has four nodes, and the fourth has two. Since the four trees are isomorphic, the same label represents all four trees. Ultimately, this classification system is incomplete. While simple to understand, it does not provide unique names for each tree at the higher levels. Trees E and F are not isomorphic, and thus are given separate labels (1244a and 1244b). This system will suffice for now, but can get confusing as the size of the trees increase. Background: Assumptions Introductory study of phylogenies makes another assumption that greatly simplifies the problem. For purposes of this assignment, the only non-leaf structure possible are nodes with exactly two offspring, Nifty Bioinformatics Assignments – Page 3 representing a point in time in which a population “splits” into two populations. This is called a binary evolution tree. A node with a single offspring is called a redundant node and does not significantly add to the tree structure. Nodes with three or more children can be isomorphically approximated, and thus can be ignored at this stage with only a minimal loss of information. While all nodes in evolution trees are unique species, at this point only the leaf nodes need to be considered. Task 1 – Draw Given Trees (5 points) Given below are thirteen evolution trees. This represents the complete set of all non-isomorphic evolution trees of up to six leaves (11 nodes). Use a graphics package to recreate these trees. Give the classification for the tree and label its leaves with successive letters (a, b, c, etc…) Your design style may differ from the trees below, but the structure must be correct. a (1222) (122) (12) b) Figure 2 – Tree structure replacements a) Redundant node removed b) Isomorphic approximation (12222) (124) c b a) e d c a b d a b c d c a b a b (1224) (1242) (122222) f e e c a b c d d d e c a b a b (12224) (12242) (12422) f f e d e c a b c d d e c a b a b (1244a) (1244b) e a b c d f f c a b d e f Nifty Bioinformatics Assignments – Page 4 Task 2 – Generate Trees (5 points) Use a graphics package to create the eleven isomorphically unique evolutionary trees of 7 leaves (13 nodes). The following labels are the classifications for the unique trees: 1222222, 122224, 122242, 122422, 12244a, 12244b, 124222, 12424, 12442a, 12442b, and 1246. Grading: You will be graded on the following criteria: Accuracy Correctly drawing the evolutionary trees, attention to detail. Creativity Style, appearance, and consistency of the trees Extra credit will be given for including the following: Create the sixteen 8-leaf trees: 12222222, 1222224, 1222242, 1222422, 1224222, 1242222, 124224, 124242, 124422a, 124422b, 12444aa, 12444ab, 12444ba, 12444bb, 12462a, 12462b, 1248. Create the 9 leaf (17 node) non-isomorphic trees and classifications. Create the trees without the redundant node and isomorphic approximation assumptions, allowing for any number of offspring from a node. This will exponentially increase the number of possible trees. Graphs Diagram Drawing Guidelines Creating well-drawn graphs is an art form in itself. A diagram is a visual representation, and care should be taken to make sure diagrams are organized, readable, and easily understood. When using a graphics package to draw graphs, it is imperative that students use a consistent and visually pleasing style. Sloppy graph diagrams stick out like a sore thumb and can completely ruin a professional presentation. Follow the below guidelines for drawing your graphs. See the figures for examples of following and violating these guidelines. Vertices Labels Edges Graph All vertex markers should be the same size and shape. If necessary, different markers can represent types of vertices. It should be clear which vertex the label references. Use a reasonable location scheme for the labels. Vertex labels should not cover any vertex, edge, or other label. A logical labeling sequence may help in understanding the graph structure. All edges should be straight lines of the same thickness and connect to the center of the vertex markers. Edge highlighting can be done with thicker or grayed lines. Planar graphs should be drawn without any crossing edges. Curved edges are generally avoided, but may be used for artistic effect. The graph shape should be pleasing to the eye. Use a regular polygon or other meaningful shape whenever possible. Physical distances between vertices should be relatively consistent. Use the alignment and distribution tool within the graphics package. h b h b g c d c a b e a c e a e d d b f g e c d f a Nifty Bioinformatics Assignments – Page 5 Assignment 3 – Evolution Models II (Incremental k-Leaf Root) Dean Zeller CS10051 Spring, 2006 Due: Wednesday, March 22nd by 9:00 pm 10 points Objective The student will build on assignment 2 to create diagrams of the incremental k-leaf root evolution model. Level 1 – use of a graphics package and/or word processor required Readings Read, R.C., and R.J. Wilson, An Atlas of Graphs, Oxford Science Publications, 1999. Background: k-leaf root, cliques As an alternate to phylogenies, evolution can be modeled by a series of incremental graphs called k-leaf roots that contain visual information indicating the distances between species. The tree leaves serve as the graph vertices. Edges within the graph indicate the distance between the two species is no more than a specified k-value in a corresponding tree. A clique is defined as a subset of vertices such that all members are connected to all other members within the clique. An incremental k-leaf root model is essentially a series of (possibly overlapping) cliques. In the diagrams below, the corresponding cliques are labeled below the graphs. Instructions 1. Use a word processor to create a table similar to the example below. a. Set your document Page Setup to Landscape to give more space widthwise. b. The first cell in each row should contain the phylogenies from assignment 2. c. The column header should contain increasing k-values from two (2) up to the number of leaves in the tree. 2. Within each row… a. Draw the k-leaf root for increasing k-values. Start at k=2 and end when the graph is fully connected. Draw a left arrow () for graphs that show no change from the previous kvalue. b. Indicate the cliques created within each distance graph. 3. Create a phylogeny with at least 10 leaves and the corresponding incremental k-leaf root evolution model. 4. Review the graphs drawn so far. Indicate which distance graphs you think contain the most useful information for geneticists. Grading You will be graded on the following criteria: Accuracy Correctly drawing the evolutionary trees. Creativity Style, appearance, and consistency of the graphs Extra credit will be given for any of the following: Create more complex phylogenies for the report. Nifty Bioinformatics Assignments – Page 6 phylogeny (12) 2 leaves (3 vertices) k=3 k=2 a a k=4 k=5 k=4 k=5 b ab b phylogeny 3 leaves (5 vertices) k=3 a k=2 a (122) c c a b phylogeny (1222) b ab c abc b 4 leaves (7 vertices) k=3 k=2 a k=4 b a b a c d abc, cd c d k=5 b d c d a b (124) ab b a d ab, cd c d a abcd c b a b c d abcd 5 leaves (9 vertices) k=2 k=3 a a phylogeny (12222) e e b e c k=4 a b k=5 a e b e b d c d a b c d abc, cd, ef c ab c d abcd, cde c abcde a a (1224) d e b e b e c d a b c d ab, cd d c abcde Nifty Bioinformatics Assignments – Page 7 Assignment 4 – DNA Pattern Matching Statistics Dean Zeller CS10061 Spring, 2007 Due: _________________ 10 points Objective The student will create a Python program to calculate pattern matching statistics on DNA sequences. Level 2 – understanding and use of the Python programming language Background: Bioinformatics DNA can be represented digitally by long strings of A, C, G, and T. In this assignment, you will develop a program to perform statistics on a given DNA sequence. Definitions Sequence DNA Sequence Pattern A long string of characters. A long string of A’s, C’s, G’s, and T’s representing a DNA strand. A short string of characters to be searched within a sequence Concepts Introduced Strings as character arrays # can access each character in a string individually name = "Dean Zeller" print name[0], name[1], name[2], name[3] String length # len(string) returns the length of a string in characters print len(name) for i in range(len(name)): print name[i], File input # input from file instead of from the user inputfile = open("input.txt", "r") contents = inputfile.read() # contents contains entire file inputfile.close() File output # output to file instead of to the IDLE window outputfile = open("output.txt", "w") outputfile.write("This is my output. 4-3-2-1-Wheee! ") … # when finished writing, close the file outputfile.close() While loops # similar to if, but continues looping until condition is false response = raw_input("Would you like to do something? ") while response.upper()=="Y": print "Okay, do something! " response = raw_input("Would you like to do it again? ") Current Programs In DNAstats.py, user input consists of a filename containing a sequence of DNA. The output is a report of statistics for short patterns (i.e. number of A’s, number of C’s, etc.). It then allows the user to search for patterns of any length within the given sequence. RandomDNA.py is a program to generate synthetic DNA randomly for testing purposes. Read, understand, and test these programs before starting on the tasks. Nifty Bioinformatics Assignments – Page 8 Tasks DNAstats.py contains the framework for a program to statistically analyze DNA sequences. Implement at least three of the tasks described below. More can be completed for extra credit. 1. 2. 3. 4. 5. Implement all two-character patterns (16) in the statistics report. Implement all three-character patterns (64) in the statistics report. Put the entire main program into a while loop to allow the user to run statistics on multiple files. Write the statistics report and patterns searched to an output file. Prompt the user for the name of the output file. Use the program to find long patterns that occur more than once, short patterns that do not occur, and relationships between the patterns. 6. Find actual DNA sequences on the web and run them through the program. Research some meaningful patterns and look for them in some DNA sequences. 7. Use the program to create the data for a table similar to table 1. You may use Microsoft Excel to enter the values and calculate the percentages. 8. Repeat all analyses on the reverse of the string. To do this, create a new string that is the characters of the old string in reverse order and send through the analysis methods. 9. Create a loop to cycle through all possible patterns. This is not an easy task, but if someone is up to a challenge, see me for help. 10. Modify the randomDNA.py program to create random or weighted sequences. Compare your analysis with actual DNA files. 11. Own idea: write up a task idea of your own and implement within the program. Get your idea approved first by your instructor. Documentation Your instructor originally wrote this code, as noted in the documentation. Create documentation blocks for any new functions you create, and list yourself as the author. Turning in your assignment 1. Print a copy of your DNAstats program. If you made significant modifications to the randomDNA program, print that as well. 2. Print a test your program with the given input files and others that you create. 3. Write a report briefly describing the tasks completed. 4. Informally demonstrate the tasks completed to the class Grading You will be graded on the following criteria: Quantity Variety of tasks correctly implemented Readability Documentation indicating the lines of code created and modified Testing Testing your program on all input files Creativity Use new methods to solve the problem Extra Credit Extra points will be given for including the following features: Extra quantity Implement more than three tasks Report/Analysis Include a formal report and spreadsheet of your analysis Table 1: DNA Statistics A A C G T total avg # avg % C Pat # % Pat # AA 21756 35% AC 10882 CA 13824 22% CC 9404 GA 9802 16% GC 9117 TA 17170 27% TC 13128 62552 42531 15638 10633 25% G % 26% 22% 21% 31% 25% Pat # AG 9815 CG 5999 GG 6102 TG 12528 34444 8611 T % 28% 17% 18% 36% Pat # % total avg # avg % AT 20100 31% 62553 15638 30% CT 13304 20% 42531 10633 20% GT 9423 14% 34444 8611 17% TT 22839 35% 65665 16416 32% 65666 16417 25% 25% Nifty Bioinformatics Assignments – Page 9 DNAstats.py ######################################################################## # # # DNA Statistics 1.00 # # Written by: Dean Zeller # # # # This program was written for demonstration purposes for CS10061 # # (Introduction to Programming) at Kent State University. # # (C) 2007 Dean Zeller # # # # This program performs pattern matching statistics on DNA # # sequences. # # # ######################################################################## ######################################################################## # # # dnaTools # # This object contains various methods to perform pattern-mathing # # statistics on a given DNA sequence. # # # ######################################################################## class dnaTools(object): #################################################################### # printTitle # # Print the title centered in a space width characters wide # #################################################################### def printTitle(self, title, width): print "+" + "-"*(width-2) + "+" print "|" + title.center(width-2) + "|" print "+" + "-"*(width-2) + "+" #################################################################### # removeErrors # # Return DNA sequence with non-ACTG characters removed. # #################################################################### def removeErrors(self,seq): newSeq = "" for i in range(len(seq)): if seq[i]=="A" or seq[i]=="C" or seq[i]=="G" or seq[i]=="T": newSeq = newSeq + seq[i] return newSeq #################################################################### # stats # # Perform basic statistics on a given DNA sequence # #################################################################### def stats(self, seq): numA=0 numC=0 numG=0 numT=0 numX=0 # number of errors numAT=0 numCG=0 numCAT=0 numACT=0 for i in range(0,len(seq)): if seq[i]=='A': numA = numA+1 elif seq[i]=='C': numC = numC+1 elif seq[i]=='G': numG = numG+1 elif seq[i]=='T': numT = numT+1 else: numX = numX+1 # 1-character patterns Nifty Bioinformatics Assignments – Page 10 for i in range(1,len(seq)): # 2-character patterns if seq[i-1]=='A' and seq[i]=='T': numAT = numAT+1 elif seq[i-1]=='C' and seq[i]=='G': numCG = numCG+1 for i in range(2,len(seq)): # 3-character patterns if seq[i-2]=='C' and seq[i-1]=='A' and seq[i]=='T': numCAT = numCAT+1 elif seq[i-2]=='A' and seq[i-1]=='C' and seq[i]=='T': numACT = numACT+1 w=5 self.printTitle("Statistics Report",80) if numX > 0: print "WARNING -- There were",numX,"errors print "A:".rjust(w), str(numA).rjust(w), ' print "C:".rjust(w), str(numC).rjust(w), ' print "G:".rjust(w), str(numG).rjust(w), ' print "T:".rjust(w), str(numT).rjust(w) print "AT:".rjust(w), str(numAT).rjust(w), ' print "CG:".rjust(w), str(numCG).rjust(w), ' print "CAT:".rjust(w), str(numCAT).rjust(w), ' print "ACT:".rjust(w), str(numACT).rjust(w), ' print in the sequence." '.rjust(w), '.rjust(w), '.rjust(w), '.rjust(w), '.rjust(w) '.rjust(w), '.rjust(w) #################################################################### # matchSimplePattern # # Count the number of occurrances of pat (pattern) within # # seq (sequence) using a simple string-matching algorithm. # #################################################################### def matchSimplePattern(self, pat, seq): numPat=0 for i in range(len(seq)): if seq[i:i+len(pat)]==pat: numPat += 1 print numPat ######################################################################## # main program # # Demonstrate the methods defined in the dnaTools object. # ######################################################################## # get input file inputFileName = raw_input("Enter DNA Sequence filename: ") dnaFile = open(inputFileName,"r") dnaSequence = dnaFile.read() dnaFile.close() # set up tools D = dnaTools() # run stats on original sequence D.stats(dnaSequence) # run stats on fixed sequence dnaFixed = D.removeErrors(dnaSequence) D.stats(dnaFixed) # allow user to search for patterns while True: pattern = raw_input("Enter search pattern (blank to exit): ") if pattern=="": break D.matchSimplePattern(pattern.upper(),dnaFixed) print "Thanks, and have a nice day." Nifty Bioinformatics Assignments – Page 11 randomDNA.py import random size = 10000 # Equal probability sequence = "" for i in range(size): r = random.randrange(1,5) if r==1: sequence += "A" elif r==2: sequence += "C" elif r==3: sequence += "G" else: sequence += "T" print print size,"character random DNA sequence:" print sequence outputfile = open("random.txt","w") outputfile.write(sequence) outputfile.close() # Weighted probability sequence="" weightA = 10 weightC = 15 weightG = 5 weightT = 13 total = weightA + weightC + weightG + weightT for i in range(10000): r = random.randrange(0,total) if r < weightA: sequence += "A" elif r < weightA + weightC: sequence += "C" elif r < weightA + weightC + weightG: sequence += "G" else: sequence += "T" print print size,"character weighted DNA sequence:" print "weightA =",weightA,weightA*1.0/total print "weightC =",weightC,weightC*1.0/total print "weightG =",weightG,weightG*1.0/total print "weightT =",weightT,weightT*1.0/total print sequence outputfile = open("weighted.txt","w") outputfile.write(sequence) outputfile.close() Nifty Bioinformatics Assignments – Page 12 Assignment 5 – DNA Visualization Dean Zeller CS10061 Spring, 2007 Due: _________________ 10 points Objective The student will write a Python program to implement the chaos-game representation of DNA. Level 2 – understanding and use of Python with the Tkinter graphics library Readings H. Joel Jeffrey (1990). “Chaos game representation of gene structure.” Nucleic Acids Research, Vol 18, No. 8, pp 2163-2170. Background: Chaos Game Representation (CGR) It is difficult for humans to look at a bunch of numbers and find meaningful patterns. It is far more effective to provide a visual represention the statistics. Bar charts, pie charts, and line graphs are used to represent data in a visual manner. The chaos game representation is just one of many DNA visualization algorithms. Analyzing the data in this format can show some characteristics about the DNA structure. DNA sequences can be found at the National Center for Biotechnology Information website at http://www.ncbi.nlm.nih.gov. Interesting fractal patterns can be created from contrived examples of DNA. For example, a random sequence will fill the square uniformly, but a random sequence without any A’s will create the famous Sierpinski triangle fractal. Input The user must provide parameters for the drawing of the visualization. Collect these values from the user before execution. You may create more parameters as needed. Filename: the file containing the DNA sequence Interval size: how often to pause execution Dot size: the radius of the dot created at each point Output Follow the CGR algorithm to draw the appropriate dots for a given DNA sequence. Use the Interval Size variable to pause the drawing process for the user. Testing Test your program on a wide range of actual DNA sequences, random sequences, and repeated patterns. Compare the visualization to the report generated, and make note of any patterns found. Tasks Implement at least three of the following tasks using your chaos automata visualization: 1. 2. 3. 4. 5. 6. 7. Implement the chaos-automata algorithm using Python. Create visualizations for actual DNA sequences for organisms. Use the randomDNA.py program to generate DNA files with specific characteristics to create different artistic patterns. Create guidelines similar to the diagrams above to indicate the different quadrants. Allow the user to specify the characters represented for each corner. Change the color of the dot every so often. Or allow the user to change the color at every interval. Own idea: write up a task idea of your own and implement within the program. Get your idea approved first by your instructor. Documentation This assignment builds on the previous assignment. Create documentation blocks for any new functions you create, and list yourself as the author. Nifty Bioinformatics Assignments – Page 13 Turning in your assignment 1. Print a copy of your CGR program. If you made significant modifications to the randomDNA program, print that as well. 2. Print at least three interesting patterns generated by your program. 3. Write a report briefly describing the tasks completed. Grading You will be graded on the following criteria: Quantity Variety of tasks correctly implemented Readability Documentation indicating the lines of code created and modified Testing Testing your program on a variety of input files Creativity Use new methods to solve the problem Extra credit will be given for including the following features: Extra quantity Implement more than three tasks Report/Analysis Include a formal report and spreadsheet of your analysis CGR algorithm This assignment combines concepts from the previous assignment on DNA sequencing and material from the graphics assignments. This assignment will implement the CGR algorithm to visualize the input DNA. The algorithm is actually quite simple, once the graphics procedures are understood. Step 0: Create the DNA Square. Based on parameters from the user, create a square within the canvas where all points will be drawn. You will need to know the left, top, bottom, and right positions within the square. Step 1: Initial Point Create two variables, x and y, denoting the point to draw. The initial point should be the center of the square. Note: you do not draw the initial point; it just serves as a starting point. x = (left + right)/2 y = (left + right)/2 Step 2: Draw dots For each character in the DNA sequence a) Calculate the next point to draw, which is halfway between the current point and the appropriate corner for the letter. (A: top-left, C: top-right, G: bottom-right, T: bottom-left) b) Draw the current point Pseudocode The following is pseudocode for the CGR algorithm. Your job is to implement the algorithm using Python. x = (left + right)/2 y = (top + bottom)/2 for i each character in seq if seq[i]==’A’ x = (x+left)/2 y = (y+top)/2 else if seq[i]==’C’ x = (x+right)/2 y = (y+top)/2 else if seq[i]==’G’ x = (x+right)/2 y = (y+bottom)/2 else if seq[i]==’T’ x = (x+left)/2 y = (y+bottom)/2 drawDot(x,y) Interrupting Execution The following Python code will interrupt a loop every interval times. interval = 10 for i in range(1000): print i, if i%interval == 0: raw_input(“press return to continue”) Nifty Bioinformatics Assignments – Page 14 Example: Sequence: “ATAGCCTGTGA” A Initial Setup T A AT + A T A ATAGC + C T A T ATAGCCTG + T C A G T C A G T C A G T C A G T +A ATA + G ATAGCC + T ATAGCCTGT + G C A G T C A G T C A G T C A G T A+T C G ATAG + C C G ATAGCCT + G C G ATAGCCTGTG + A C G Nifty Bioinformatics Assignments – Page 15 Analysis Consider the diagrams below. The first contains additional guidelines for different size patterns. The final diagram is the chaos-automata result for the sequence ATAGCCTGTGA. Note the correspondence between the final pattern. A C A C A C AA CA AC A C AAA CAA ACA CCA AAC CAC ACC CCC CC TAA GAA TCA GCA TAC GAC TCC GCC TA GA TC ATA CTA AGA CGA ATC CTC AGC CGC GC TTA GTA TGA GGA TTC GTC TGC GGC T G AT CT AG AAT CAT ACT CCT AAG CAG ACG CCG CG TAT GAT TCT GCT TAG GAG TCG GCG TT GT TG ATT CTT AGT CGT ATG CTG AGG CGG GG TTT T A A ATAGCCTGTGA C G T C A AA ATAGCCTGTGA CA CC AC G T C A GTT TGT GGT TTG GTG TGG GGG G ATAGCCTGTGA C AAA CAA ACA CCA AAC CAC ACC CCC TAA GAA TCA GCA TAC GAC TCC GCC TA GA TC ATA CTA AGA CGA ATC CTC AGC CGC GC TTA GTA TGA GGA TTC GTC TGC GGC T AT G CT AG AAT CAT ACT CCT AAG CAG ACG CCG CG TAT GAT TCT GCT TAG GAG TCG GCG TT GT TG ATT CTT AGT CGT ATG CTG AGG CGG GG TTT T G A: C: T: G: 3 2 3 3 T G CC: TA: GA: GC: AT: CT: AG: GT: TG: 1 1 1 1 1 1 1 1 2 GTT TGT GGT TTG GTG TGG GGG T G GCC: ATA: AGC: TGA: CCT: TAG: CTG: TGT: GTG: 1 1 1 1 1 1 1 1 1 Nifty Bioinformatics Assignments – Page 16 Assignment 6 – Longest Common Subsequence Dean Zeller CS33001 Fall, 2006 Due: _________________ 10 points Objective The student will write a C++ program to implement and test the dynamic programming approach to the longest common subsequence problem. Level 2 – understanding and use of the C++ programming language Topics covered This assignment is a review of your skills as a programmer using the string and two-dimensional array data structures. Background: Bioinformatics DNA can be represented digitally by long strings of A, C, G, and T. One of the most simplistic tests to determine genetic similarity is the longest subsequence of nucleotides common to both species. Determining this similarity value is of great use to bioinformatic scientists to determine position on an evolutionary tree. It is a common problem in computer science to determine the longest common substring of two strings, particularly useful in web search engines. In bioinformatics, the largest subsequence of two strings is far more useful (and difficult) to determine. It is simple to write a brute-force program to solve the problem. However, DNA strings can be millions of characters long, and thus a brute-force method could take hours (or even years) to give results for real-world data. This assignment implements the dynamic programming method of efficiently solving the longest common subsequence problem in O(n2) time. Program Requirements Input Prompt the user to enter two strings, representing the DNA sequence of two species. Alternately, the strings can be read in from a file. Error checking Before executing the algorithm, check each character in both strings to ensure only acceptable DNA letters are entered. Do not run the algorithm on input with erroneous data. In the event of an error, indicate how many characters are illegal, and recreate the string with the illegal letters highlighted (example: AGCTSACT agctSact). Output The program should output the dynamic subsequence table and the length of the longest subsequence. The actual subsequence string can be printed for extra credit. Program Architecture Functions Use functions to modularize your code. The following is a suggestion on how to organize your functions. functions.h header file for functions functions.cpp implementation of the functions main.cpp main program implementation that calls the functions Documentation Each function must include a documentation block describing the input, output, and method of solving the problem. Create documentation for the main program describing the program overall. Use proper indentation and self-descriptive variable names. Nifty Bioinformatics Assignments – Page 17 Grading You will be graded on the following criteria: Accuracy Correctly implementing and using the LCS algorithm Readability Documentation, descriptive variable names, and indenting Testing Ensuring the code works for a variety of inputs Creativity Using new methods to solve the problem Extra credit will be given for including the following features: File input Allow the user to enter a filename with the two input strings Subsequence Print the longest common subsequence in addition to the length Brute-force In addition to the dynamic programming method described in this assignment, also implement a brute-force solution to the problem. Create a short report that describes the program duration for a variety of input lengths. Turning in your assignment 1. Print all C++ and header files. 2. Print a test of your program with input sets given below. Algorithm Pseudocode Given below is the pseudocode to solve the problem. You are to implement the pseudocode in objectoriented C++ functions. Step 1: initialize variables string s = ‘ ‘ + string1 string t = ‘ ‘ + string2 Twidth = s.length Theight = t.length int T[Twidth][Theight] for (i=0; i<Twidth; i++) T[i,0] = 0 end-for for (j=0; j<Theight; j++) T[0,j] = 0 end-for Step 2: Run LCS algorithm for (j=1; j<Theight; j++) for (i=1; i<Twidth; i++) if (s[i] == t[j]) // match in subsequence T[i,j] = T[i-1,j-1] + 1 else // no match, use previous value T[i,j] = max (T[i-1,j], T[i,j-1]) end-if end-for end-for subsequenceLength = T[Twidth][Theight] Nifty Bioinformatics Assignments – Page 18 Test Input Sets Test your program on the following sets of input data. Input Set #1 String1: ATCCGACAAC String2: ATCGCATCTT Output: 7 (ATCGCAC) Input Set #2 String1: AACGTTCOGMA String2: GGATACCASAT Output: errors in String1 (aacgttcOgMa) error in String 2 (ggataccaSat) Input Set #3 String1: AAAATTTT String2: TAAATG Output: 4 (AAAT) Input Set #4 String1: TAGTAGTAGTAGTAGTAG String2: CATCATCATCATCA Output: 8 (ATATATAT) or 8 (CACACACA) Example of Method Rule for T[i,j]: if (string1[i] == string2[j]) T[i,j] = T[i-1,j-1] + 1 otherwise T[i,j] = maximum (T[i-1,j], T[i,j-1]) string1 = “AAAATTTT” string2 = “TAAATG” string1 = “ATCCGACAAC” string2 = “ATCGCATCTT” 0 A A A A 0 0 0 0 G A C A A C 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 T 0 2 2 2 2 2 2 2 A 0 3 3 3 3 3 3 A 0 4 4 4 4 4 4 A 0 1 2 3 3 5 T 0 1 2 3 3 4 4 4 4 6 G 0 1 2 3 3 4 4 4 4 1 T 0 1 2 C 0 1 2 G 0 1 2 +1 2 +1 3 3 +1 1 2 3 +1 3 3 +1 +1 4 1 +1 +1 4 +1 0 T C 0 A T C A 0 T T 0 +1 C T A 4 5 +1 2 3 4 4 5 4 4 5 +1 0 +1 1 +1 1 +1 0 +1 1 +1 2 +1 0 +1 1 +1 2 +1 5 5 0 +1 0 +1 0 +1 0 1 1 1 1 1 1 1 1 1 2 2 2 2 2 +1 +1 +1 +1 +1 0 +1 3 +1 3 +1 3 +1 3 +1 +1 5 6 6 5 6 6 +1 T 0 1 C 0 1 T 0 1 T 0 1 2 3 +1 +1 +1 6 +1 2 3 4 4 5 6 6 6 7 2 3 4 5 6 6 6 6 7 2 3 4 5 6 6 6 6 7 +1 +1 Nifty Bioinformatics Assignments – Page 19