Assignment 3 – Longest Common Subsequence
Dean Zeller [1]
CS33001, Fall 2006
Due: Thursday, September 21st by 2:00 PM
10 points

Objective:
The student will write a C++ program to implement a dynamic programming approach to the longest common subsequence problem.

Topics covered
This assignment is a review of your skills as a programmer using the string and two-dimensional array data structures.

Background: Bioinformatics
From Wikipedia: Deoxyribonucleic acid (DNA) is a nucleic acid, usually in the form of a double helix, that contains the genetic instructions (or genocode) monitoring the biological development of all cellular forms of life. DNA is a long polymer of nucleotides and encodes the sequence of the amino acid residues in proteins using the genetic code of nucleotides. DNA is thought to date back to between approximately 3.5 and 4.6 billion years ago.

Computers execute machine code, a series of 0’s and 1’s. The machine code for living organisms is DNA, a sequence of four nucleotides: adenine, cytosine, guanine, and thymine. Machine code and DNA are very similar in theoretical structure. Thus, a technique that is useful in computer science can also be useful in genetics.

DNA can be represented digitally by long strings of A, C, G, and T. One of the simplest tests of genetic similarity between two species is the length of the longest subsequence of nucleotides common to both. This similarity value helps bioinformatics scientists place species on an evolutionary tree. (Evolution trees will be discussed later in the semester.)

It is a common problem in computer science to determine the longest common substring of two strings, which is particularly useful in web search engines. A substring must be contiguous; a subsequence keeps the original order of the characters but may skip some of them. In bioinformatics, the longest common subsequence of two strings is far more useful (and more difficult) to determine. It is simple to write a brute-force program to solve the problem.
However, DNA strings can be millions of characters long, and thus a brute-force method could take years (or even centuries) to give results for real-world data. This assignment implements a dynamic programming method of efficiently solving the longest common subsequence problem.

Program Requirements

Input
Prompt the user to enter two strings, representing the DNA sequences of two species. Alternatively, the strings can be read in from a file.

Error checking
Before executing the algorithm, check each character in both strings to ensure only acceptable DNA letters are entered. Do not run the algorithm on input with erroneous data. In the event of an error, indicate how many characters are illegal, and recreate the string with the illegal letters highlighted (example: AGCTSACT becomes agctSact).

Output
The program should output the dynamic programming table and the length of the longest common subsequence. The actual subsequence string can be printed for extra credit.

Program Architecture

Functions
Use functions to modularize your code. The following is a suggestion on how to organize your functions.
    functions.h      header file for functions
    functions.cpp    implementation of the functions
    main.cpp         main program implementation that calls the functions

Documentation
Each function must include a documentation block describing the input, output, and method of solving the problem. Create documentation for the main program describing the program overall. Use proper indentation and self-descriptive variable names.

[1] Written by Dean Zeller. Edited by Jasmine Boscom, Daryl Popig, and John Withers.
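The error-checking requirement above can be sketched as follows. This is only one possible approach, not a required design: the names DnaCheck and checkDna are hypothetical, and the sketch assumes that lowercase a, c, g, and t also count as legal input. Legal letters are written in lowercase and illegal ones in uppercase, matching the agctSact example; an illegal character that is not a letter is copied unchanged.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical result type: how many characters were illegal, plus a copy
// of the input with legal letters lowercased and illegal ones uppercased.
struct DnaCheck {
    std::size_t illegalCount;
    std::string highlighted;
};

// Hypothetical helper: checks one DNA string against the letters A, C, G, T.
// Assumption: lowercase a, c, g, t are accepted as legal.
DnaCheck checkDna(const std::string& dna) {
    DnaCheck result{0, ""};
    result.highlighted.reserve(dna.size());
    for (unsigned char ch : dna) {
        char upper = static_cast<char>(std::toupper(ch));
        bool legal = (upper == 'A' || upper == 'C' ||
                      upper == 'G' || upper == 'T');
        if (legal) {
            result.highlighted += static_cast<char>(std::tolower(ch));
        } else {
            ++result.illegalCount;
            result.highlighted += upper;  // non-letters stay as they are
        }
    }
    return result;
}
```

A caller would run checkDna on both input strings and start the LCS algorithm only if both illegalCount values are zero.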
Grading
You will be graded on the following criteria:
    Accuracy       Correctly implementing and using the LCS algorithm
    Readability    Documentation, descriptive variable names, and indenting
    Testing        Ensuring the code works for a variety of inputs
    Creativity     Using new methods to solve the problem

Extra Credit
Extra points will be given for including the following features:
    File input     Allow the user to enter a filename with the two input strings
    Subsequence    Print the longest common subsequence in addition to the length
    Brute-force    In addition to the dynamic programming method described in this assignment, also implement a brute-force solution to the problem. Create a short report that describes the program duration for a variety of input lengths.
    Experiment     Reconstruct the evolution tree on the species described in Input Set #5. Run your program on each pair of species in Input Set #5 below. Use the similarity values to reconstruct a possible evolution tree, in which species that are more similar are closer on the tree.

Turning in your assignment
1. Print all C++ and header files.
2. Print a test of your program with input sets given below.

Algorithm Pseudocode
Given below is the pseudocode to solve the problem. You are to implement the pseudocode in object-oriented C++ functions. Note that this pseudocode has not yet been tested.
Step 1: initialize variables
    string s = string1
    string t = string2
    Twidth = s.length + 1
    Theight = t.length + 1
    int T[Twidth][Theight]
    for (i=0; i<Twidth; i++)
        T[i,0] = 0
    end-for
    for (j=0; j<Theight; j++)
        T[0,j] = 0
    end-for

Step 2: Run LCS algorithm
    for (j=1; j<Theight; j++)
        for (i=1; i<Twidth; i++)
            if (s[i-1] == t[j-1])    // match in subsequence (string positions start at 0)
                T[i,j] = T[i-1,j-1] + 1
            else                     // no match, use previous value
                T[i,j] = max (T[i-1,j], T[i,j-1])
            end-if
        end-for
    end-for
    subsequenceLength = T[Twidth-1,Theight-1]    // last valid cell of the table

Step 3: Determine subsequence string (extra credit)
To recreate the string, you must keep track of the times the algorithm found a matching letter in the sequence. The exact method to keep track of the sequence matches is up to you.

Test Input Sets
Test your program on the following sets of input data.

Input Set #1
    String1: ATCCGACAAC
    String2: ATCGCATCTT
    Output: 7 (ATCGCAC)

Input Set #5 (extra credit)
Use your program to determine which of the following DNA sequences are most similar. This input set is to simulate a real genetic experiment. In a real experiment there could be many species, each with a DNA sequence thousands of characters long.
    String1: GTCACTTCACGGGTACAGACTTAAACG
    String2: ACCATTACGGCGATACCAGGATAC
    String3: TTTTATTTAGGACAGACTAGACCAGGT
    String4: CCGTAGATCGATACGATACCCACCTCAGG
    String5: CCGTAGATCGATACGATACCCACCTCAGG

Input Set #2
    String1: AACGTTCOGMA
    String2: GGATACCASAT
    Output: errors in String1 (aacgttcOgMa), error in String2 (ggataccaSat)

Input Set #3
    String1: AAAATTTT
    String2: TAAATG
    Output: 4 (AAAT)

Input Set #4
    String1: TAGTAGTAGTAGTAGTAG
    String2: CATCATCATCATCA
    Output: 9 (ATATATATA)

Example of Method
Rule for T[i,j], where i counts the letters of string1 and j counts the letters of string2, both starting from 1:
    if the i-th letter of string1 equals the j-th letter of string2
        T[i,j] = T[i-1,j-1] + 1
    otherwise
        T[i,j] = maximum (T[i-1,j], T[i,j-1])

In the tables below, the columns follow string1 and the rows follow string2. The bottom-right entry of each table is the length of the longest common subsequence.

string1 = “AAAATTTT”   string2 = “TAAATG”

          A  A  A  A  T  T  T  T
       0  0  0  0  0  0  0  0  0
    T  0  0  0  0  0  1  1  1  1
    A  0  1  1  1  1  1  1  1  1
    A  0  1  2  2  2  2  2  2  2
    A  0  1  2  3  3  3  3  3  3
    T  0  1  2  3  3  4  4  4  4
    G  0  1  2  3  3  4  4  4  4

string1 = “ATCCGACAAC”   string2 = “ATCGCATCTT”

          A  T  C  C  G  A  C  A  A  C
       0  0  0  0  0  0  0  0  0  0  0
    A  0  1  1  1  1  1  1  1  1  1  1
    T  0  1  2  2  2  2  2  2  2  2  2
    C  0  1  2  3  3  3  3  3  3  3  3
    G  0  1  2  3  3  4  4  4  4  4  4
    C  0  1  2  3  4  4  4  5  5  5  5
    A  0  1  2  3  4  4  5  5  6  6  6
    T  0  1  2  3  4  4  5  5  6  6  6
    C  0  1  2  3  4  4  5  6  6  6  7
    T  0  1  2  3  4  4  5  6  6  6  7
    T  0  1  2  3  4  4  5  6  6  6  7
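The table-filling rule in the example above can be sketched in C++ as follows. This is a minimal illustration under stated assumptions, not the required solution: the names Table, buildLcsTable, and traceLcs are hypothetical, the table is stored as a vector of vectors rather than a raw two-dimensional array, and the traceback shown is just one way to recover a subsequence for the extra-credit step (it walks backward from the last cell instead of recording matches while filling).

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical alias: table[i][j] holds the LCS length of the first i
// letters of s and the first j letters of t.
using Table = std::vector<std::vector<int>>;

// Fill the table with the match / no-match rule. Row 0 and column 0 stay 0.
Table buildLcsTable(const std::string& s, const std::string& t) {
    Table table(s.size() + 1, std::vector<int>(t.size() + 1, 0));
    for (std::size_t i = 1; i <= s.size(); ++i) {
        for (std::size_t j = 1; j <= t.size(); ++j) {
            if (s[i - 1] == t[j - 1]) {
                // Match: extend the subsequence found without these letters.
                table[i][j] = table[i - 1][j - 1] + 1;
            } else {
                // No match: carry over the better neighbouring value.
                table[i][j] = std::max(table[i - 1][j], table[i][j - 1]);
            }
        }
    }
    return table;
}

// Walk back from the last cell to recover one longest common subsequence.
// When several exist, which one is returned depends on the tie-breaking here.
std::string traceLcs(const Table& table, const std::string& s,
                     const std::string& t) {
    std::string result;
    std::size_t i = s.size();
    std::size_t j = t.size();
    while (i > 0 && j > 0) {
        if (s[i - 1] == t[j - 1]) {
            result += s[i - 1];  // this letter is part of the subsequence
            --i;
            --j;
        } else if (table[i - 1][j] >= table[i][j - 1]) {
            --i;
        } else {
            --j;
        }
    }
    std::reverse(result.begin(), result.end());  // letters were collected backward
    return result;
}
```

For Input Set #3, buildLcsTable("AAAATTTT", "TAAATG") ends with 4 in its last cell, and traceLcs on that table returns AAAT. The table costs time and memory proportional to the product of the two string lengths, which is the trade-off against the brute-force approach.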