Longest Common Subsequence

Assignment 3 – Longest Common Subsequence
Dean Zeller
CS33001
Fall, 2006
Due: Thursday, September 21st by 2:00 PM
10 points
Objective:
The student will write a C++ program to implement a dynamic programming
approach to the longest common subsequence problem.
Topics covered
This assignment is a review of your skills as a programmer using the string and two-dimensional
array data structures.
Background: Bioinformatics
From Wikipedia:
Deoxyribonucleic acid (DNA) is a nucleic acid — usually in the form of a double helix — that contains the genetic
instructions (or genocode) monitoring the biological development of all cellular forms of life. DNA is a long
polymer of nucleotides and encodes the sequence of the amino acid residues in proteins using the genetic code of
nucleotides. DNA is thought to date back to between approximately 3.5 to 4.6 billion years ago.
Computers execute machine code, a series of 0’s and 1’s. The machine code for living organisms is DNA, a
sequence of four nucleotides: adenine, cytosine, guanine, and thymine. Machine code and DNA are very
similar in theoretical structure, so a technique that is useful in computer science can also be useful in
genetics. DNA can be represented digitally by long strings of A, C, G, and T. One of the simplest tests
of genetic similarity is the length of the longest subsequence of nucleotides common to the DNA of both
species. Bioinformatics researchers use this similarity value to help determine position on an evolutionary tree.
(Evolutionary trees will be discussed later in the semester.)
Finding the longest common substring of two strings is a common problem in computer science and is
particularly useful in web search engines. In bioinformatics, the longest common subsequence of two strings is
far more useful (and more difficult) to determine; unlike a substring, a subsequence does not have to be
contiguous. It is simple to write a brute-force program to solve the problem, but DNA strings can be millions
of characters long, so a brute-force method could take years (or even centuries) to give results for real-world
data. This assignment implements a dynamic programming method that solves the longest common
subsequence problem efficiently.
Program Requirements
Input: Prompt the user to enter two strings, representing the DNA sequences of two species.
Alternatively, the strings can be read in from a file.
Error checking: Before executing the algorithm, check each character in both strings to ensure only
acceptable DNA letters are entered. Do not run the algorithm on input with erroneous
data. In the event of an error, indicate how many characters are illegal, and recreate
the string with the illegal letters highlighted (example: AGCTSACT -> agctSact).
Output: The program should output the dynamic subsequence table and the length of the
longest subsequence. The actual subsequence string can be printed for extra credit.
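The error-checking requirement could be sketched in C++ as follows. This is a minimal sketch, not a required design: the function names countIllegal and highlightIllegal are hypothetical, and it assumes that only the uppercase letters A, C, G, and T count as legal input.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: true only for the four uppercase DNA letters.
// Assumption: lowercase input is treated as illegal.
bool isLegalNucleotide(char c) {
    return c == 'A' || c == 'C' || c == 'G' || c == 'T';
}

// Hypothetical helper: counts the characters that are not legal DNA letters.
std::size_t countIllegal(const std::string& dna) {
    std::size_t count = 0;
    for (char c : dna) {
        if (!isLegalNucleotide(c)) {
            ++count;
        }
    }
    return count;
}

// Hypothetical helper: recreates the string with legal letters in lowercase
// and illegal letters in uppercase, so the illegal ones stand out.
std::string highlightIllegal(const std::string& dna) {
    std::string out;
    out.reserve(dna.size());
    for (char c : dna) {
        const unsigned char u = static_cast<unsigned char>(c);
        out += static_cast<char>(isLegalNucleotide(c) ? std::tolower(u)
                                                      : std::toupper(u));
    }
    return out;
}
```

With these helpers, highlightIllegal("AGCTSACT") gives agctSact and countIllegal("AGCTSACT") gives 1, matching the example in the requirement. The main program would run the LCS algorithm only when countIllegal returns 0 for both strings.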
Program Architecture
Functions: Use functions to modularize your code. The following is a suggestion on how to
organize your functions.
    functions.h      header file for functions
    functions.cpp    implementation of the functions
    main.cpp         main program implementation that calls the functions
Documentation: Each function must include a documentation block describing the input, output, and
method of solving the problem. Create documentation for the main program
describing the program overall. Use proper indentation and self-descriptive variable
names.
Written by Dean Zeller. Edited by Jasmine Boscom, Daryl Popig, and John Withers.
Grading
You will be graded on the following criteria:
Accuracy: Correctly implementing and using the LCS algorithm
Readability: Documentation, descriptive variable names, and indenting
Testing: Ensuring the code works for a variety of inputs
Creativity: Using new methods to solve the problem
Extra Credit
Extra points will be given for including the following features:
File input: Allow the user to enter a filename with the two input strings
Subsequence: Print the longest common subsequence in addition to the length
Brute-force: In addition to the dynamic programming method described in this assignment, also
implement a brute-force solution to the problem. Create a short report that describes
the program duration for a variety of input lengths.
Experiment: Reconstruct the evolution tree on the species described in Input Set #5. Run your
program on each pair of species in Input Set #5 below. Use the similarity values to
reconstruct a possible evolution tree, in which species that are more similar are closer
on the tree.
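For the brute-force extra credit item, one possible baseline is a naive recursion that tries every way of skipping a letter. This is only a sketch of one reading of "brute-force"; the names lcsBruteForce and bruteForceMilliseconds are hypothetical, and the timing helper assumes std::chrono::steady_clock is an acceptable way to measure duration for the report.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <string>

// Naive recursion with no table: compares s starting at position i with
// t starting at position j. The running time grows exponentially with the
// input length, so this is only practical for short strings.
int lcsBruteForce(const std::string& s, const std::string& t,
                  std::size_t i = 0, std::size_t j = 0) {
    if (i == s.size() || j == t.size()) {
        return 0;  // one string is used up, nothing left to match
    }
    if (s[i] == t[j]) {
        // Matching letters: count the match and advance in both strings.
        return 1 + lcsBruteForce(s, t, i + 1, j + 1);
    }
    // No match: try skipping a letter in either string and keep the best.
    return std::max(lcsBruteForce(s, t, i + 1, j),
                    lcsBruteForce(s, t, i, j + 1));
}

// Hypothetical timing helper for the report: milliseconds taken by one call.
double bruteForceMilliseconds(const std::string& s, const std::string& t) {
    const auto start = std::chrono::steady_clock::now();
    volatile int length = lcsBruteForce(s, t);  // volatile keeps the call alive
    (void)length;
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling bruteForceMilliseconds on strings of increasing length, and timing the dynamic programming version the same way, gives the numbers needed for the duration report.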
Turning in your assignment
1. Print all C++ and header files.
2. Print a test run of your program on the input sets given below.
Algorithm Pseudocode
Given below is the pseudocode to solve the problem. You are to implement the pseudocode in object-oriented
C++ functions. Note that this pseudocode has not yet been tested.
Step 1: initialize variables
    string s = string1
    string t = string2
    Twidth = s.length + 1
    Theight = t.length + 1
    int T[Twidth][Theight]
    for (i=0; i<Twidth; i++)
        T[i,0] = 0
    end-for
    for (j=0; j<Theight; j++)
        T[0,j] = 0
    end-for
Step 2: Run LCS algorithm
    for (j=1; j<Theight; j++)
        for (i=1; i<Twidth; i++)
            // strings are indexed from 0, so table cell (i,j) compares s[i-1] and t[j-1]
            if (s[i-1] == t[j-1])
                // match in subsequence
                T[i,j] = T[i-1,j-1] + 1
            else
                // no match, use previous value
                T[i,j] = max (T[i-1,j], T[i,j-1])
            end-if
        end-for
    end-for
    // the last valid cell is one less than the table dimensions
    subsequenceLength = T[Twidth-1,Theight-1]
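Steps 1 and 2 translate fairly directly into C++. The sketch below is one possible implementation, not the required one; the names Table, buildLcsTable, and lcsLength are hypothetical. It uses std::vector instead of a raw two-dimensional array (an assumption, made so the table size can depend on the input), and it compares s[i-1] with t[j-1] because C++ strings are indexed from 0 while the table's first row and column hold the zeros.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical alias: T[i][j] is the LCS length of the first i letters of s
// and the first j letters of t.
using Table = std::vector<std::vector<int>>;

// Step 1 and Step 2: build the full dynamic programming table.
Table buildLcsTable(const std::string& s, const std::string& t) {
    // Every cell starts at 0, which covers the T[i,0] and T[0,j] borders.
    Table T(s.size() + 1, std::vector<int>(t.size() + 1, 0));

    for (std::size_t j = 1; j <= t.size(); ++j) {
        for (std::size_t i = 1; i <= s.size(); ++i) {
            if (s[i - 1] == t[j - 1]) {
                // match in subsequence
                T[i][j] = T[i - 1][j - 1] + 1;
            } else {
                // no match, use the larger of the two previous values
                T[i][j] = std::max(T[i - 1][j], T[i][j - 1]);
            }
        }
    }
    return T;
}

// The answer is the last cell of the table.
int lcsLength(const std::string& s, const std::string& t) {
    const Table T = buildLcsTable(s, t);
    return T[s.size()][t.size()];
}
```

The table has (s.length + 1) x (t.length + 1) cells and each cell is filled once, so the work grows with the product of the two string lengths.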
Step 3: Determine subsequence string (extra credit)
To recreate the string, you must keep track of the times the algorithm found a matching letter in the
sequence. The exact method to keep track of the sequence matches is up to you.
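The tracking method is left open. One common approach, shown here only as an assumption about how it might be done, is to skip the bookkeeping during Step 2 and instead walk backward through the finished table from the last cell. The function name lcsString is hypothetical, and the sketch rebuilds the table itself so that it stands alone.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical function: returns one longest common subsequence of s and t.
std::string lcsString(const std::string& s, const std::string& t) {
    // Build the table exactly as in Steps 1 and 2.
    std::vector<std::vector<int>> T(s.size() + 1,
                                    std::vector<int>(t.size() + 1, 0));
    for (std::size_t j = 1; j <= t.size(); ++j) {
        for (std::size_t i = 1; i <= s.size(); ++i) {
            if (s[i - 1] == t[j - 1]) {
                T[i][j] = T[i - 1][j - 1] + 1;
            } else {
                T[i][j] = std::max(T[i - 1][j], T[i][j - 1]);
            }
        }
    }

    // Walk backward from the last cell, collecting matched letters.
    std::string result;
    std::size_t i = s.size();
    std::size_t j = t.size();
    while (i > 0 && j > 0) {
        if (s[i - 1] == t[j - 1]) {
            result += s[i - 1];  // this letter is part of the subsequence
            --i;
            --j;
        } else if (T[i - 1][j] >= T[i][j - 1]) {
            --i;  // the value came from the neighbor with smaller i
        } else {
            --j;  // the value came from the neighbor with smaller j
        }
    }

    // Letters were collected last to first, so reverse them.
    std::reverse(result.begin(), result.end());
    return result;
}
```

When several subsequences share the maximum length, this walk returns only one of them; which one depends on how ties between the two neighbors are broken.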
Test Input Sets
Test your program on the following sets of input data.
Input Set #1
String1: ATCCGACAAC
String2: ATCGCATCTT
Output: 7 (ATCGCAC)
Input Set #2
String1: AACGTTCOGMA
String2: GGATACCASAT
Output: errors in String1 (aacgttcOgMa)
        error in String2 (ggataccaSat)
Input Set #3
String1: AAAATTTT
String2: TAAATG
Output: 4 (AAAT)
Input Set #4
String1: TAGTAGTAGTAGTAGTAG
String2: CATCATCATCATCA
Output: 9 (ATATATATA)
Input Set #5 (extra credit)
Use your program to determine which of the following DNA
sequences are most similar. This input set is to simulate a real
genetic experiment. In a real experiment there could be many
species, each DNA sequence thousands of characters long.
String1: GTCACTTCACGGGTACAGACTTAAACG
String2: ACCATTACGGCGATACCAGGATAC
String3: TTTTATTTAGGACAGACTAGACCAGGT
String4: CCGTAGATCGATACGATACCCACCTCAGG
String5: CCGTAGATCGATACGATACCCACCTCAGG
Example of Method
Rule for T[i,j]:
if (string1[i] == string2[j])
T[i,j] = T[i-1,j-1] + 1
otherwise
T[i,j] = maximum (T[i-1,j], T[i,j-1])
In each table below, string1 runs across the top and string2 runs down the left side. Cells where the
column letter equals the row letter are filled by the +1 rule; every other cell takes the maximum of its
two previous neighbors. The last cell of the table holds the length of the longest common subsequence.

string1 = “AAAATTTT”
string2 = “TAAATG”

         A  A  A  A  T  T  T  T
      0  0  0  0  0  0  0  0  0
   T  0  0  0  0  0  1  1  1  1
   A  0  1  1  1  1  1  1  1  1
   A  0  1  2  2  2  2  2  2  2
   A  0  1  2  3  3  3  3  3  3
   T  0  1  2  3  3  4  4  4  4
   G  0  1  2  3  3  4  4  4  4

Length: 4

string1 = “ATCCGACAAC”
string2 = “ATCGCATCTT”

         A  T  C  C  G  A  C  A  A  C
      0  0  0  0  0  0  0  0  0  0  0
   A  0  1  1  1  1  1  1  1  1  1  1
   T  0  1  2  2  2  2  2  2  2  2  2
   C  0  1  2  3  3  3  3  3  3  3  3
   G  0  1  2  3  3  4  4  4  4  4  4
   C  0  1  2  3  4  4  4  5  5  5  5
   A  0  1  2  3  4  4  5  5  6  6  6
   T  0  1  2  3  4  4  5  5  6  6  6
   C  0  1  2  3  4  4  5  6  6  6  7
   T  0  1  2  3  4  4  5  6  6  6  7
   T  0  1  2  3  4  4  5  6  6  6  7

Length: 7