Nifty Bioinformatics Assignments - Computer Science

Nifty Bioinformatics Assignments
By Dean Zeller
Kent State University
Summer, 2007
dzeller@cs.kent.edu
http://www.cs.kent.edu/~dzeller
Table of Contents
Assignment 1 – DNA Modeling
Assignment 2 – Evolution Models I (Binary Evolution Trees)
Assignment 3 – Evolution Models II (Incremental k-Leaf Root)
Assignment 4 – DNA Pattern Matching Statistics
Assignment 5 – DNA Visualization
Assignment 6 – Longest Common Subsequence
Introduction
The following is a collection of assignments relating to bioinformatics.
All assignments have been written in the past two years and were tested in
an actual classroom environment. Bioinformatics itself is an involved and
complex field, combining concepts of biology, mathematics, and computer
science. However, the introductory topics of bioinformatics are not
difficult and do not require college-level preparation. The prerequisite
knowledge for these assignments is purposely low, and the assignments are
intended to “spark” potential interest within students.
Copyright notice
These assignments and accompanying materials are copyright ©2007
Dean Zeller and are all available at the author’s web site. Permission is
granted to copy and use these assignments for personal or educational
purposes. Please contact the author with questions, comments, or
feedback.
Assignment 1 – DNA Modeling
Dean Zeller
C&I 47330
Fall, 2006
Due: _________________
10 points
Objective
The student will create structurally accurate models of DNA out of pipe cleaners.
Materials
pipe cleaners, scissors, 3x5 card (optional), hole punch (optional)
Level
0 – no experience required
Topics
This assignment introduces students to concepts of modeling using DNA as the modeled system. It is
appropriate for any age over 5, is challenging, and covers a breadth of useful skills in a wide variety
of areas. This assignment is suitable for any mathematics, science, or art class. No prior experience or
knowledge is required to complete the assignment.
Background: Modeling
Chemists, mathematicians, and many other scientific professionals use models as visual representations of
structures. One important aspect of modeling is structural accuracy, i.e. the extent to which the model
represents key elements of the system it is modeling. In this assignment, the main element to represent
about DNA is the bonding of AT and CG nucleotide pairs. As you increase in skills, there are other
elements that can be added to increase accuracy.
When crafting a model, structural integrity (or stability) is important for the model to keep its
shape. It cannot fall apart while being analyzed or modified. There are many materials used in
modeling of this nature. For this assignment, the modeling material is pipe cleaners because of their
ability to bend and keep their shape. The weak point of a pipe cleaner model is how the pieces are put
together. Pieces can be connected through a variety of twisting, wrapping, and coiling. Doing this
correctly takes skill and practice.
Instructions
Follow the steps below to create a single DNA strand of 10 nucleotides. Create additional strands as
time permits. See the poster for picture diagrams of the instructions.
Step 0: Setup
A. Select colors for DNA stalks and nucleotides (A, T, C, G).
B. Cut the nucleotide pipe cleaners in fourths.
C. Put the nucleotides into AT and CG pairs.
Step 1: Create nucleotides
Use these steps to create ten (10) nucleotide pairs.
A. Select two pipe cleaner pieces of the appropriate colors to form a nucleotide pair (AT or CG).
B. Put in a V-shape with a small overlap (1/4”). Choose a long end and the corresponding short end
on the same side of the V.
C. Coil the long end three or four revolutions around the other short end, covering it completely.
There should be some extra left over to attach to a stalk.
D. Bend the remaining long and short ends 90° to form a T-shape.
E. Repeat step 1C with the remaining long and short ends.
Nifty Bioinformatics Assignments – Page 1
Step 2: Connect Nucleotides to Stalks
Attach nucleotides to stalks to form a “ladder”
A. Attach one side of nucleotide to stalk by wrapping extra length around stalk twice (once on the
left, and once on the right). Wrap any remaining length around itself.
B. Repeat for other nucleotides, spreading evenly across the stalk.
C. Attach opposite stalk in similar way. Leave enough room at ends to connect to other DNA
strands.
Step 3: …and do the Twist
In “ladder” form, the DNA strand may look inconsistent in size. When twisted into a double-helix form,
the inconsistencies tend to work themselves out.
Further Activities
• Create a model of an actual DNA sequence.
• Longer strands: strands of length 100 can be completed in about two hours, once the procedure is
learned and practiced.
• In groups, use an “assembly line” approach to create a very long DNA strand.
• Class contests: longest DNA strand, strongest DNA strand.
Grading
You will be graded on the following criteria:
Accuracy
Correct representation of AT/CG pairs in the model
Stability
Structural integrity of the model
Effort
Quantity, length, and quality of models created
Extra credit will be given for the following:
• Use a 3x5 card to label the model with the appropriate model nucleotide characters (both sides of
the DNA strand).
• Connect models together to create longer models.
Assignment 2 – Evolution Models I (Binary Evolution Trees)
Dean Zeller
CS10051
Spring, 2006
Due: _________________
10 points
Objective
The student will use a graphics package to create diagrams of binary evolution trees.
Level
1 – use of a graphics package and/or word processor required
Readings
Read, R.C., and R.J. Wilson, An Atlas of Graphs, Oxford Science Publications, 1999.
Background: Evolution Trees
This assignment deals with the cutting-edge topic of bioinformatics. It is a complex field of graph
theory with applications to mathematics, computer science, and genetics. An evolutionary tree (or
phylogeny) is a tree-structure that demonstrates evolution of species over time. A tree consists of
vertices (nodes) connected by edges (links). In evolution, a node represents a point in which a species
population “splits” into two genetically different species. Once a species splits, the two resulting species
are genetically unable to produce offspring with each other. The leaves of the tree represent the extant
(non-extinct) species.
Background: Isomorphism
Two trees are isomorphic if they have the same structure and differ only in the left-to-right order of the
subtrees at one or more nodes (a symmetry at that node). To simplify the problem, isomorphic trees are
considered the same for purposes of this assignment.
[Figure 1: six tree diagrams, A through F. Trees A, B, C, and D are each labeled (1242), with internal nodes marked x, y, and z. Trees E and F carry the labels (1244a) and (1244b).]
Figure 1 – Isomorphic trees
A and B are isomorphic at node x.
B and C are isomorphic at node y.
C and D are isomorphic at node z.
A and D are isomorphic at node y.
As such, A, B, C, and D are all isomorphic, and thus
considered the same for purposes of this assignment.
E and F are not isomorphic because they are
structurally different.
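The isomorphism test described above can be sketched in a few lines. This is only an illustration under stated assumptions, not part of the original assignment: the tuple representation and the names `canonical`, `isomorphic`, `tree_a`, `tree_b`, and `tree_c` are hypothetical, and the code is written for Python 3. The idea is to sort the children of every node so that swapping subtrees no longer changes the result.

```python
# Hypothetical representation (an assumption, not from the assignment):
# a leaf is the empty tuple (), and an internal node is a tuple of its
# child subtrees.

def canonical(tree):
    """Return a form of the tree that ignores the left/right order of children."""
    return tuple(sorted(canonical(child) for child in tree))

def isomorphic(t1, t2):
    """Two trees are isomorphic when their canonical forms are equal."""
    return canonical(t1) == canonical(t2)

leaf = ()
pair = (leaf, leaf)              # a node with two leaf children

tree_a = ((pair, leaf), pair)    # example tree with five leaves
tree_b = (pair, (leaf, pair))    # tree_a with the children swapped at two nodes
tree_c = ((pair, pair), leaf)    # five leaves, but a different structure

print(isomorphic(tree_a, tree_b))  # True
print(isomorphic(tree_a, tree_c))  # False
```

Sorting works here because Python compares nested tuples element by element, so two trees that differ only by swaps end up with identical canonical tuples.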
Background: Classification labels
In order to easily name trees, they are given a text description called a classification label. This allows a
text label instead of a visual picture to represent a tree structure. A good classification system has a
unique name for each tree. For this assignment, the classification system is simply a listing of the
number of nodes at each level. Trees A, B, C, and D above have the label (1242), indicating the first
level has one node, the second has two nodes, the third has four nodes, and the fourth has two. Since the
four trees are isomorphic, the same label represents all four trees. Ultimately, this classification system
is incomplete. While simple to understand, it does not provide unique names for each tree at the higher
levels. Trees E and F are not isomorphic, and thus are given separate labels (1244a and 1244b). This
system will suffice for now, but can get confusing as the size of the trees increases.
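As a rough illustration of the classification scheme, the sketch below counts the nodes on each level of a tree and joins the counts into a label. The tuple representation and the names `level_counts`, `label`, `siblings`, and `cousins` are assumptions made for this example (Python 3), not part of the assignment.

```python
# Hypothetical representation: a leaf is (), an internal node is a tuple
# of its child subtrees.

def level_counts(tree):
    """Count the nodes on each level, starting with the root level."""
    counts = []
    level = [tree]
    while level:
        counts.append(len(level))
        # The next level is made of every child of every node on this level.
        level = [child for node in level for child in node]
    return counts

def label(tree):
    """Join the level counts into a classification label such as 1242."""
    return "".join(str(n) for n in level_counts(tree))

leaf = ()
pair = (leaf, leaf)
siblings = ((pair, pair), (leaf, leaf))  # both deep pairs share a parent
cousins = ((pair, leaf), (pair, leaf))   # deep pairs sit under different parents

print(label(siblings), label(cousins))   # 1244 1244
```

Both example trees have six leaves and the same level counts even though their structures differ, which is the situation the a and b suffixes are meant to resolve. The suffix itself is not computed here and would still have to be assigned by hand.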
Background: Assumptions
Introductory study of phylogenies makes another assumption that greatly simplifies the problem. For
purposes of this assignment, the only non-leaf structures possible are nodes with exactly two offspring,
representing a point in time in which a population “splits” into two populations. This is called a binary
evolution tree. A node with a single offspring is called a redundant node and does not significantly add
to the tree structure. Nodes with three or more children can be isomorphically approximated, and thus
can be ignored at this stage with only a minimal loss of information. While all nodes in evolution trees
are unique species, at this point only the leaf nodes need to be considered.
Task 1 – Draw Given Trees (5 points)
Given below are thirteen evolution trees. This
represents the complete set of all non-isomorphic
evolution trees of up to six leaves (11 nodes). Use a
graphics package to recreate these trees. Give the
classification for the tree and label its leaves with
successive letters (a, b, c, etc.). Your design style
may differ from the trees below, but the structure must
be correct.
[Figure 2 – Tree structure replacements: a) Redundant node removed, b) Isomorphic approximation]
[Figure: the thirteen given trees, with leaves labeled a, b, c, and so on, classified as (12), (122), (1222), (124), (12222), (1224), (1242), (122222), (12224), (12242), (12422), (1244a), and (1244b)]
Task 2 – Generate Trees (5 points)
Use a graphics package to create the eleven isomorphically unique evolutionary trees of 7 leaves (13
nodes). The following labels are the classifications for the unique trees: 1222222, 122224, 122242,
122422, 12244a, 12244b, 124222, 12424, 12442a, 12442b, and 1246.
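The tree counts quoted in Tasks 1 and 2 can be checked with a short counting routine. The sketch below is an illustration rather than part of the assignment: the function name `count_binary_trees` is hypothetical, the code is written for Python 3, and it uses the standard recurrence for unordered binary trees (the Wedderburn-Etherington numbers), in which a tree with n leaves is an unordered pair of smaller trees whose leaf counts add up to n.

```python
def count_binary_trees(max_leaves):
    """Return a list t where t[n] is the number of non-isomorphic binary
    evolution trees with n leaves (every internal node has two children)."""
    t = [0, 1] + [0] * (max_leaves - 1)
    for n in range(2, max_leaves + 1):
        total = 0
        # Split the n leaves into a smaller subtree (i) and a larger one (n - i).
        for i in range(1, (n - 1) // 2 + 1):
            total += t[i] * t[n - i]
        # When n is even the two subtrees can be the same size;
        # count unordered pairs, allowing both halves to be the same tree.
        if n % 2 == 0:
            half = t[n // 2]
            total += half * (half + 1) // 2
        t[n] = total
    return t

t = count_binary_trees(7)
print(sum(t[2:7]))  # 13 trees with two to six leaves
print(t[7])         # 11 trees with seven leaves
```

The same function can be called with a larger argument to see how quickly the number of trees grows for the extra-credit tasks.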
Grading:
You will be graded on the following criteria:
Accuracy
Correctly drawing the evolutionary trees, attention to detail.
Creativity
Style, appearance, and consistency of the trees
Extra credit will be given for including the following:
• Create the 8-leaf (15 node) trees, which include: 12222222, 1222224, 1222242, 1222422, 1224222, 1242222, 124224,
124242, 124422a, 124422b, 12444aa, 12444ab, 12444ba, 12444bb, 12462a, 12462b, 1248.
• Create the 9-leaf (17 node) non-isomorphic trees and classifications.
• Create the trees without the redundant node and isomorphic approximation assumptions, allowing
for any number of offspring from a node. This will exponentially increase the number of possible
trees.
Graph Diagram Drawing Guidelines
Creating well-drawn graphs is an art form in itself. A diagram is a visual representation, and care should
be taken to make sure diagrams are organized, readable, and easily understood. When using a graphics
package to draw graphs, it is imperative that students use a consistent and visually pleasing style.
Sloppy graph diagrams stick out like a sore thumb and can completely ruin a professional presentation.
Follow the guidelines below when drawing your graphs. See the figures for examples that follow and
violate these guidelines.
Vertices
All vertex markers should be the same size and shape. If necessary, different
markers can represent types of vertices.
Labels
It should be clear which vertex the label references. Use a reasonable location
scheme for the labels. Vertex labels should not cover any vertex, edge, or other
label. A logical labeling sequence may help in understanding the graph structure.
Edges
All edges should be straight lines of the same thickness and connect to the center
of the vertex markers. Edge highlighting can be done with thicker or grayed lines.
Planar graphs should be drawn without any crossing edges. Curved edges are
generally avoided, but may be used for artistic effect.
Graph
The graph shape should be pleasing to the eye. Use a regular polygon or other
meaningful shape whenever possible. Physical distances between vertices should
be relatively consistent. Use the alignment and distribution tool within the
graphics package.
[Figures: example graph diagrams, with vertices labeled a through h, that follow and violate these guidelines]
Assignment 3 – Evolution Models II (Incremental k-Leaf Root)
Dean Zeller
CS10051
Spring, 2006
Due: Wednesday, March 22nd by 9:00 pm
10 points
Objective
The student will build on assignment 2 to create diagrams of the incremental k-leaf root
evolution model.
Level
1 – use of a graphics package and/or word processor required
Readings
Read, R.C., and R.J. Wilson, An Atlas of Graphs, Oxford Science Publications, 1999.
Background: k-leaf root, cliques
As an alternative to phylogenies, evolution can be modeled by a series of incremental graphs called k-leaf
roots that contain visual information indicating the distances between species. The tree leaves serve as
the graph vertices. An edge between two vertices indicates that the distance between the two species in
the corresponding tree is no more than a specified k-value. A clique is defined as a subset of vertices such
that all members are connected to all other members within the clique. An incremental k-leaf root model is
essentially a series of (possibly overlapping) cliques. In the diagrams below, the corresponding cliques
are labeled below the graphs.
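The construction of a k-leaf root from a tree can be sketched as follows. This is a minimal illustration, not the author's method: it assumes that distance means the number of tree edges on the path between two leaves, and the names `leaf_distances`, `k_leaf_root_edges`, and `tree_122`, along with the internal node names `r` and `n`, are hypothetical. The code is written for Python 3.

```python
from collections import deque

def leaf_distances(tree_edges):
    """Return {(u, v): distance} for every pair of leaves in the tree."""
    adjacent = {}
    for u, v in tree_edges:
        adjacent.setdefault(u, []).append(v)
        adjacent.setdefault(v, []).append(u)
    # A leaf is a node with exactly one neighbor.
    leaves = sorted(n for n in adjacent if len(adjacent[n]) == 1)
    distances = {}
    for start in leaves:
        # Breadth-first search gives the edge count from this leaf to every node.
        seen = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adjacent[node]:
                if nxt not in seen:
                    seen[nxt] = seen[node] + 1
                    queue.append(nxt)
        for other in leaves:
            if start < other:
                distances[(start, other)] = seen[other]
    return distances

def k_leaf_root_edges(tree_edges, k):
    """Connect two leaves when their tree distance is at most k."""
    return sorted(pair for pair, d in leaf_distances(tree_edges).items() if d <= k)

# The (122) phylogeny: root r has leaf c and internal node n; n has leaves a and b.
tree_122 = [("r", "n"), ("r", "c"), ("n", "a"), ("n", "b")]
print(k_leaf_root_edges(tree_122, 2))  # [('a', 'b')]
print(k_leaf_root_edges(tree_122, 3))  # [('a', 'b'), ('a', 'c'), ('b', 'c')]
```

Under this distance assumption the (122) tree gives the single edge ab at k=2 and becomes fully connected at k=3.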
Instructions
1. Use a word processor to create a table similar to the example below.
a. Set your document Page Setup to Landscape to give more space widthwise.
b. The first cell in each row should contain the phylogenies from assignment 2.
c. The column header should contain increasing k-values from two (2) up to the number of
leaves in the tree.
2. Within each row…
a. Draw the k-leaf root for increasing k-values. Start at k=2 and end when the graph is fully
connected. Draw a left arrow (←) for graphs that show no change from the previous k-value.
b. Indicate the cliques created within each distance graph.
3. Create a phylogeny with at least 10 leaves and the corresponding incremental k-leaf root
evolution model.
4. Review the graphs drawn so far. Indicate which distance graphs you think contain the most
useful information for geneticists.
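Step 2b asks for the cliques within each distance graph. For checking hand-drawn answers, the sketch below lists the maximal cliques of a small graph. It swaps in the basic Bron-Kerbosch search, which the assignment does not mention, and the function name `maximal_cliques` is hypothetical. It is written for Python 3 and assumes single-character vertex names; isolated vertices are left out of the result.

```python
def maximal_cliques(vertices, edges):
    """List the maximal cliques (two or more vertices) of a graph using
    the basic Bron-Kerbosch search."""
    neighbors = {v: set() for v in vertices}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    found = []

    def expand(clique, candidates, excluded):
        # With nothing left to add or rule out, the clique is maximal.
        if not candidates and not excluded:
            if len(clique) > 1:
                found.append("".join(sorted(clique)))
            return
        for v in sorted(candidates):
            expand(clique | {v}, candidates & neighbors[v], excluded & neighbors[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(vertices), set())
    return sorted(found)

# Distance graph for a four-leaf tree in which a, b, c are mutually close
# and d is close only to c.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
print(maximal_cliques("abcd", edges))  # ['abc', 'cd']
```

The output lists overlapping cliques separately, which matches the way the cliques are written beneath each graph in the table.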
Grading
You will be graded on the following criteria:
Accuracy
Correctly drawing the evolutionary trees.
Creativity
Style, appearance, and consistency of the graphs
Extra credit will be given for any of the following:
• Create more complex phylogenies for the report.
[Example table: one row per phylogeny, covering (12) with 2 leaves (3 vertices), (122) with 3 leaves (5 vertices), (1222) and (124) with 4 leaves (7 vertices), and (12222) and (1224) with 5 leaves (9 vertices). The columns are k=2 through k=5, and each cell shows the k-leaf root graph with its cliques listed beneath, for example ab at k=2 and abc at k=3 for (122).]
Assignment 4 – DNA Pattern Matching Statistics
Dean Zeller
CS10061
Spring, 2007
Due: _________________
10 points
Objective
The student will create a Python program to calculate pattern matching statistics on DNA
sequences.
Level
2 – understanding and use of the Python programming language
Background: Bioinformatics
DNA can be represented digitally by long strings of A, C, G, and T. In this assignment, you will
develop a program to perform statistics on a given DNA sequence.
Definitions
Sequence
A long string of characters.
DNA Sequence
A long string of A’s, C’s, G’s, and T’s representing a DNA strand.
Pattern
A short string of characters to be searched for within a sequence.
Concepts Introduced
Strings as character arrays
# can access each character in a string individually
name = "Dean Zeller"
print name[0], name[1], name[2], name[3]
String length
# len(string) returns the length of a string in characters
print len(name)
for i in range(len(name)):
    print name[i],
File input
# input from file instead of from the user
inputfile = open("input.txt", "r")
contents = inputfile.read() # contents contains entire file
inputfile.close()
File output
# output to file instead of to the IDLE window
outputfile = open("output.txt", "w")
outputfile.write("This is my output. 4-3-2-1-Wheee! ")
…
# when finished writing, close the file
outputfile.close()
While loops
# similar to if, but continues looping until condition is false
response = raw_input("Would you like to do something? ")
while response.upper()=="Y":
    print "Okay, do something! "
    response = raw_input("Would you like to do it again? ")
Current Programs
In DNAstats.py, user input consists of a filename containing a sequence of DNA. The output is a report
of statistics for short patterns (i.e. number of A’s, number of C’s, etc.). It then allows the user to search
for patterns of any length within the given sequence. RandomDNA.py is a program to generate
synthetic DNA randomly for testing purposes. Read, understand, and test these programs before starting
on the tasks.
Tasks
DNAstats.py contains the framework for a program to statistically analyze DNA sequences.
Implement at least three of the tasks described below. More can be completed for extra credit.
1. Implement all two-character patterns (16) in the statistics report.
2. Implement all three-character patterns (64) in the statistics report.
3. Put the entire main program into a while loop to allow the user to run statistics on multiple files.
4. Write the statistics report and patterns searched to an output file. Prompt the user for the name of the output file.
5. Use the program to find long patterns that occur more than once, short patterns that do not occur, and relationships
between the patterns.
6. Find actual DNA sequences on the web and run them through the program. Research some meaningful patterns and
look for them in some DNA sequences.
7. Use the program to create the data for a table similar to table 1. You may use Microsoft Excel to enter the values
and calculate the percentages.
8. Repeat all analyses on the reverse of the string. To do this, create a new string that is the characters of the old string
in reverse order and send through the analysis methods.
9. Create a loop to cycle through all possible patterns. This is not an easy task, but if someone is up to a challenge, see
me for help.
10. Modify the randomDNA.py program to create random or weighted sequences. Compare your analysis with actual
DNA files.
11. Own idea: write up a task idea of your own and implement within the program. Get your idea approved first by
your instructor.
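Tasks 1, 2, and 9 all come down to cycling through every possible pattern of a given length. One way to do that is sketched below. It is an illustration only, written for Python 3 rather than the Python 2 used in DNAstats.py, and the function name `count_patterns` is hypothetical. It uses `itertools.product` to generate the patterns and a dictionary to hold one counter per pattern.

```python
from itertools import product

def count_patterns(seq, k):
    """Count every length-k pattern over A, C, G, T, including overlaps."""
    # One dictionary entry per possible pattern: 4**k entries in total.
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:      # skip windows containing non-ACGT characters
            counts[window] += 1
    return counts

counts = count_patterns("ATAGCCTGTGA", 2)
print(len(counts))    # 16 possible two-character patterns
print(counts["TG"])   # 2
print(counts["AA"])   # 0
```

For task 8, the same function can be applied to the reversed sequence, which Python produces with the slice `seq[::-1]`.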
Documentation
Your instructor originally wrote this code, as noted in the documentation. Create documentation blocks
for any new functions you create, and list yourself as the author.
Turning in your assignment
1. Print a copy of your DNAstats program. If you made significant modifications to the
randomDNA program, print that as well.
2. Print a test run of your program on the given input files and others that you create.
3. Write a report briefly describing the tasks completed.
4. Informally demonstrate the tasks completed to the class.
Grading
You will be graded on the following criteria:
Quantity
Variety of tasks correctly implemented
Readability
Documentation indicating the lines of code created and modified
Testing
Testing your program on all input files
Creativity
Use new methods to solve the problem
Extra Credit
Extra points will be given for including the following features:
Extra quantity
Implement more than three tasks
Report/Analysis
Include a formal report and spreadsheet of your analysis
Table 1: DNA Statistics
(Rows give the first character of each pattern; column groups give the second character.)

              A                 C                 G                 T
       Pat      #    %   Pat      #    %   Pat      #    %   Pat      #    %   total  avg #  avg %
A      AA   21756  35%   AC   10882  26%   AG    9815  28%   AT   20100  31%   62553  15638    30%
C      CA   13824  22%   CC    9404  22%   CG    5999  17%   CT   13304  20%   42531  10633    20%
G      GA    9802  16%   GC    9117  21%   GG    6102  18%   GT    9423  14%   34444   8611    17%
T      TA   17170  27%   TC   13128  31%   TG   12528  36%   TT   22839  35%   65665  16416    32%
total       62552             42531             34444             65666
avg #       15638             10633              8611             16417
avg %         25%               25%               25%               25%
DNAstats.py
########################################################################
#
#  DNA Statistics 1.00
#  Written by: Dean Zeller
#
#  This program was written for demonstration purposes for CS10061
#  (Introduction to Programming) at Kent State University.
#  (C) 2007 Dean Zeller
#
#  This program performs pattern matching statistics on DNA
#  sequences.
#
########################################################################

########################################################################
#
#  dnaTools
#  This object contains various methods to perform pattern-matching
#  statistics on a given DNA sequence.
#
########################################################################
class dnaTools(object):

    ####################################################################
    #  printTitle
    #  Print the title centered in a space width characters wide
    ####################################################################
    def printTitle(self, title, width):
        print "+" + "-"*(width-2) + "+"
        print "|" + title.center(width-2) + "|"
        print "+" + "-"*(width-2) + "+"

    ####################################################################
    #  removeErrors
    #  Return DNA sequence with non-ACTG characters removed.
    ####################################################################
    def removeErrors(self,seq):
        newSeq = ""
        for i in range(len(seq)):
            if seq[i]=="A" or seq[i]=="C" or seq[i]=="G" or seq[i]=="T":
                newSeq = newSeq + seq[i]
        return newSeq
    ####################################################################
    #  stats
    #  Perform basic statistics on a given DNA sequence
    ####################################################################
    def stats(self, seq):
        numA=0
        numC=0
        numG=0
        numT=0
        numX=0 # number of errors
        numAT=0
        numCG=0
        numCAT=0
        numACT=0
        for i in range(0,len(seq)):
            # 1-character patterns
            if seq[i]=='A':
                numA = numA+1
            elif seq[i]=='C':
                numC = numC+1
            elif seq[i]=='G':
                numG = numG+1
            elif seq[i]=='T':
                numT = numT+1
            else:
                numX = numX+1
        for i in range(1,len(seq)):
            # 2-character patterns
            if seq[i-1]=='A' and seq[i]=='T':
                numAT = numAT+1
            elif seq[i-1]=='C' and seq[i]=='G':
                numCG = numCG+1
        for i in range(2,len(seq)):
            # 3-character patterns
            if seq[i-2]=='C' and seq[i-1]=='A' and seq[i]=='T':
                numCAT = numCAT+1
            elif seq[i-2]=='A' and seq[i-1]=='C' and seq[i]=='T':
                numACT = numACT+1
        w=5
        self.printTitle("Statistics Report",80)
        if numX > 0:
            print "WARNING -- There were",numX,"errors in the sequence."
        # a trailing comma keeps the next print on the same output line
        print "A:".rjust(w), str(numA).rjust(w), ' '.rjust(w),
        print "C:".rjust(w), str(numC).rjust(w), ' '.rjust(w),
        print "G:".rjust(w), str(numG).rjust(w), ' '.rjust(w),
        print "T:".rjust(w), str(numT).rjust(w)
        print "AT:".rjust(w), str(numAT).rjust(w), ' '.rjust(w),
        print "CG:".rjust(w), str(numCG).rjust(w), ' '.rjust(w)
        print "CAT:".rjust(w), str(numCAT).rjust(w), ' '.rjust(w),
        print "ACT:".rjust(w), str(numACT).rjust(w), ' '.rjust(w)
        print
    ####################################################################
    #  matchSimplePattern
    #  Count the number of occurrences of pat (pattern) within
    #  seq (sequence) using a simple string-matching algorithm.
    ####################################################################
    def matchSimplePattern(self, pat, seq):
        numPat=0
        for i in range(len(seq)):
            if seq[i:i+len(pat)]==pat:
                numPat += 1
        print numPat
########################################################################
#  main program
#  Demonstrate the methods defined in the dnaTools object.
########################################################################

# get input file
inputFileName = raw_input("Enter DNA Sequence filename: ")
dnaFile = open(inputFileName,"r")
dnaSequence = dnaFile.read()
dnaFile.close()

# set up tools
D = dnaTools()

# run stats on original sequence
D.stats(dnaSequence)

# run stats on fixed sequence
dnaFixed = D.removeErrors(dnaSequence)
D.stats(dnaFixed)

# allow user to search for patterns
while True:
    pattern = raw_input("Enter search pattern (blank to exit): ")
    if pattern=="":
        break
    D.matchSimplePattern(pattern.upper(),dnaFixed)
print "Thanks, and have a nice day."
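One detail of matchSimplePattern is worth noting: because it tests every starting position, it counts overlapping occurrences. Python's built-in `str.count` counts only non-overlapping ones, so the two can disagree. The short sketch below shows the difference. It is written for Python 3, and `count_overlapping` is a hypothetical stand-alone function that follows the same sliding-window idea, not the method from the listing.

```python
def count_overlapping(pat, seq):
    """Sliding-window count: test the pattern at every starting position."""
    return sum(1 for i in range(len(seq) - len(pat) + 1)
               if seq[i:i + len(pat)] == pat)

print(count_overlapping("AA", "AAAA"))  # 3 (positions 0, 1, and 2)
print("AAAA".count("AA"))               # 2 (non-overlapping matches only)
```

Which count is appropriate depends on the question being asked. The statistics in this assignment use the overlapping count.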
randomDNA.py
import random

size = 10000

# Equal probability
sequence = ""
for i in range(size):
    r = random.randrange(1,5)
    if r==1:
        sequence += "A"
    elif r==2:
        sequence += "C"
    elif r==3:
        sequence += "G"
    else:
        sequence += "T"
print
print size,"character random DNA sequence:"
print sequence
outputfile = open("random.txt","w")
outputfile.write(sequence)
outputfile.close()

# Weighted probability
sequence=""
weightA = 10
weightC = 15
weightG = 5
weightT = 13
total = weightA + weightC + weightG + weightT
for i in range(size):
    r = random.randrange(0,total)
    if r < weightA:
        sequence += "A"
    elif r < weightA + weightC:
        sequence += "C"
    elif r < weightA + weightC + weightG:
        sequence += "G"
    else:
        sequence += "T"
print
print size,"character weighted DNA sequence:"
print "weightA =",weightA,weightA*1.0/total
print "weightC =",weightC,weightC*1.0/total
print "weightG =",weightG,weightG*1.0/total
print "weightT =",weightT,weightT*1.0/total
print sequence
outputfile = open("weighted.txt","w")
outputfile.write(sequence)
outputfile.close()
Assignment 5 – DNA Visualization
Dean Zeller
CS10061
Spring, 2007
Due: _________________
10 points
Objective
The student will write a Python program to implement the chaos-game representation of
DNA.
Level
2 – understanding and use of Python with the Tkinter graphics library
Readings
H. Joel Jeffrey (1990). “Chaos game representation of gene structure.” Nucleic Acids
Research, Vol 18, No. 8, pp 2163-2170.
Background: Chaos Game Representation (CGR)
It is difficult for humans to look at a bunch of numbers and find meaningful patterns. It is far more
effective to provide a visual representation of the statistics. Bar charts, pie charts, and line graphs are used to
represent data in a visual manner. The chaos game representation is just one of many DNA
visualization algorithms. Analyzing the data in this format can show some characteristics about the
DNA structure. DNA sequences can be found at the National Center for Biotechnology Information
website at http://www.ncbi.nlm.nih.gov. Interesting fractal patterns can be created from contrived
examples of DNA. For example, a random sequence will fill the square uniformly, but a random
sequence without any A’s will create the famous Sierpinski triangle fractal.
Input
The user must provide parameters for the drawing of the visualization. Collect these values from the
user before execution. You may create more parameters as needed.
Filename: the file containing the DNA sequence
Interval size: how often to pause execution
Dot size: the radius of the dot created at each point
Output
Follow the CGR algorithm to draw the appropriate dots for a given DNA sequence. Use the Interval
Size variable to pause the drawing process for the user.
Testing
Test your program on a wide range of actual DNA sequences, random sequences, and repeated patterns.
Compare the visualization to the report generated, and make note of any patterns found.
Tasks
Implement at least three of the following tasks using your chaos automata visualization:
1. Implement the chaos-automata algorithm using Python.
2. Create visualizations for actual DNA sequences for organisms.
3. Use the randomDNA.py program to generate DNA files with specific characteristics to create different artistic
patterns.
4. Create guidelines similar to the diagrams above to indicate the different quadrants.
5. Allow the user to specify the characters represented for each corner.
6. Change the color of the dot every so often. Or allow the user to change the color at every interval.
7. Own idea: write up a task idea of your own and implement within the program. Get your idea approved first by
your instructor.
Documentation
This assignment builds on the previous assignment. Create documentation blocks for any new functions
you create, and list yourself as the author.
Turning in your assignment
1. Print a copy of your CGR program. If you made significant modifications to the randomDNA
program, print that as well.
2. Print at least three interesting patterns generated by your program.
3. Write a report briefly describing the tasks completed.
Grading
You will be graded on the following criteria:
Quantity
Variety of tasks correctly implemented
Readability
Documentation indicating the lines of code created and modified
Testing
Testing your program on a variety of input files
Creativity
Use new methods to solve the problem
Extra credit will be given for including the following features:
Extra quantity
Implement more than three tasks
Report/Analysis
Include a formal report and spreadsheet of your analysis
CGR algorithm
This assignment combines concepts from the previous assignment on DNA sequencing and material
from the graphics assignments. This assignment will implement the CGR algorithm to visualize the
input DNA. The algorithm is actually quite simple, once the graphics procedures are understood.
Step 0: Create the DNA Square.
Based on parameters from the user, create a square within the canvas where all points will be
drawn. You will need to know the left, top, bottom, and right positions within the square.
Step 1: Initial Point
Create two variables, x and y, denoting the point to draw. The initial point should be the center
of the square. Note: you do not draw the initial point; it just serves as a starting point.
x = (left + right)/2
y = (top + bottom)/2
Step 2: Draw dots
For each character in the DNA sequence
a) Calculate the next point to draw, which is halfway between the current point and the
appropriate corner for the letter. (A: top-left, C: top-right, G: bottom-right, T: bottom-left)
b) Draw the current point
Pseudocode
The following is pseudocode for the CGR algorithm. Your job is to implement the algorithm using
Python.
x = (left + right)/2
y = (top + bottom)/2
for each index i in seq
    if seq[i]=='A'
        x = (x+left)/2
        y = (y+top)/2
    else if seq[i]=='C'
        x = (x+right)/2
        y = (y+top)/2
    else if seq[i]=='G'
        x = (x+right)/2
        y = (y+bottom)/2
    else if seq[i]=='T'
        x = (x+left)/2
        y = (y+bottom)/2
    drawDot(x,y)
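The pseudocode can be turned into a small, testable Python function by separating the point calculation from the drawing. The sketch below is one possible starting point, not a complete solution to the assignment: the function name `cgr_points` is hypothetical, skipping characters other than A, C, G, and T is an assumption, and the code is written for Python 3. It returns the list of points instead of drawing them.

```python
def cgr_points(seq, left, top, right, bottom):
    """Return the list of (x, y) points visited by the chaos game."""
    corners = {"A": (left, top), "C": (right, top),
               "G": (right, bottom), "T": (left, bottom)}
    # Start at the center of the square; this point itself is not drawn.
    x = (left + right) / 2.0
    y = (top + bottom) / 2.0
    points = []
    for ch in seq:
        if ch not in corners:
            continue  # assumption: ignore characters other than A, C, G, T
        cx, cy = corners[ch]
        # Move halfway from the current point toward the letter's corner.
        x = (x + cx) / 2.0
        y = (y + cy) / 2.0
        points.append((x, y))
    return points

print(cgr_points("AT", 0, 0, 100, 100))  # [(25.0, 25.0), (12.5, 62.5)]
```

Each returned point could then be drawn on a Tkinter canvas, for example with `create_oval(x - r, y - r, x + r, y + r)` for a dot of radius r.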
Interrupting Execution
The following Python code will interrupt a loop once every
interval iterations.
interval = 10
for i in range(1000):
    print i,
    if i%interval == 0:
        raw_input("press return to continue")
Example: Sequence: “ATAGCCTGTGA”
[Figure: twelve panels, each showing the square with corners A (top-left), C (top-right), G (bottom-right), and T (bottom-left) and the dots drawn so far: Initial Setup, +A, A+T, AT+A, ATA+G, ATAG+C, ATAGC+C, ATAGCC+T, ATAGCCT+G, ATAGCCTG+T, ATAGCCTGT+G, ATAGCCTGTG+A]
Analysis
Consider the diagrams below. The first contains additional guidelines for patterns of different sizes. The
final diagram is the chaos-automata result for the sequence ATAGCCTGTGA. Note the correspondence
between the final pattern and the pattern counts.
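The guideline grids follow a simple rule: the last character of a pattern selects the largest quadrant, the character before it selects a quadrant within that one, and so on. The sketch below computes the grid position of any pattern under that rule. The names `CORNER`, `grid_cell`, and `grid` are hypothetical, and the code is written for Python 3.

```python
from itertools import product

# (row, column) offset of each letter's quadrant: A top-left, C top-right,
# T bottom-left, G bottom-right.
CORNER = {"A": (0, 0), "C": (0, 1), "T": (1, 0), "G": (1, 1)}

def grid_cell(pattern):
    """Return the (row, col) of a pattern in the 2^k by 2^k guideline grid."""
    row = col = 0
    half = 2 ** len(pattern)
    # The last character picks the largest quadrant, so work backwards.
    for ch in reversed(pattern):
        half //= 2
        dr, dc = CORNER[ch]
        row += dr * half
        col += dc * half
    return row, col

def grid(k):
    """Build the full grid of length-k patterns."""
    size = 2 ** k
    cells = [[None] * size for _ in range(size)]
    for p in product("ACGT", repeat=k):
        pat = "".join(p)
        r, c = grid_cell(pat)
        cells[r][c] = pat
    return cells

for row in grid(2):
    print(" ".join(row))
# AA CA AC CC
# TA GA TC GC
# AT CT AG CG
# TT GT TG GG
```

A dot that lands in a given cell means the sequence contains that cell's pattern ending at the character just plotted, which is why the visualization and the statistics report can be compared directly.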
[Figure: three guideline grids dividing the square into 1-letter, 2-letter, and 3-letter regions,
followed by the same grids with the points for ATAGCCTGTGA plotted on them.]

1-letter regions:
    A  C
    T  G

2-letter regions:
    AA  CA  AC  CC
    TA  GA  TC  GC
    AT  CT  AG  CG
    TT  GT  TG  GG

3-letter regions:
    AAA  CAA  ACA  CCA  AAC  CAC  ACC  CCC
    TAA  GAA  TCA  GCA  TAC  GAC  TCC  GCC
    ATA  CTA  AGA  CGA  ATC  CTC  AGC  CGC
    TTA  GTA  TGA  GGA  TTC  GTC  TGC  GGC
    AAT  CAT  ACT  CCT  AAG  CAG  ACG  CCG
    TAT  GAT  TCT  GCT  TAG  GAG  TCG  GCG
    ATT  CTT  AGT  CGT  ATG  CTG  AGG  CGG
    TTT  GTT  TGT  GGT  TTG  GTG  TGG  GGG

Counts for ATAGCCTGTGA:
    1-letter:  A: 3   C: 2   T: 3   G: 3
    2-letter:  CC: 1   TA: 1   GA: 1   GC: 1   AT: 1   CT: 1   AG: 1   GT: 1   TG: 2
    3-letter:  GCC: 1   ATA: 1   AGC: 1   TGA: 1   CCT: 1   TAG: 1   CTG: 1   TGT: 1   GTG: 1
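The counts listed for the example sequence (A: 3, C: 2, T: 3, G: 3, and so on) can be checked by counting every run of k consecutive letters. The following is a minimal sketch; the helper name kmer_counts is an assumption introduced for this example.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every run of k consecutive letters in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


if __name__ == "__main__":
    sequence = "ATAGCCTGTGA"
    for k in (1, 2, 3):
        # Each count should match the number of dots in the matching region.
        print(k, dict(kmer_counts(sequence, k)))
```

Each count corresponds to the number of dots that land in the region with that label, since the most recent letters of the sequence decide which region a point falls in.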
Assignment 6 – Longest Common Subsequence
Dean Zeller
CS33001
Fall, 2006
Due: _________________
10 points
Objective
The student will write a C++ program to implement and test the dynamic programming
approach to the longest common subsequence problem.
Level
2 – understanding and use of the C++ programming language
Topics covered
This assignment is a review of your skills as a programmer using the string and two-dimensional array
data structures.
Background: Bioinformatics
DNA can be represented digitally by long strings of the letters A, C, G, and T. One of the simplest
tests of genetic similarity is to find the longest subsequence of nucleotides common to the DNA of
both species. This similarity value helps bioinformatics researchers determine a species' position on an
evolutionary tree.
Determining the longest common substring of two strings is a common problem in computer science,
and is particularly useful in web search engines. In bioinformatics, the longest common subsequence of
two strings is far more useful (and more difficult) to determine: unlike a substring, a subsequence need
not be contiguous, only in the same order. It is simple to write a brute-force program to solve the
problem. However, DNA strings can be millions of characters long, and a brute-force method could take
hours (or even years) to give results for real-world data. This assignment implements the dynamic
programming method, which solves the longest common subsequence problem efficiently in O(n^2)
time.
Program Requirements
Input
Prompt the user to enter two strings, representing the DNA sequence of two
species. Alternately, the strings can be read in from a file.
Error checking
Before executing the algorithm, check each character in both strings to ensure
only acceptable DNA letters are entered. Do not run the algorithm on input
with erroneous data. In the event of an error, indicate how many characters
are illegal, and recreate the string with the illegal letters highlighted (example:
AGCTSACT → agctSact).
Output
The program should output the dynamic subsequence table and the length of
the longest subsequence. The actual subsequence string can be printed for
extra credit.
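The error-checking requirement can be illustrated with a short sketch, written here in Python rather than C++ to keep it brief. The function name check_dna and the decision to accept lowercase letters as valid input are assumptions for this example; the assignment only specifies the highlighted output format.

```python
VALID_LETTERS = set("ACGT")


def check_dna(seq):
    """Return (illegal_count, highlighted) for a candidate DNA string.

    Legal letters are shown in lowercase and illegal characters in
    uppercase, following the example AGCTSACT -> agctSact.
    Assumption: lowercase a, c, g, t in the input count as legal.
    """
    illegal_count = sum(1 for ch in seq if ch.upper() not in VALID_LETTERS)
    highlighted = "".join(
        ch.lower() if ch.upper() in VALID_LETTERS else ch.upper()
        for ch in seq
    )
    return illegal_count, highlighted
```

A caller would run the LCS algorithm only when the returned count is zero for both strings, and otherwise print the count and the highlighted string.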
Program Architecture
Functions
Use functions to modularize your code. The following is a suggestion on how
to organize your functions.
functions.h
header file for functions
functions.cpp
implementation of the functions
main.cpp
main program implementation that calls the functions
Documentation
Each function must include a documentation block describing the input,
output, and method of solving the problem. Create documentation for the
main program describing the program overall. Use proper indentation and
self-descriptive variable names.
Grading
You will be graded on the following criteria:
Accuracy
Correctly implementing and using the LCS algorithm
Readability
Documentation, descriptive variable names, and indenting
Testing
Ensuring the code works for a variety of inputs
Creativity
Using new methods to solve the problem
Extra credit will be given for including the following features:
File input
Allow the user to enter a filename with the two input strings
Subsequence
Print the longest common subsequence in addition to the length
Brute-force
In addition to the dynamic programming method described in this assignment,
also implement a brute-force solution to the problem. Create a short report that
describes the program duration for a variety of input lengths.
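For comparison with the dynamic programming method, one possible brute-force approach is sketched below in Python (the assignment itself calls for C++). It tries every subsequence of the shorter string, longest first, and stops at the first one that is also a subsequence of the other string. The names is_subsequence and lcs_brute_force are assumptions for this sketch, and other brute-force strategies are equally valid.

```python
from itertools import combinations


def is_subsequence(candidate, text):
    """Return True if candidate's letters appear in text in the same order."""
    remaining = iter(text)
    # "ch in remaining" consumes the iterator up to the first match,
    # so later letters must be found after earlier ones.
    return all(ch in remaining for ch in candidate)


def lcs_brute_force(a, b):
    """Return one longest common subsequence by exhaustive search."""
    if len(a) > len(b):
        a, b = b, a  # enumerate subsequences of the shorter string
    for length in range(len(a), 0, -1):
        for positions in combinations(range(len(a)), length):
            candidate = "".join(a[i] for i in positions)
            if is_subsequence(candidate, b):
                return candidate
    return ""
```

Because a string of length n has 2^n subsequences, the running time grows exponentially with the shorter input, which is what the timing report should make visible.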
Turning in your assignment
1. Print all C++ and header files.
2. Print a test of your program with input sets given below.
Algorithm Pseudocode
Given below is the pseudocode to solve the problem. You are to implement the pseudocode in
object-oriented C++ functions.
Step 1: Initialize variables
    string s = ' ' + string1
    string t = ' ' + string2
    Twidth = s.length
    Theight = t.length
    int T[Twidth][Theight]
    for (i=0; i<Twidth; i++)
        T[i,0] = 0
    end-for
    for (j=0; j<Theight; j++)
        T[0,j] = 0
    end-for
Step 2: Run LCS algorithm
    for (j=1; j<Theight; j++)
        for (i=1; i<Twidth; i++)
            if (s[i] == t[j])
                // match in subsequence
                T[i,j] = T[i-1,j-1] + 1
            else
                // no match, use previous value
                T[i,j] = max (T[i-1,j], T[i,j-1])
            end-if
        end-for
    end-for
    // the last valid index in each dimension is one less than the size
    subsequenceLength = T[Twidth-1,Theight-1]
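The pseudocode can be prototyped quickly before writing the C++ version. The sketch below is in Python and follows the same table layout (a blank is prepended to each string so that row 0 and column 0 stay zero). The function names lcs_table and lcs_string, and the traceback used to recover the subsequence, are assumptions for this example rather than part of the assignment.

```python
def lcs_table(string1, string2):
    """Build the dynamic programming table T[i][j] from the pseudocode."""
    s = " " + string1
    t = " " + string2
    width, height = len(s), len(t)
    table = [[0] * height for _ in range(width)]  # row 0 and column 0 stay 0
    for j in range(1, height):
        for i in range(1, width):
            if s[i] == t[j]:
                # match in subsequence
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # no match, use the better of the two previous values
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


def lcs_string(string1, string2):
    """Recover one longest common subsequence by walking the table backward."""
    table = lcs_table(string1, string2)
    i, j = len(string1), len(string2)
    letters = []
    while i > 0 and j > 0:
        if string1[i - 1] == string2[j - 1]:
            letters.append(string1[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(letters))
```

The length of the longest common subsequence is the last entry, lcs_table(a, b)[-1][-1]. When several subsequences share that length, the traceback returns only one of them.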
Test Input Sets
Test your program on the following sets of input data.
Input Set #1
String1: ATCCGACAAC
String2: ATCGCATCTT
Output: 7 (ATCGCAC)
Input Set #2
String1: AACGTTCOGMA
String2: GGATACCASAT
Output: errors in String1 (aacgttcOgMa)
error in String2 (ggataccaSat)
Input Set #3
String1: AAAATTTT
String2: TAAATG
Output: 4 (AAAT)
Input Set #4
String1: TAGTAGTAGTAGTAGTAG
String2: CATCATCATCATCA
Output: 9 (ATATATATA)
Example of Method
Rule for T[i,j]:
if (string1[i] == string2[j])
T[i,j] = T[i-1,j-1] + 1
otherwise
T[i,j] = maximum (T[i-1,j], T[i,j-1])
string1 = "AAAATTTT"
string2 = "TAAATG"
          A  A  A  A  T  T  T  T
       0  0  0  0  0  0  0  0  0
    T  0  0  0  0  0  1  1  1  1
    A  0  1  1  1  1  1  1  1  1
    A  0  1  2  2  2  2  2  2  2
    A  0  1  2  3  3  3  3  3  3
    T  0  1  2  3  3  4  4  4  4
    G  0  1  2  3  3  4  4  4  4

string1 = "ATCCGACAAC"
string2 = "ATCGCATCTT"
          A  T  C  C  G  A  C  A  A  C
       0  0  0  0  0  0  0  0  0  0  0
    A  0  1  1  1  1  1  1  1  1  1  1
    T  0  1  2  2  2  2  2  2  2  2  2
    C  0  1  2  3  3  3  3  3  3  3  3
    G  0  1  2  3  3  4  4  4  4  4  4
    C  0  1  2  3  4  4  4  5  5  5  5
    A  0  1  2  3  4  4  5  5  6  6  6
    T  0  1  2  3  4  4  5  5  6  6  6
    C  0  1  2  3  4  4  5  6  6  6  7
    T  0  1  2  3  4  4  5  6  6  6  7
    T  0  1  2  3  4  4  5  6  6  6  7

In both tables, string1 runs across the top and string2 runs down the side. The bottom-right entry
is the length of the longest common subsequence (4 and 7, respectively).