Visualization of a simple DNA segment alignment model

Background

Biotechnology has brought us a number of very powerful tools to uncover many of life’s mysteries. These tools have been developed through the cooperative efforts and skills of molecular biologists, biochemists, mathematicians, physicists, engineers and computer scientists. Exciting problems can now be examined in many areas, such as bioinformatics (Human Genome Project), forensic science (CSI), medicine (AIDS), evolution (phylogeny), genetics, paleontology and anthropology (Neanderthal and modern human comparisons).

The double-stranded DNA molecule, itself often constructed of millions of nucleotides and thousands of codons, comprises the genome or genetic package of every organism on earth. Analysis of mutations or changes in the DNA structure through alignment programs can reveal evolutionary or phylogenetic relationships of organisms. The degree or number of shared ancestral characters (homologies) reflects the relative closeness of the relationships between species or even between individuals of the same species. Once the homology is hypothesized, often a difficult but critical step, the phylogenetic tree or trees can be generated based upon the degree of alignment of the homologous nucleotides. Difficulties often arise in separating homologies from homoplasies or analogous characters (e.g. evolutionary convergence, parallelism, or reversals). On a higher level, shared amino acids or proteins can also be inferred from segments of DNA to construct phylogenies.

Mutations

Genetic mutations, a major source of the genetic variation utilized by natural selection, can range from tiny single nucleotide changes or deletions called point mutations (the subject of this module) to macromutations of whole chromosomes changed by chromosomal additions, deletions or inversions. The frequency and probability of each type of mutation differ depending upon the relative scale.
This model will examine how the costs assigned to such mutations vary under different alignment paradigms.

Depictions of nucleic acids often show double-stranded diagrams, sometimes as DNA-DNA anti-parallel strands (i.e., in illustrations of replicating chromosomes http://www.elmhurst.edu/~chm/vchembook/images/582dnarepline.gif or in DNA hybridizations http://www.members.cox.net/amgough/FISH_olgio_hybridization- deep01_01_03.jpg ) where A binds to T and G to C, and the hydrogen bonds between single strands are visualized. In other cases diagrams may show RNA-DNA double strands http://www.biology.arizona.edu/molecular_bio/problem_sets/mol_genetics_of_eukaryote s/graphics/05ta.gif , as occur during transcription, the synthesis of RNA from a DNA template. In RNA, U substitutes for T, and all RNA nucleotides have a ribose backbone rather than the deoxyribose of DNA.

In the case of DNA sequence alignments, however, diagrams show where two strands of DNA code for the same nucleotide at the same position, and conserved bases are shown as a vertical line extending between the two strands. Remember that, in DNA alignments, we are exploring the relationship of one DNA strand’s order of bases to that of another strand, with the intent of determining the level of similarity between them and inferring the origin of that similarity.

The model

1. Alignments as Sequences of Deletions, Insertions, and Substitutions

Mathematically, the pairwise DNA-sequence alignment problem begins with two sequences S1 and S2 composed from the four characters A, C, G, and T. The following sequences present an example:

S1: AGTGTTCCAG
S2: AATCGTTACAG

An alignment of two sequences is a record of “edits” (or lack thereof) in the bases of S1 that leads to the sequence S2.
Allowable edit operations are:
Deletion (D): delete a base
Insertion (I): insert a new base
Substitution (S): replace a base with another base

Notice that a substitution may not represent an actual “edit”, since it does not rule out that a base is “replaced” by exactly the same base. Biologically, this corresponds to a match.

Example 1. Below are some possible alignments of the sequences S1 and S2, given as tables of corresponding pairs together with the list of operations L describing the alignment. A hyphen in S1 represents an insertion, while a hyphen in S2 corresponds to a deletion. The sequence L represents the transformation of S1 into S2 by enacting the edits listed in L on S1 from left to right.

a)
S1: A-GTGTTCCAG
S2: AATCGTTACAG
L:  SISSSSSSSSS
(1 insertion, 10 substitutions, 0 deletions)

b)
S1: -AGT-GTTCCAG
S2: AA-TCGTTACAG
L:  ISDSISSSSSSS
(2 insertions, 9 substitutions, 1 deletion)

c)
S1: ---------AGTGTTCCAG
S2: AATCGTTACAG--------
L:  IIIIIIIIISSDDDDDDDD
(9 insertions, 2 substitutions, 8 deletions) □

Notice that not every sequence of edits represents an alignment. A sequence of edits is an alignment only when the resulting table of corresponding pairs has two rows of equal length. For instance, L: IISSSSSSSSSS cannot represent an alignment for S1 and S2, since two insertions at the beginning of S1 result in a string of 12 characters that cannot be matched to a string of 11 characters.

Let #I, #D, and #S denote respectively the number of insertions, deletions, and substitutions in the sequence L. Every deletion or substitution uses up one base of S1 (10 bases), and every insertion or substitution uses up one base of S2 (11 bases). This observation leads to the following:

A sequence L represents an alignment for S1 and S2 only if #D + #S = 10 and #I + #S = 11.

For sequences S1 and S2 of arbitrary lengths n and m respectively, this condition generalizes to the following:

A sequence L represents an alignment for S1 and S2 only if #D + #S = n and #I + #S = m.

Exercise 1. Are all three alignments in Example 1 of equal biological importance? Explain why or why not.

Exercise 2.
Verify that the conditions #D + #S = 10 and #I + #S = 11 are satisfied for all of the alignments presented in Example 1.

Exercise 3. Let S1 and S2 be the same sequences as above. Does each of the following sequences L represent an alignment? Explain why or why not. In case L is an alignment, give the table of corresponding pairs as in Example 1.
a) L: SSSISSSSSSS
b) L: ISSSSSSSSSS
c) L: SSSDSSSSSSII
d) L: SSIDSSSIDISS

Visualizing DNA strands

2. Alignments as Paths on a Graph

Once an alignment is understood as a sequence L formed of the characters I, D, and S, a more visual way to represent that alignment is to view it as a path on the alignment graph of the sequences S1 and S2. If S1 has length n and S2 has length m, the alignment graph is given as a triangular lattice of height n and width m, such as that in Figure 1. The DNA bases in S1 label the rows of the lattice while the DNA bases in S2 label the columns. The alignment graph of our example sequences S1 and S2 is presented in Figure 1A. It displays sequence 1 on the y-axis and sequence 2 on the x-axis, placing the nucleotides in the order of their appearance on the strands being compared.

The alignment graph is a directed graph. In a directed graph, access from one vertex to an adjacent vertex via a connecting edge is allowed only in the specified direction. For the alignment graphs just described, the possible directions are: one horizontal step from left to right, one vertical step from top to bottom, and one diagonal step from the upper left to the lower right corner.

A horizontal directed edge (I) corresponds to a situation where a base in one DNA strand has been inserted in a position not present in the second strand. The practical outcome is that one DNA strand is longer than its comparator. A vertical directed edge (D) corresponds to a situation where a base in one DNA strand has been deleted in a position present in the comparison strand. A deletion yields a shorter strand of DNA.
A diagonal directed edge (S) corresponds to a substitution, which includes the case of a direct match between bases at corresponding positions in the comparator strands. These possibilities are depicted in Figure 1B.

A path between two vertices on the alignment graph is a walk along the edges of the graph in the permissible directions that connects these two vertices. In what follows we will only be interested in paths on the alignment graph from the upper left vertex to the lower right vertex of the graph. In Figure 1A those vertices are marked. Many such paths exist, of course. Figure 2 depicts three paths that connect these vertices.

Consider now Figure 1B again. If, for any path on the alignment graph, a record is made of the direction followed at each step, using the labels D, I, and S as depicted in the figure, each path on the alignment graph can be viewed as a sequence formed from these three characters. The black path in Figure 2 then corresponds to the sequence L: SSIDDSSSDIISSI and the red path corresponds to L: DDDISSSSSSIIIID. Notice that for these paths #D + #S = 10 and #I + #S = 11. The converse is also true: any sequence formed of the characters D, I, and S for which #D + #S = 10 and #I + #S = 11 defines a path between the upper left and lower right vertices of the alignment graph. For instance, the blue path in Figure 2 represents the alignment of S1 and S2 defined in Example 1-b).

The following should now be clear: An alignment of two sequences S1 and S2 can be represented by a path from the upper left to the lower right corner on the corresponding alignment graph.

Exercise 4. In Figure 2, trace the paths that correspond to the alignments from Example 1-a) and 1-c).

Directed edges provide a pictorial means of tracing the shortest path to an alignment solution, with each step dictated by an algorithm constructed to yield the most parsimonious alignment. An alignment graph visually conveys the task an algorithm completes.

3.
Alignments as Weighted Paths on a Graph

Although every path on the alignment graph between the upper left and lower right corners corresponds to an alignment, there are many paths that correspond to biologically meaningless alignments. For instance, a path that goes straight down to the bottom of the graph and then all the way to the right can be used to align any two sequences, even when there is not a single matching base (compare the alignment from Example 1-c), which consists almost entirely of insertions and deletions). On the other hand, alignments corresponding to paths with more diagonal steps are more likely to produce biologically meaningful matches.

The goal is to find alignments that have high biological likelihood. One way to approach this problem is to assign a weight (penalty) to each of the edges in the alignment graph, with exact matches incurring no penalty. Insertions, deletions and actual substitutions of bases, on the other hand, are penalized. The actual values of the penalties are based on biological considerations. In this project, we use the following very simple penalty scheme to illustrate the concept.

Exact matches are not penalized.
There is a constant penalty p for a substitution of one base with another (mutation).
There is a constant penalty x for an insertion.
There is a constant penalty y for a deletion.

A good way to display these on the alignment graph is to “color code” its diagonal edges. Figure 2B shows edges that correspond to exact matches in green, and those corresponding to mutations in red. Including a green edge in the path is not penalized, while including a red one incurs the penalty p. Any horizontal edge included in the path incurs a penalty x and any vertical edge included in the path incurs a penalty y. The alignment graph with the weights described above defines the weighted alignment graph for the sequences S1 and S2.
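The penalty scheme above can be illustrated with a short Python sketch that adds up the weight of a path given as a sequence L of the characters S, I, and D. The function names `is_alignment` and `path_weight` and the default penalty values p = 1, x = 2, y = 2 are illustrative assumptions and are not part of the module; the sketch also assumes, as in Example 1, that an insertion uses up a base of S2 and a deletion uses up a base of S1.

```python
S1 = "AGTGTTCCAG"   # example sequence S1 from the text
S2 = "AATCGTTACAG"  # example sequence S2 from the text


def is_alignment(s1, s2, edits):
    """Check the counting condition #D + #S = len(s1) and #I + #S = len(s2)."""
    if set(edits) - set("SID"):
        return False  # only the characters S, I, and D are allowed
    n_s, n_i, n_d = edits.count("S"), edits.count("I"), edits.count("D")
    return n_d + n_s == len(s1) and n_i + n_s == len(s2)


def path_weight(s1, s2, edits, p=1, x=2, y=2):
    """Total penalty of the path described by `edits`.

    p, x, y are the substitution, insertion, and deletion penalties;
    the default values are arbitrary and chosen only for illustration.
    """
    if not is_alignment(s1, s2, edits):
        raise ValueError("edit sequence does not describe an alignment")
    i = j = 0  # number of bases of s1 and s2 used so far
    total = 0
    for op in edits:
        if op == "S":    # diagonal edge: free for a match, p for a mutation
            if s1[i] != s2[j]:
                total += p
            i += 1
            j += 1
        elif op == "I":  # horizontal edge: uses one base of s2
            total += x
            j += 1
        else:            # "D", vertical edge: uses one base of s1
            total += y
            i += 1
    return total


# The three alignments of Example 1
print(path_weight(S1, S2, "SISSSSSSSSS"))          # 3 mutations + 1 insertion
print(path_weight(S1, S2, "ISDSISSSSSSS"))         # 1 mutation + 2 insertions + 1 deletion
print(path_weight(S1, S2, "IIIIIIIIISSDDDDDDDD"))  # 9 insertions + 8 deletions
```

With these assumed penalties the three alignments of Example 1 receive the weights 5, 7, and 34, so alignment a) is preferred among the three. Different choices of p, x, and y can change that ranking.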
If you compare an alignment graph to the optimal alignment solution dictated by an algorithm, you will see that they differ in the placement of sequence 1 bases in comparison to sequence 2 [scan in optimal alignment solution and alignment graph from Pachter and Sturmfels, p. 51-52]. The directed edges of the alignment graph are analogous to the possible outcomes at each position: substitution or direct match (S), insertion in one sequence (I), deletion in one sequence (D). The optimal alignment reports the solution that best fits what the algorithm dictates, listing sequence 1 and sequence 2 in parallel and symbolizing direct matches as vertical lines between bases; insertions and deletions are depicted as *. Optimal alignments represent hypotheses about how evolutionary processes may have modified one sequence with respect to another.

When an investigator tasks an algorithm with providing an optimal alignment solution, the gene sequences being compared regularly contain hundreds or thousands of bases. While optimal solutions can be reported readily, alignment graphs of the same sequences would be unreadable. For instance, the comparison of a sequence with 100 bases to another with 100 bases would require a graph space occupying a 100 x 100 grid. Our smaller sequence examples, limited to no more than 15 bases per sequence, enable you to visualize the strategy behind how an optimal alignment solution is produced by an alignment algorithm.
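To make the idea of an optimal alignment solution concrete, the following Python sketch finds a minimum-weight path on the weighted alignment graph and prints the two sequences in parallel. It uses the standard dynamic programming recurrence for this kind of grid graph, which is an assumption here since the module does not name a specific algorithm. The function names, the penalty values p = 1, x = 2, y = 2, and the use of a hyphen (as in Example 1) rather than * for gaps are likewise illustrative choices.

```python
def optimal_alignment(s1, s2, p=1, x=2, y=2):
    """Return (minimum weight, edit sequence L) for aligning s1 with s2."""
    n, m = len(s1), len(s2)
    # cost[i][j] = minimum weight of a path from the upper left vertex to (i, j)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * y              # only vertical steps (deletions)
    for j in range(1, m + 1):
        cost[0][j] = j * x              # only horizontal steps (insertions)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (0 if s1[i - 1] == s2[j - 1] else p)
            cost[i][j] = min(diag, cost[i - 1][j] + y, cost[i][j - 1] + x)

    # Walk back from the lower right vertex to recover one optimal path.
    edits = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (
            0 if s1[i - 1] == s2[j - 1] else p
        ):
            edits.append("S")
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + y:
            edits.append("D")
            i -= 1
        else:
            edits.append("I")
            j -= 1
    return cost[n][m], "".join(reversed(edits))


def render(s1, s2, edits):
    """Show an alignment as two parallel rows with '|' marking exact matches."""
    top, middle, bottom = [], [], []
    i = j = 0
    for op in edits:
        if op == "S":
            top.append(s1[i])
            bottom.append(s2[j])
            middle.append("|" if s1[i] == s2[j] else " ")
            i, j = i + 1, j + 1
        elif op == "I":                 # gap in s1
            top.append("-")
            bottom.append(s2[j])
            middle.append(" ")
            j += 1
        else:                           # "D": gap in s2
            top.append(s1[i])
            bottom.append("-")
            middle.append(" ")
            i += 1
    return "\n".join("".join(row) for row in (top, middle, bottom))


weight, edits = optimal_alignment("AGTGTTCCAG", "AATCGTTACAG")
print(weight, edits)
print(render("AGTGTTCCAG", "AATCGTTACAG", edits))
```

The table `cost` has one entry per vertex of the alignment graph, which is why the work grows with the product of the two sequence lengths: a 100 x 100 comparison is easy to compute even though the corresponding graph would be unreadable as a picture.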
Alignment tools ( http://www.seas.gwu.edu/~simhaweb/cs177/alignmentlecture/index.html )

Pair name (label for “select sequences”): Human Pax6 vs mosquito Pax6
Sequence 1: Human Pax6 (NM_000280), CACAGCGGGGCCCGG
Sequence 2: Mosquito XM_311087, CACTCGGGCGCCCGG

Pair name: Human calneuron vs chimp calneuron
Sequence 1: Human calneuron 1 (NM_001017440), GGACTTAGATGGGAG
Sequence 2: Chimp calneuron 1 (XM_001142457), GGACTTGGATGAGAG

Pair name: Human Sry vs mouse Sry
Genes: Human Sry (NM_003140), Mouse Sry (NM_011564)
Sequence listed with this pair: TACAGAGATCAGCAA

Pair name: Human huntingtin vs dog huntingtin
Sequence 1: Human huntingtin (NM_002111), CAGTTTCTACACCCT
Sequence 2: Dog huntingtin (XM_536221), CAGTTTCTATGGCCT

Pair name: ELVIS vs LIVES
Genes: Human synapsin II (BC051307), Human SLITROBO Rho GTPase activating protein 1 (BC053903)

Sequences listed at the end of the table: CTCATTGTGGAAAGC, CTCAGAGATCAGCAA, GAACTAGTCATCAGC