Visualization of a simple DNA segment alignment model

Background
Biotechnology has brought us a number of very powerful tools to uncover many of life’s mysteries. These
tools have been developed through the cooperative efforts and skills of molecular biologists, biochemists,
mathematicians, physicists, engineers and computer scientists. Exciting problems can now be examined in
many areas such as: bioinformatics (Human Genome Project), forensic science (CSI), medicine (AIDS),
evolution (phylogeny), genetics, paleontology and anthropology (Neanderthal and modern human
comparisons). The double-stranded DNA molecule, itself often constructed of millions of nucleotides and
thousands of codons, comprises the genome or genetic package of every organism on Earth.
Analysis of mutations or changes in the DNA structure through alignment programs can reveal
evolutionary or phylogenetic relationships of organisms. The degree or number of the shared ancestral
characters (homologies) reflects the relative closeness of the relationships of species of organisms or even
between individuals of the same species. Once the homology is hypothesized, often a difficult but critical
step, the phylogenetic tree or trees can be generated based upon the degree of alignment of the homologous
nucleotides. Difficulties often arise in separating homologies from homoplasies, or analogous
characters (e.g. evolutionary convergence, parallelism or reversals). On a higher level, shared amino acids
or proteins can also be inferred from segments of DNA and used to construct phylogenies.
Mutations
Genetic mutations, a major source of genetic variation utilized by natural selection, can range from tiny
single nucleotide changes or deletions called point mutations (the subject of this module) to
macromutations of whole chromosomes changed by chromosomal additions, deletions or inversions. The
frequency and probability of each type of mutation differ depending upon its relative scale. This
model will examine how the costs assigned to such mutations vary across alignment paradigms.
Depictions of nucleic acids often show double-stranded diagrams, sometimes as DNA-DNA
anti-parallel strands (e.g., in illustrations of replicating chromosomes,
http://www.elmhurst.edu/~chm/vchembook/images/582dnarepline.gif , or in DNA
hybridizations,
http://www.members.cox.net/amgough/FISH_olgio_hybridization-deep01_01_03.jpg ),
where A binds to T and G to C, and the hydrogen bonds between
single strands are visualized. In other cases diagrams may show RNA-DNA double
strands,
http://www.biology.arizona.edu/molecular_bio/problem_sets/mol_genetics_of_eukaryotes/graphics/05ta.gif ,
as occur during transcription, the synthesis of RNA from a DNA
template. In RNA, U substitutes for T, and all RNA nucleotides contain ribose rather
than deoxyribose as in DNA. In the case of DNA sequence alignments, however,
diagrams show where two strands of DNA carry the same nucleotide at the same
position, and conserved bases are shown as a vertical line extending between the two
strands. Remember that, in DNA alignments, we are exploring the relationship of one
DNA strand’s order of bases to that of another strand with the intent of determining the
level of similarity between them and inferring the origin of that similarity.
The model-
1. Alignments as Sequences of Deletions, Insertions, and Substitutions
Mathematically, the pairwise DNA-sequence alignment problem begins by providing two
sequences S1 and S2 composed from the four characters A, C, G, or T. The following
sequences present an example:
S1: AGTGTTCCAG
S2: AATCGTTACAG
An alignment of two sequences is a record of the “edits” (or lack thereof) to the bases in
S1 that lead to the sequence S2. Allowable edit operations are:
Deletion (D): delete a base
Insertion (I): insert a new base
Substitution (S): replace a base with another base
Notice that a substitution may not represent an actual “edit” as it does not rule out that a
base may be “replaced” by exactly the same base. Biologically, this would correspond to
a match.
Example 1. Below are some possible alignments of the sequences S1 and S2 given as
tables of corresponding pairs, together with the list of operations L describing each
alignment. A hyphen in S1 represents an insertion, while a hyphen in S2 corresponds to a
deletion. The sequence L represents the transformation of S1 into S2 by enacting the edits
listed in L on S1 from left to right.
a)
S1: A-GTGTTCCAG
S2: AATCGTTACAG
L:  SISSSSSSSSS (1 insertion, 10 substitutions, 0 deletions)
b)
S1: -AGT-GTTCCAG
S2: AA-TCGTTACAG
L:  ISDSISSSSSSS (2 insertions, 9 substitutions, 1 deletion)
c)
S1: ---------AGTGTTCCAG
S2: AATCGTTACAG--------
L:  IIIIIIIIISSDDDDDDDD (9 insertions, 2 substitutions, 8 deletions)
□
Notice that not every sequence of edits represents an alignment. A sequence of edits is an alignment only when
the table of corresponding pairs results in two rows of equal length that use up all of S1 and all of S2. For instance,
L: IISSSSSSSSSS cannot represent an alignment for S1 and S2, since two insertions
in the beginning of S1 will result in a string of 12 characters that cannot be matched to a
string of 11 characters. If #I, #D, and #S denote respectively the number of insertions,
deletions, and substitutions in the sequence L, this observation leads to the following:
A sequence L represents an alignment for S1 and S2 only if #D + #S = 10 and #I + #S =
11.
Here every deletion or substitution uses one of the 10 bases of S1, and every insertion or
substitution uses one of the 11 bases of S2.
For sequences S1 and S2 of arbitrary lengths n and m respectively, this condition
generalizes to the following:
A sequence L represents an alignment for S1 and S2 only if #D + #S = n and #I + #S = m.
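The counting argument can be checked mechanically. What follows is a minimal Python sketch, not part of the original module; the function name pair_table is a hypothetical choice. It reads L from left to right, lets each S use one base of S1 and one of S2, each I one base of S2, and each D one base of S1, and returns the table of corresponding pairs, or None when L does not use up both sequences exactly.

```python
def pair_table(s1, s2, ops):
    """Return the two rows of the table of corresponding pairs for the
    edit string ops, or None if ops is not an alignment of s1 and s2."""
    i = j = 0              # next unread position in s1 and in s2
    row1, row2 = [], []
    for op in ops:
        if op == "S":      # substitution or match: one base from each
            if i >= len(s1) or j >= len(s2):
                return None
            row1.append(s1[i])
            row2.append(s2[j])
            i += 1
            j += 1
        elif op == "I":    # insertion: hyphen in S1, one base from S2
            if j >= len(s2):
                return None
            row1.append("-")
            row2.append(s2[j])
            j += 1
        elif op == "D":    # deletion: one base from S1, hyphen in S2
            if i >= len(s1):
                return None
            row1.append(s1[i])
            row2.append("-")
            i += 1
        else:
            raise ValueError(f"unknown operation {op!r}")
    if i != len(s1) or j != len(s2):
        return None        # some bases were never used
    return "".join(row1), "".join(row2)


S1 = "AGTGTTCCAG"
S2 = "AATCGTTACAG"
print(pair_table(S1, S2, "ISDSISSSSSSS"))   # Example 1-b
print(pair_table(S1, S2, "IISSSSSSSSSS"))   # None: not an alignment
```

For Example 1-b the function returns the rows -AGT-GTTCCAG and AA-TCGTTACAG, which is the table shown in the example.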
Exercise 1. Are all three alignments in Example 1 of equal biological importance?
Explain why or why not.
Exercise 2. Verify that the conditions #D + #S = 10 and #I + #S = 11 are satisfied for all
of the alignments presented in Example 1.
Exercise 3. Let S1 and S2 be the same sequences as above. Does each of the following sequences L
represent an alignment? Explain why or why not. In case L is an alignment, give the table
of corresponding pairs as in Example 1.
a) L: SSSISSSSSSS
b) L: ISSSSSSSSSS
c) L: SSSDSSSSSSII
d) L: SSIDSSSIDISS
Visualizing DNA strands
2. Alignments as Paths on a Graph
Once an alignment is understood as a sequence L formed of the characters I, D, and S, a
more visual way to represent that alignment is to view it as a path on the alignment graph
of the sequences S1 and S2. If S1 has length n and S2 has length m, the alignment graph
is given as a triangular lattice of height n and width m such as that in Figure 1. The DNA
bases in S1 label the rows of the lattice while the DNA bases in S2 label the columns.
The alignment graph of our example sequences S1 and S2 is presented in Figure 1A.
The alignment graph displays sequence 1 on the y-axis and sequence 2 on the x-axis,
placing each nucleotide in the order of their appearance on the strands being compared.
The alignment graph is a directed graph. In directed graphs, access from one vertex to an
adjacent vertex via a connecting edge is allowed only in the specified direction. For the
alignment graphs we just described, the possible directions are: one horizontal step from
left to right; one vertical step from top to bottom, and one diagonal step from the upper
left to the lower right corner. A horizontal directed edge (I) corresponds to a situation
where a base in one DNA strand has been inserted in a position not present in the second
strand. The practical outcome is that one DNA strand is longer than its comparator. A
vertical directed edge (D) corresponds to a situation where a base in one DNA strand has
been deleted in a position present in the comparison strand. A deletion yields a shorter
strand of DNA. A diagonal directed edge (S) represents a substitution, which includes a direct match of bases at the same
position in the comparator strands. These possibilities are depicted in Figure 2B.
A path between two vertices on the alignment graph is a walk along the edges of the
graph in the permissible directions that connects these two vertices. In what follows we
will only be interested in paths on the alignment graph from the upper left to the lower
right vertices of the graph. In Figure 1A those vertices are marked with a symbol.
Many such paths exist, of course. Figure 2 depicts three paths that connect these vertices.
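To get a feel for how many such paths there are, they can be counted without listing them. This is a minimal Python sketch, not part of the original module, and the name count_paths is hypothetical. Every vertex is entered from the vertex above it, the vertex to its left, or the vertex diagonally up and to the left, so the number of paths reaching a vertex is the sum of those three counts.

```python
def count_paths(n, m):
    """Number of directed paths from the upper left to the lower right
    vertex of an alignment graph of height n and width m."""
    # ways[r][c] = number of paths from vertex (0, 0) to vertex (r, c).
    # Vertices on the top row or left column are reached in exactly one way.
    ways = [[1] * (m + 1) for _ in range(n + 1)]
    for r in range(1, n + 1):
        for c in range(1, m + 1):
            ways[r][c] = (ways[r - 1][c]        # last step vertical
                          + ways[r][c - 1]      # last step horizontal
                          + ways[r - 1][c - 1]) # last step diagonal
    return ways[n][m]


print(count_paths(2, 2))     # 13
print(count_paths(10, 11))   # the graph for S1 and S2
```

Even a 2 x 2 graph already has 13 paths, and the count grows very quickly with the sequence lengths.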
Consider now Figure 1B again. If for any path on the alignment graph, a record is made
of the direction followed on each step, using the labels D, I, and S as depicted in the
figure, each path on the alignment graph can be viewed as a sequence formed from these
three characters. The black path in Figure 2 then corresponds to the sequence L:
SSIDDSSSDIISSI and the red path corresponds to L: DDDISSSSSSIIIID.
Notice that for these paths #D + #S = 10 and #I + #S = 11. The converse is also true: any
sequence formed of the characters D, I, and S for which #D + #S = 10 and #I + #S = 11
defines a path between the upper left and lower right vertices of the alignment graph. For
instance, the blue path in Figure 2 represents the alignment of S1 and S2 defined in
Example 1-b).
The following should now be clear: An alignment of two sequences S1 and S2 can be
represented by a path from the upper left to the lower right corner on the corresponding
alignment graph.
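The correspondence between edit sequences and paths can be sketched in a few lines of Python. This is an illustrative sketch, not part of the original module; the name path_vertices and the (row, column) coordinates with the upper left vertex at (0, 0) are assumptions made here. Each S moves one step down and one to the right, each I one step to the right, and each D one step down.

```python
def path_vertices(ops):
    """List of (row, column) vertices visited by the path for the edit
    string ops, starting at the upper left vertex (0, 0)."""
    step = {"S": (1, 1),   # diagonal edge
            "I": (0, 1),   # horizontal edge
            "D": (1, 0)}   # vertical edge
    r = c = 0
    path = [(r, c)]
    for op in ops:
        dr, dc = step[op]
        r, c = r + dr, c + dc
        path.append((r, c))
    return path


black = "SSIDDSSSDIISSI"
red = "DDDISSSSSSIIIID"
for name, ops in [("black", black), ("red", red)]:
    print(name, path_vertices(ops)[-1])   # both end at (10, 11)
```

A sequence L is an alignment of S1 and S2 exactly when its path ends at the lower right vertex, here (10, 11), since S1 has 10 bases and S2 has 11.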
Exercise 4. In Figure 2, trace the paths that correspond to the alignments from Example
1-a) and 1-c).
Directed edges provide a pictorial means of tracing the shortest path to an alignment
solution, with the steps dictated by an algorithm constructed to yield the most parsimonious
alignment. An alignment graph visually conveys the task an algorithm completes.
3. Alignments as Weighted Paths on a Graph
Although every path on the alignment graph between the upper left and lower right
corners corresponds to an alignment, there are many paths that correspond to biologically
meaningless alignments. For instance, a path that goes straight down to the bottom of the
graph and then all the way to the right (compare the alignment in Example 1-c, which consists
almost entirely of insertions and deletions) can be used to align any two sequences, even when there may not be a
single base-pair match. On the other hand, alignments corresponding to paths with more
diagonal steps are more likely to produce biologically meaningful matches. The goal is to
find alignments that have high biological likelihood.
One way to approach this problem is to assign a weight (penalty) to each of the edges in
the alignment graph with exact matches incurring no penalty. Insertions, deletions and
actual substitutions of bases, on the other hand, are penalized. The actual values of the
penalties are based on biological consideration. In this project, we use the following very
simple penalty scheme to illustrate the concept.
Exact matches are not penalized.
There is a constant penalty p for a substitution of one base with another (mutation).
There is a constant penalty x for an insertion.
There is a constant penalty y for a deletion.
A good way to display these on the alignment graph is to “color code” its diagonal edges.
Figure 2B shows edges that correspond to exact matches in green, and those
corresponding to mutations in red. Including a green edge in the path is not penalized
while including a red one is. Any horizontal edge included in the path incurs a penalty x and
any vertical edge included in the path incurs a penalty y.
The alignment graph with the weights described above defines the weighted alignment
graph for the sequences S1 and S2.
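The total penalty of a path is the sum of the weights of its edges. The following Python sketch, not part of the original module, adds them up for a given alignment L. The name alignment_penalty and the values p = 1, x = 2, y = 2 are hypothetical and chosen only for illustration; the function assumes that L is already a valid alignment of the two sequences.

```python
def alignment_penalty(s1, s2, ops, p, x, y):
    """Total penalty of the alignment ops of s1 and s2.
    p: substitution of one base with another, x: insertion, y: deletion.
    Assumes ops is a valid alignment of s1 and s2."""
    total = 0
    i = j = 0                      # next unread position in s1 and s2
    for op in ops:
        if op == "S":              # diagonal edge
            if s1[i] != s2[j]:     # red edge: actual substitution
                total += p         # (a green edge, an exact match, is free)
            i += 1
            j += 1
        elif op == "I":            # horizontal edge
            total += x
            j += 1
        elif op == "D":            # vertical edge
            total += y
            i += 1
        else:
            raise ValueError(f"unknown operation {op!r}")
    return total


S1 = "AGTGTTCCAG"
S2 = "AATCGTTACAG"
print(alignment_penalty(S1, S2, "SISSSSSSSSS", p=1, x=2, y=2))    # 5
print(alignment_penalty(S1, S2, "ISDSISSSSSSS", p=1, x=2, y=2))   # 7
```

With these illustrative weights, Example 1-a costs 3p + x = 5 (three mismatched pairs and one insertion), while Example 1-b costs p + 2x + y = 7. Different choices of p, x and y can reverse the ranking, which is why the penalties should reflect biological considerations.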
If you compare an alignment graph to the optimal alignment solution dictated by an
algorithm, you will see that they differ in the placement of sequence 1 bases in
comparison to sequence 2 [figure placeholder: optimal alignment solution and alignment
graph from Pachter and Sturmfels, p. 51-52]. The directed edges of the alignment graph are analogous to
possible base pair outcomes: substitution or direct match (S), insertion in one sequence (I), deletion in
one sequence (D). The optimal alignment reports the solution that best fits what the
algorithm dictates, listing sequence 1 and sequence 2 in parallel and symbolizing direct
matches as vertical lines between bases; insertions and deletions are depicted as *.
Optimal alignments represent hypotheses about how evolutionary processes may have
modified one sequence with respect to another.
When an investigator tasks an algorithm with providing an optimal alignment solution,
the gene sequences being compared regularly contain hundreds or thousands of bases. While
optimal solutions can be reported readily, alignment graphs of the same sequences would
be unreadable. For instance, the comparison of a sequence with 100 bases to another
with 100 bases would require a graph space occupying a 100 x 100 grid. Our smaller
sequence examples, limited to no more than 15 bases per sequence, enable you to
visualize the strategy behind how an optimal alignment solution is produced by an
alignment algorithm.
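The module does not say which algorithm its alignment tool uses. One standard strategy, shown here only as an illustrative Python sketch, is dynamic programming over the weighted alignment graph: the least penalty needed to reach each vertex is computed from its three predecessor vertices, and the optimal path is then traced backwards from the lower right vertex. The name optimal_alignment and the weights p = 1, x = 2, y = 2 are hypothetical.

```python
def optimal_alignment(s1, s2, p, x, y):
    """Least-penalty alignment of s1 and s2 as (penalty, edit string L).
    p: substitution of one base with another, x: insertion, y: deletion."""
    n, m = len(s1), len(s2)

    def diagonal(r, c):
        # Weight of the diagonal edge entering vertex (r, c).
        return 0 if s1[r - 1] == s2[c - 1] else p

    # cost[r][c] = least penalty of any path from (0, 0) to vertex (r, c)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for r in range(1, n + 1):
        cost[r][0] = r * y             # left column: deletions only
    for c in range(1, m + 1):
        cost[0][c] = c * x             # top row: insertions only
    for r in range(1, n + 1):
        for c in range(1, m + 1):
            cost[r][c] = min(cost[r - 1][c - 1] + diagonal(r, c),
                             cost[r][c - 1] + x,
                             cost[r - 1][c] + y)

    # Trace one optimal path back from the lower right vertex.
    ops = []
    r, c = n, m
    while r > 0 or c > 0:
        if r > 0 and c > 0 and cost[r][c] == cost[r - 1][c - 1] + diagonal(r, c):
            ops.append("S")
            r, c = r - 1, c - 1
        elif c > 0 and cost[r][c] == cost[r][c - 1] + x:
            ops.append("I")
            c -= 1
        else:
            ops.append("D")
            r -= 1
    return cost[n][m], "".join(reversed(ops))


penalty, L = optimal_alignment("AGTGTTCCAG", "AATCGTTACAG", p=1, x=2, y=2)
print(penalty, L)
```

The table has (n + 1) x (m + 1) entries, so two 100-base sequences need only about ten thousand small computations even though the graph itself would be unreadable. When several paths share the least penalty, this sketch reports just one of them, and the traceback compares penalties for exact equality, so integer penalties are the safe choice.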
Alignment tools ( http://www.seas.gwu.edu/~simhaweb/cs177/alignmentlecture/index.html ) -
Each entry lists the pair name (label for “select sequences”), then the gene and sequence for Sequence 1 and for Sequence 2.
Human Pax6 vs mosquito Pax6
  Sequence 1: Human Pax6 (NM_000280), CACAGCGGGGCCCGG
  Sequence 2: Mosquito XM_311087, CACTCGGGCGCCCGG
Human calneuron vs chimp calneuron
  Sequence 1: Human calneuron 1 (NM_001017440), GGACTTAGATGGGAG
  Sequence 2: Chimp calneuron 1 (XM_001142457), GGACTTGGATGAGAG
Human Sry vs mouse Sry
  Sequence 1: Human Sry (NM_003140), CTCAGAGATCAGCAA
  Sequence 2: Mouse Sry (NM_011564), TACAGAGATCAGCAA
Human huntingtin vs dog huntingtin
  Sequence 1: Human huntingtin (NM_002111), CAGTTTCTACACCCT
  Sequence 2: Dog huntingtin (XM_536221), CAGTTTCTATGGCCT
ELVIS vs LIVES
  Sequence 1: Human synapsin II (BC051307), CTCATTGTGGAAAGC
  Sequence 2: Human SLIT-ROBO Rho GTPase activating protein 1 (BC053903), GAACTAGTCATCAGC