Education and Computational Biology 4.10

advertisement
Education and Computational Biology:
Incremental K-Leaf Powers, Cliques, and Phylogeny Reconstruction
Dean L. Zeller
Department of Computer Science
Kent State University
Spring, 2006

Abstract – The following is an investigation in areas of
phylogenetic reconstruction for the purpose of
encouraging education of algorithm analysis. Discussed
within is an alternate evolution model named the
incremental k-leaf power, serving as an extension of the kleaf power in regards to phylogenetic reconstruction.
Class assignments requiring little prior knowledge of
evolution trees have been created implementing these
concepts.
At this point, the discussion is purely
theoretical in nature and not based on any existing
genetic tests. The full text for the class assignments is
available from the author.
Index terms – graph theory, computational biology,
evolutionary tree, phylogeny reconstruction
“…the great Tree of Life fills with its dead and broken
branches the crust of the earth, and covers the surface with
its ever-branching and beautiful ramifications.”
Charles Darwin (1809-1882)
Father of Evolution
I. INTRODUCTION
Phylogenetic analysis is an important tool
in evolutionary biology and bioinformatics.
Reconstruction of phylogenetic trees and their
subsequent analyses provide important
insights into the evolutionary history of
organisms, as well as provide clues about
mechanisms governing evolutionary changes
in
genes
and
genomes.
Generally,
phylogenetic relationships among taxa are
represented in a bifurcating tree format, with
or without root, where terminal nodes
represent extant taxonomic units and interior
nodes represent inferred ancestral units. These
nodes are connected via internal and external
branches. A node is bifurcating when it has
Manuscript received May 11th, 2006.
This work is for
presentation at the OCCBIO ’06 converence at Ohio University (June
28-30, 2006).
D. L. Zeller is in the computer science Ph.D. program at Kent
State University, Kent, OH.
(phone: 330-343-2877, email:
dzeller@cs.kent.edu)
only two immediate descendant lineages, and
multifurcating when it has more than two
immediate descendants. It should be noted
that a bifurcating representation is often
insufficient to represent all aspects of
evolution.
Horizontal gene transfer and
hybrid
speciation
are
evolutionary
occurrences do not follow a bifurcating
pattern. There are other ways to represent
these relationships such as phylogenetic
networks.
Computational biology is a field
commonly regarded as complicated. Some
problems require a great deal of previous
knowledge and experience to comprehend.
But there are those problems in
bioinformatics easy enough to analyze
without requiring an advanced degree in
biology, mathematics, or computer science.
This paper dwells into developing lessons in
computational biology that are easy to
understand at all educational levels. These
assignments can be used as part of a
curriculum for the next generation of
computer scientists.
Section II gives background in evolution
trees, k-leaf powers, and cliques. Section III
discusses the class assignments that have been
tested with favorable results on introductorylevel computer science students.
II. BACKGROUND
A. Evolution trees
An evolutionary tree (or phylogeny) is a
tree structure that represents the genetic
relationships among species.
Each leaf
represents one species, while internal nodes
represent the common ancestors between leaf
species. As evolution can be represented
using a tree model, it is a topic of particular
interest to both biologists and computer
scientists. Advancements in theory for one
area can easily apply to the other. An
evolutionary tree is a common method to
organize inheritance within object oriented
program design (OOD). It has been the topic
of hundreds of papers published each year.
As technology has advanced, man’s ability to
build and represent evolutionary trees has
increased in complexity. Computer programs
are able to store and analyze evolutionary data
faster than before. It is with technological
and algorithmic advancements that geneticists
were able to create the GENOME project, a
complete genetic mapping of the human body.
The importance of further development in this
area is staggering.
It has become a
controversial
topic
among
doctors,
philosophers, and even religions. Figure 1
shows an example of an evolutionary tree
using existing species.
monkey
cat
weasel
dog
bear
seal sea
lion
raccoon
Figure 1: Evolution Tree example [3]
One method for solving computational
biology problems is maximum-parsimony.
This method uses character-based treebuilding algorithms to find the evolutionary
tree that requires the fewest changes to
explain evolution of the observed species.
While this method is interesting and useful, it
will be the topic of a future paper. A second
method for solving computational biology
problems is distance-based. With a distance
matrix, one can represent dissimilarity of
species. Using M as an nn diagonally
symmetric matrix, M[i,j] is a measure of
dissimilarity between species i and j.
B. K-Leaf Powers and K-Leaf Roots
As previously stated, a bifurcating tree is
an incomplete model of evolution.
An
alternative model of evolution is the k-leaf
power graph and corresponding k-leaf root. A
tree T is converted into a k-leaf power Gk for a
specified k-value by connecting an edge (u,v)
in Gk for any pair of vertices not more than a
specified k-value. In mathematical terms,
dT(x,y)  k  (x,y)  Gk.
a
b
f
c
c d
e
d
k-leaf power
a b
e
k-leaf root
Figure 2: k-leaf power and k-leaf root
The problem of converting a given tree
into a k-leaf power is trivial. A brute-force
algorithm simply calculates the distance
between every vertex pair within T and
assigns the edges in Gk accordingly. The
interesting problem of research is to go the
other direction: given a k-leaf power Gk,
generate the tree structure, called the k-leaf
root. This problem is far more complex, as
not all trees have a corresponding k-leaf
power. Brandstädt and Le wrote linear time
solutions for k=3 [1] and k=4 [2], but k5
remains an open problem. It should be noted
that these algorithms are sufficiently complex
in nature to require advanced knowledge of
algorithm analysis. Figure 2 shows a simple
example of the conversion of a k-leaf power
into a k-leaf root. The class assignments
described below are a deeper look into the
relationship between k-leaf powers and k-leaf
f
roots through an incremental k-leaf power
evolution model.
the format of Atlas to create a similar
reference of k-leaf powers and k-leaf roots.
C. Cliques
A clique is defined as a subset of mutually
adjacent vertices. It is observed that as a
k-leaf power builds incrementally for a given
tree, new members to cliques are introduced.
It is useful to indicate the cliques in the k-leaf
power diagrams. The incremental formation
of cliques give information on the structure of
the k-leaf root. Figure 3 shows examples of
cliques of up to six vertices.
A. Assignment 1 – Evolution Trees
A few basic assumptions will greatly
reduce the problem complexity, allowing for
greater in-depth analysis. In a bifurcating tree
it is assumed that the only possible point of
evolution is a parent with two children. A
single child (also called a redundant node)
can be removed without losing any
evolutionary data. It is given that the species
is continually changing over time through
evolution and mutation, and thus a single
node in between two nodes does not add to
the information portrayed in the phylogeny.
It is also assumed a population will not split
into more than two species. While there are
cases in evolution where this happens, it
greatly adds to the complexity of the problem.
Multifurcating tree sections are replaced with
isomorphic approximations. Unlike the first
assumption, there is some loss of detail.
Figure 3: Cliques
III. CLASS ASSIGNMENTS
The basis for this portion of the paper was
inspired by a book named An Atlas of Graphs
[5]. In 1999, Read and Wilson released this
reference of every possible topology for a
variety of graphs, including trees, binary
trees, outerplanar graphs, polyhedron graphic
representations, and many others. At first
glance, it seemed to be a simple collection of
graphs, along with some analysis. However,
the message portrayed by the pages and pages
of morphing graphs was far deeper and more
complex than a basic reference. Studying the
method used to create every possible model
topology can teach volumes on the structure
and nature of the graph type specified. It was
a simple, yet elegant method of portraying
information, as if the graphs themselves
represented the words of a paragraph. If a
picture is worth a thousand words, imagine
the amount of information contained within a
series of sequential graphs, all designed to
show an idea. Assignments 1 and 2 mimic
a) Redundant node
b) Isomorphic approximations
=
=
c) Isomorphic trees
Figure 4: Assumptions
The third assumption is to consider only
isomorphically unique trees. This analysis is
currently at the theoretical level only, and
thus the labels are abstract.
As such,
isomorphic trees are considered equal, and
only one needs analysis. These assumptions
greatly limit the number of tree patterns that
can be created through reconstruction. Figure
4 shows these assumptions graphically.
Assignment 1 creates displays on all evolution
trees of up to seven leaves, following the
assumptions described above.
Example 1
Phylogeny T is converted in the k-leaf powers for kvalues of 2 through 8.
T
a
G2
b
i
a
i
c
h
f
g
b cd e g h
d
e
f
{bc}{de}{gh}
a
G3
i
a
G4
b
b
i
c h
h
g
d
c
g
e
f
{abcde}{ai}{fghi}
a
c
c
h
g
d
d
e
f
{abcde}{afghi}
a
c
d
f
b
i
h
g
a
G8
b
i
e
f
{abcde}{afi}{fghi}
G7
b
i
h
g
a
G6
b
i
e
f
{bc}{de}{fgh}{fi}
G5
d
e
{abcdef}{abi}{aghi}
c
h
g
d
f
e
{abcdefghi}
B. Assignment 2 – Incremental K-leaf powers
Reconstruction of a phylogeny from a
k-leaf power requires connected vertices to
form a clique. For k=2, cliques contain only
two vertices. For k=3, cliques can have up to
three vertices, forming a triangle structure.
As the k-value increases, so does the
maximum possible size of the clique. One
key to phylogeny reconstruction is to identify
the cliques.
In Example 1, a single phylogeny is
converted incrementally into various k-leaf
powers for k-values of 2 through 8, resulting
in graphs G2 through G8. In G2 the edges are
connected only for those cases where the
vertices have the same parent. G3 has more
edges than in G2, identifying close relations,
but it is still not a connected graph. At G4, all
leaf vertices from the phylogeny are
connected. Note the left and right “clusters”
of edges, and the similarity to the structure of
T. G5 is similar to G4, with one additional
edge. G6 is a fuller graph than G5, and G7 is a
nearly complete graph. Since no two leaves
in T are more than eight separations apart, G8
is a complete graph where all vertices form a
clique. In this case, it appears that G4 has the
most structural information useful to a
geneticist. Observe the cliques described in
Example 1. For k=2, the cliques are {ab} and
{ef}. At k=3, the {ab} clique obtains c as a
new member, and a similar situation with {ef}
acquiring d. This process is repeated each
time the k-value increases. Assignment 2
creates the incremental k-leaf powers for each
tree generated in assignment 1.
C. Assignment 3 – Phylogeny Reconstruction
Conceivably, the genetic test could
contain more accurate information than just a
1 or a 0, resulting in discrete values as
categories of closeness. The k-leaf power
reconstruction algorithms in [1] and [2] do not
make any assumptions on how closely related
two species are, only that they are genetically
compatible to a certain degree. With a
discrete test, the closeness value could be
used to more quickly and accurately
reconstruct the phylogeny. By connecting
those species registering closer according to
the test, complex algorithms may not be
necessary for reconstruction. Example 2
demonstrates how results from a discretevalue test on all distinct pairs of six species
collected in the summary table can be used to
reconstruct a phylogeny incrementally. This
is a contrived example that will convert into a
phylogeny; there are many examples of data
that do not convert as easily. Once the data is
collected, the tree is built up from the lowest
values first. When all species in the tree are
connected, the tree is complete and no further
analysis is necessary. Assignment 3
reconstructs a phylogeny from a variety of
data tables using the incremental k-leaf power
approach.
Example 2
Input: Results from discrete similarity experiments
Output: Incremental k-leaf powers and k-leaf roots for
values of 2 through 6.
Diff(a,b)
Diff(a,c)
Diff(a,d)
Diff(a,e)
Diff(a,f)
Diff(b,c)
Diff(b,d)
Diff(b,e)
Diff(b,f)
a
b
c
d
e
f









Diff(c,d)
Diff(c,e)
Diff(c,f)
Diff(d,e)
Diff(d,f)
Diff(e,f)
Difference Summary Table
a
b
c
d
e
2
3
5
6
3
5
6
4
5
3
2-leaf power
a
2
3
5
6
6
3
5
6
6






4
5
5
3
3
2
f
6
6
5
3
2
2-leaf root
b
CONCLUSIONS
It is not enough to simply learn and
discuss the methods other researchers have
done in the past. Publishable results must be
found to further develop work in this area.
Few subjects have greater need to be explored
than evolutionary trees. Biologists use
evolutionary trees to represent the history of
life on Earth, and as more genomic data
become available, we need better and more
efficient methods to do phylogenetic analyses.
The evolutionary tree problems discussed in
this paper need refinement, clarity, and
further development.
New methods of
analyzing genetic data must be discovered.
The field of computer science is also
evolving. Fast, reliable biological computers
will replace the slower electronic models. It
is our job as educators to prepare the next
generation of computer programmers.
f
c
e
a b c d e
d
3-leaf power
a
3-leaf root
b
f
c
e
c d
a b
d
4-leaf power
a
e
f
4-leaf root
b
f
c
e
f
d
c d
a b
e
f
ACKNOWLEDGEMENTS
The author would like to appreciate the
effort portrayed by the following CS10051
students for successfully completing all three
parts of the class assignments described in
this paper: Lindsey Braun, Rachel Holloman,
Ryan Lundin, Daniel Rubin, Kelly
Schobinger, and Thomas Zgonc.
REFERENCES
[1] Brandstädt, A. and V.B. Le, Structure and Linear
Time Recognition of 3-Leaf Powers, Information
Processing Letters (98), 133-138, 2006.
[2] Brandstädt, A., V.B. Le, and R. Sritharan,
Structure and Linear Time Recognition of 4-Leaf
Powers, submitted for publication, 2006.
[3] Felsenstein, J, Inferring Phylogenies, Sinauer
Associates, Inc., 2004.
[4] Gusfield, D, Algorithms on Strings, Trees, and
Sequences, Cambridge University Press, 1997.
[5] Read, R.C., and R.J. Wilson, An Atlas of Graphs,
Oxford Science Publications, 1999.
[6] Wu, B.Y. and K.M. Chao, Spanning Trees and
Optimization Problems, Chapman & Hall/CRC,
2004.
Dean L. Zeller is in the computer
science Ph.D. program at Kent State
University. He achieved his Masters
of Science in computer science from
Bowling Green State University in
1996, as well as his Bachelors of
Science in mathematics and computer
science in 1992.
He has ten years of teaching
experience in the education industry.
Eight of the ten years have been in
higher
education
environments,
teaching classes introducing new students to computers, mid-level
programming classes such as data structures and computer
architecture, and up to advanced courses on compiler design,
artificial intelligence, and computability theory. He also has two
years of experience teaching to high- and middle-school students.
He has published an article on the Minimization statistical algorithm
in 1997 in the Journal of Nursing Research, teaming up with the
Department of Nursing at Case Western Reserve University. His
research interests include: graph theory, statistical algorithms, user
interface design, and computer science education.
Download