Molecular phylogeny, part B

7.2 Sequence alignment is the essential preliminary to tree construction.
This is the most important step in molecular phylogeny and a number of issues have to be
considered:
• Sequence homologs: Sequences that are to be aligned should be homologs. An example is
the β-globin genes of different vertebrates. This satisfies the phylogenetic criterion
that the sequences be derived from a common ancestral sequence.
• Non-homologous sequences: If the sequences are not homologous, and hence do not share a
common ancestor, phylogenetic construction methods will still produce a tree, but the
tree will not be of any biological relevance. This type of error commonly occurs when
undertaking homology analysis to assign functions to newly generated gene sequences.
BLAST is used extensively as one of the homology analysis methods, and hence interpretation
of the data arising from the analysis should be undertaken with care.
• Easy alignments: Correctly aligning the homologous sequences is the next task. In some
cases it is an easy task. A simple sequence alignment is shown below:

Sequence 1   AGCAATGGCCAGACAATAATG
Sequence 2   AGCTATGGACAGACATTAATG
             *** **** ****** *****

• Difficult alignments: If sequences have diverged by accumulating insertions and
deletions as well as point mutations, then they are not always easy to align.
Insertions and deletions cannot be distinguished when pairs of sequences are aligned, so we
refer to them as indels. Below is a pair of sequences that is difficult to align because
the indel can be placed at more than one location.

Two possible positions for the indel:

Sequence 1   GACGACCATAGACCAGCATAG
Sequence 2   GACTACCATAGA-CTGCAAAG
             *** ******** * *** **

Sequence 1   GACGACCATAGACCAGCATAG
Sequence 2   GACTACCATAGACT-GCAAAG
             *** *********  *** **
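
The two candidate placements of the indel can be compared by counting identical columns. The following is a minimal Python sketch of that count; the function name `count_identities` is hypothetical and is not part of the notes.

```python
def count_identities(seq1, seq2):
    """Count columns where two gapped, equal-length sequences carry the same base."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    # A gap character never counts as an identity.
    return sum(1 for a, b in zip(seq1, seq2) if a == b and a != "-")


sequence_1 = "GACGACCATAGACCAGCATAG"
option_1 = "GACTACCATAGA-CTGCAAAG"  # indel placed at column 13
option_2 = "GACTACCATAGACT-GCAAAG"  # indel placed at column 15

identities_1 = count_identities(sequence_1, option_1)
identities_2 = count_identities(sequence_1, option_2)
print(identities_1, identities_2)
```

Both placements give 17 identical columns, which is why counting identities alone cannot decide where the indel belongs.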
• The dot matrix technique for alignment: Some alignments can easily be done by
"eyeballing" the sequences, yet others may require a pen and paper. The simplest
pen-and-paper approach is the dot matrix method. The two sequences are written out on the
x- and y-axes of a sheet of graph paper, and dots are placed in the squares at the
positions corresponding to identical nucleotides in the two sequences. The alignment is
indicated by a diagonal series of dots, broken by empty squares where the sequences have
nucleotide differences and shifting from one column to another where indels occur.

Figure: The dot matrix technique for sequence alignments. An indel is shown by a shift in
the column; a break in the diagonal of dots indicates a point mutation.
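
The pen-and-paper procedure can be imitated in a few lines of Python. This is only a sketch: the function name `dot_matrix` and the choice of `*` and `.` as plotting symbols are assumptions made for illustration. The two sequences are the difficult pair shown earlier, with the gap character removed from the second.

```python
def dot_matrix(seq_x, seq_y):
    """Return one text row per base of seq_y; '*' marks a base identical to seq_x."""
    return ["".join("*" if x == y else "." for x in seq_x) for y in seq_y]


seq_x = "GACGACCATAGACCAGCATAG"  # written along the x-axis
seq_y = "GACTACCATAGACTGCAAAG"   # written along the y-axis

rows = dot_matrix(seq_x, seq_y)
print("  " + seq_x)
for base, row in zip(seq_y, rows):
    print(base + " " + row)
```

In the printed grid the alignment appears as a diagonal run of `*` symbols. Because only four bases exist, many off-diagonal squares are also marked by chance, so the diagonal has to be picked out from this background.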



• Similarity approach is a mathematically based alignment technique: The similarity
approach (Needleman and Wunsch, 1970) aims to maximise the number of identical
matched nucleotides in the two sequences. The distance method (Waterman, 1976), on the
other hand, minimises the number of mismatches. Often the two approaches will identify
the same alignment as being the best one.
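
The similarity approach can be sketched as a small dynamic-programming routine in the style of Needleman and Wunsch. The scoring values used here (match = 1, mismatch = 0, gap = -1) are assumptions chosen for illustration and are not taken from the notes; real programs use more elaborate scoring schemes.

```python
def needleman_wunsch(s1, s2, match=1, mismatch=0, gap=-1):
    """Globally align s1 and s2; return (score, aligned_s1, aligned_s2)."""
    rows, cols = len(s1) + 1, len(s2) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair_score(i, j):
        return match if s1[i - 1] == s2[j - 1] else mismatch

    # Fill the table: each cell holds the best score for the two prefixes.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # align two bases
                score[i - 1][j] + gap,                   # gap in s2
                score[i][j - 1] + gap,                   # gap in s1
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out1, out2 = [], []
    i, j = len(s1), len(s2)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            out1.append(s1[i - 1])
            out2.append(s2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out1.append(s1[i - 1])
            out2.append("-")
            i -= 1
        else:
            out1.append("-")
            out2.append(s2[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out1)), "".join(reversed(out2))


best_score, aligned_1, aligned_2 = needleman_wunsch(
    "AGCAATGGCCAGACAATAATG", "AGCTATGGACAGACATTAATG"
)
print(best_score)
print(aligned_1)
print(aligned_2)
```

For the easy pair of sequences shown earlier, the best alignment under these assumed scores contains no gaps and has 18 identical positions, which is the same alignment that has the fewest mismatches (3).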
• Multiple alignments are generated for more than two sequences: Rarely can one do
multiple alignments with a pen and paper, and all the steps required for phylogenetic
analysis are undertaken on a computer. Several computer programs are available for
automatically generating multiple alignments (discussed later).
• rRNA genes (aka rDNA) and rRNA have been used as molecular chronometers in
phylogenetic studies. Refer to the section on rRNA for detailed notes on the
methods of aligning these types of nucleic acids.
7.3 Converting the alignment data into a phylogenetic tree
• This step is undertaken after an accurate alignment of homologous sequences has been
generated.

• To date no one has devised a perfect method for tree construction, and several methods
are used. Extensive comparative tests have been conducted with test sequences, yet these
tests have failed to identify any particular method as better than the others.

• The main distinction between the different tree-building methods is the way in which
the multiple sequence alignment is converted into numerical data that can be analysed
mathematically in order to construct a tree.
7.3.1 Distance Matrix methods
7.3.1.1 Least squares distance matrix (modified Jukes & Cantor algorithm)
Step 1: Generating a similarity matrix.
Given below is an example alignment of 5 sequences with 25 positions in the alignment:
Seq A   AGAUUCGUCUGUAGGUUUCCACCAA
Seq B   ACAUUCGUGUAUAGGUUUCCACUAA
Seq C   ACAUUCGUGUAGAGGUUUCCACUAA
Seq D   AAGUUCGCUUGGAGGUUUCCACGAA
Seq E   AUCGUGAGAUCCAGGUAUCCACAAU
The first step in the least squares distance matrix method is to generate a similarity matrix.
For this, count the number of identical bases in every pair of sequences in the alignment. For
example, the number of identical bases between Seq A and Seq B is 21 out of a total of 25.
Therefore the similarity between Seq A and Seq B is 21 / 25 = 0.84.
Seq A   AGAUUCGUCUGUAGGUUUCCACCAA
        |X||||||X|X|||||||||||X||
Seq B   ACAUUCGUGUAUAGGUUUCCACUAA
Repeating this count for each pair of sequences gives the similarity matrix, which can be
set out as a table as shown below.
        A       B       C       D       E
A     ----
B     0.84    ----
C     0.80    0.96    ----
D     0.76    0.72    0.76    ----
E     0.52    0.52    0.52    0.52    ----

From this table, it can be seen that sequences A and B are 0.84 (= 84%) similar, A and C are
0.80 (= 80%) similar, B and C are 0.96 (= 96%) similar, and so on.
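
The pairwise counting in Step 1 can be reproduced with a short Python sketch. The helper name `similarity` and the dictionary layout are assumptions made for illustration; the sequences are the five from the example alignment.

```python
from itertools import combinations

alignment = {
    "A": "AGAUUCGUCUGUAGGUUUCCACCAA",
    "B": "ACAUUCGUGUAUAGGUUUCCACUAA",
    "C": "ACAUUCGUGUAGAGGUUUCCACUAA",
    "D": "AAGUUCGCUUGGAGGUUUCCACGAA",
    "E": "AUCGUGAGAUCCAGGUAUCCACAAU",
}


def similarity(seq1, seq2):
    """Fraction of alignment positions at which the two sequences are identical."""
    matches = sum(1 for a, b in zip(seq1, seq2) if a == b)
    return matches / len(seq1)


# One entry per unordered pair of sequences, e.g. ("A", "B").
similarities = {
    (x, y): similarity(alignment[x], alignment[y])
    for x, y in combinations(alignment, 2)
}

for pair, value in similarities.items():
    print(pair, f"{value:.2f}")
```

The printed values match the similarity table, for example 0.84 for A and B and 0.96 for B and C.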
Step 2: Conversion of similarities to evolutionary distances.
Next is the calculation of evolutionary distances from the sequence similarities. Conversion of
similarities to evolutionary distances starts with 1 - similarity (i.e. converting similarity to
difference), which is then usually corrected for the underestimation caused by
multiple substitutions (see the box below).
Understanding Multiple Substitutions
Multiple substitution occurs when a single site undergoes two or more changes, as shown in the example.
Ancestral sequence:   ....ATGT....
Modern sequences:     ....AGGT....   ....ACGT....
There is only one nucleotide difference between the two modern sequences, but two nucleotide substitutions
have actually occurred. If this multiple hit is not recognised, then the evolutionary distance between the two
sequences will be significantly underestimated. Distance matrices are therefore usually constructed using
mathematical methods that include statistical approaches for estimating the amount of multiple substitution
that has occurred, as explained below.
Evolutionary dissimilarity is usually corrected (fudged) because it is an underestimate of the
actual evolutionary distance. Counting differences between two sequences underestimates the
number of changes that occurred between them, because more than one evolutionary change at
a single position (e.g. A -> G -> U) counts as only one difference between two sequences, and
in the case of reversion counts as no change at all (e.g. A -> G -> A). One way to correct
evolutionary distances is the Jukes & Cantor method. This method is a conversion relating
similarity to evolutionary distance such that difference (dissimilarity) and distance are very
close initially, but the distance climbs ever more steeply as similarity falls and becomes
infinite at 0.25 similarity. This makes sense: in two sequences that are very similar, the
frequency of multiple changes at a single site is low, requiring only a small correction,
whereas two random sequences will appear to be 25% similar, just because there are only
4 bases and 1 out of 4 will match by chance.
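
As a sketch of the correction, the code below assumes the standard one-parameter Jukes & Cantor formula, d = -(3/4) ln(1 - (4/3) p), where p = 1 - similarity. The notes do not state the formula explicitly, so this form is an assumption, although it reproduces the corrected distances given in this section. The function name `jukes_cantor_distance` is hypothetical.

```python
import math


def jukes_cantor_distance(similarity):
    """Convert a fractional similarity into a Jukes & Cantor corrected distance."""
    p = 1.0 - similarity  # observed difference (dissimilarity)
    if p >= 0.75:
        # At or below 25% similarity the sequences look random: distance is infinite.
        return math.inf
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


for s in (0.96, 0.84, 0.80, 0.76, 0.72, 0.52):
    print(f"similarity {s:.2f}  difference {1 - s:.2f}  "
          f"corrected distance {jukes_cantor_distance(s):.2f}")
```

For 0.96 similarity the corrected distance (0.04) is essentially the same as the raw difference, whereas for 0.52 similarity the raw difference of 0.48 is corrected upwards to 0.77.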
With all of the similarities converted to evolutionary distances (whether or not they are
corrected, or how they are corrected), you have a distance matrix:
Corrected evolutionary distance (ED)

        A       B       C       D       E
A     ----
B     0.18    ----
C     0.23    0.04    ----
D     0.29    0.35    0.29    ----
E     0.77    0.77    0.77    0.77    ----
7.3.1.2 Neighbor-joining approach for building a tree from a distance matrix
The neighbor-joining method is a popular tree-building procedure that uses the distance matrix
generated by distance matrix methods as described above.
This is done by starting with two of the sequences, separated by a line equal in length to the
evolutionary distance between the sequences:
Then the next sequence is added to the tree such that the distances between A, B and C are
approximately equal to the evolutionary distances. Notice that the fit isn't perfect. If we could
determine the evolutionary distances exactly, they would fit the tree exactly, but since we have
to estimate these distances, the numbers are fit to the tree as closely as possible using a
least-squares best fit.
The next step is to add the next sequence, again re-adjusting the tree to fit the distances as well
as possible:
And at last we can add the final sequence and readjust the branch lengths one last time using
least-squares:
Notice that the distance between any two sequences is (approximately) equal to the sum of the
length of the line segments joining those two sequences - in other words, the tree is additive.
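
The notes describe tree building informally, as stepwise addition with a least-squares fit. For comparison, the sketch below applies the pair-selection criterion of the standard neighbor-joining algorithm of Saitou and Nei to the corrected distance matrix. This criterion is swapped in here as an assumption and is not the least-squares procedure described above; the names `pair_distance` and `nj_first_pair` are hypothetical. Only the first step (choosing which two sequences to join) is shown.

```python
from itertools import combinations

names = ["A", "B", "C", "D", "E"]
# Corrected evolutionary distances from the matrix in section 7.3.1.1.
dist = {
    ("A", "B"): 0.18, ("A", "C"): 0.23, ("B", "C"): 0.04,
    ("A", "D"): 0.29, ("B", "D"): 0.35, ("C", "D"): 0.29,
    ("A", "E"): 0.77, ("B", "E"): 0.77, ("C", "E"): 0.77, ("D", "E"): 0.77,
}


def pair_distance(x, y):
    """Look up a distance regardless of the order of the two names."""
    if x == y:
        return 0.0
    return dist[(x, y)] if (x, y) in dist else dist[(y, x)]


def nj_first_pair(taxa):
    """Return the pair with the smallest neighbor-joining Q value, plus all Q values."""
    n = len(taxa)
    totals = {x: sum(pair_distance(x, y) for y in taxa) for x in taxa}
    q = {
        (x, y): (n - 2) * pair_distance(x, y) - totals[x] - totals[y]
        for x, y in combinations(taxa, 2)
    }
    return min(q, key=q.get), q


best_pair, q_values = nj_first_pair(names)
print(best_pair, round(q_values[best_pair], 2))
```

With these distances the first pair to be joined is B and C, the two sequences with the smallest corrected distance between them.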
Interpretation of the phylogenetic tree:
One way to think about it is to imagine that you're looking down on a real tree, with the
branches spread out horizontally and the vertical trunk, coming up at you, is hidden from view
but probably somewhere near the center of the branches. The nodes connecting different sets of
branches represent common ancestors of those branches. This tree is unrooted - the single
common ancestor of all of the sequences cannot be determined in this tree. Some people prefer
dendrograms (see below) because evolutionary distance is easily visualized. In this example,
sequences B and C are the most closely related. Each of these is somewhat less similar to A (a
little closer in the case of seq B; that's why the branch to B is shorter than to C). A, B, and C
are less similar to D, and E is only distantly related to the rest.
Here is another way of looking at the same tree: a dendrogram.
A dendrogram shows evolutionary distances along the horizontal axis and assumes a root
somewhere in the middle of the tree, in this case in the branch connecting sequence E to the
rest of the tree. Some people like this representation because the horizontal axis roughly
approximates time.
Whenever possible, it is best to include an outgroup sequence in the analysis; an outgroup is a
sequence that is known to be outside of the group you're interested in treeing. For example, if
you were building trees from mammalian sequences, you might include the sequence from a
reptile as an outgroup. Outgroups provide the root to the rest of the tree: although no tree
generated by these methods has a real root, if you know (from other information) that one of
the sequences lies outside the group formed by the rest, wherever that branch connects to the
rest of the tree defines the root (common ancestor) of that portion of the tree. In the example
above, sequences A - D might be mammalian whereas E might be a reptilian sequence. If the tree
included only mammalian sequences, it would be impossible to know where the root is, but the
inclusion of an outgroup provides that information.
7.3.2 Maximum Parsimony
• The neighbor-joining method is a simple way of creating trees because the information
content of the multiple alignment is reduced to its simplest form. Unfortunately, information
is lost as a result, in particular information about the ancestral identities at each position
in the multiple alignment.

• The maximum parsimony method utilises a model in which it is assumed that evolution
follows the shortest possible route, so that the correct phylogenetic tree is the one that
requires the minimum number of nucleotide changes to produce the observed differences
between the sequences. Trees are therefore constructed at random and the number of
nucleotide changes that each involves is calculated, until all topologies have been examined
and the one requiring the smallest number of steps has been identified. This is presented as
the most likely inferred tree.
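
Counting the nucleotide changes that a given tree requires can be sketched with Fitch's counting rule. The notes do not name a specific algorithm, so the use of Fitch's rule, the nested-tuple tree representation and the function name `fitch_changes` are all assumptions made for illustration. The example uses position 12 of the five-sequence alignment from section 7.3.1.1.

```python
def fitch_changes(tree, states):
    """Return (possible ancestral bases, minimum number of changes) for one column.

    `tree` is either a sequence name (a tip) or a pair of subtrees.
    `states` maps each sequence name to its base at this alignment position.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_changes(left, states)
    right_set, right_cost = fitch_changes(right, states)
    shared = left_set & right_set
    if shared:
        # The two subtrees can share an ancestral base: no extra change needed.
        return shared, left_cost + right_cost
    # No base in common: at least one additional change is required here.
    return left_set | right_set, left_cost + right_cost + 1


# Bases at alignment position 12 for sequences A to E.
column_12 = {"A": "U", "B": "U", "C": "G", "D": "G", "E": "C"}

tree_ab_cd = ((("A", "B"), ("C", "D")), "E")
tree_ac_bd = ((("A", "C"), ("B", "D")), "E")

changes_ab_cd = fitch_changes(tree_ab_cd, column_12)[1]
changes_ac_bd = fitch_changes(tree_ac_bd, column_12)[1]
print(changes_ab_cd, changes_ac_bd)
```

At this position the tree that groups A with B and C with D needs 2 changes, whereas the tree that groups A with C and B with D needs 3. Summing such counts over every position of the alignment gives the total number of steps for a tree, which is the quantity that maximum parsimony minimises.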

• Maximum parsimony is more rigorous than neighbor joining but necessitates more data
handling. Adding more sequences means that more trees need to be examined. For example, with
five sequences there are only 15 possible unrooted trees, but with 10 sequences there are
2,027,025, and with 50 sequences the number is of the order of 10^74. Not even
supercomputers can evaluate all of these trees with the maximum parsimony method. This is
also true for the more sophisticated methods such as maximum likelihood and fastDNAml.
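
The tree counts quoted above follow from the standard formula for the number of unrooted bifurcating trees, (2n - 5)!! = 3 x 5 x ... x (2n - 5) for n sequences. A short sketch, with the hypothetical function name `count_unrooted_trees`:

```python
def count_unrooted_trees(n_sequences):
    """Number of distinct unrooted bifurcating trees for n_sequences (n >= 3)."""
    if n_sequences < 3:
        raise ValueError("at least three sequences are needed")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for factor in range(3, 2 * n_sequences - 4, 2):
        count *= factor
    return count


for n in (5, 10, 50):
    print(n, count_unrooted_trees(n))
```

This gives 15 trees for five sequences and 2,027,025 for ten; for 50 sequences the result is a 75-digit number.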
7.3.4 Maximum likelihood
NOT FOR YEAR 2000 LECTURES
7.3.5 Fast DNAml (Fast DNA maximum likelihood)
NOT FOR YEAR 2000 LECTURES
7.4 Bootstrapping: Assessing accuracy of a reconstructed tree
7.5 Molecular clocks enable the time of divergence of ancestral sequences to be estimated
GLOSSARY OF TERMS:
Allele: One of two or more alternative forms of a gene.
Allele frequency: The frequency of an allele in a population.
Allele-specific oligonucleotide (ASO) hybridization: The use of an oligonucleotide probe to
determine which of two alternative nucleotide sequences is contained in a DNA molecule.
Ancestral character state: A character state possessed by a remote common ancestor of a group
of organisms.
Ancient DNA: DNA preserved in ancient biological samples.
Bootstrapping or Bootstrap analysis: A method of inferring the degree of confidence that can
be assigned to a branch point in a phylogenetic tree.
CAP: The chemical modification at the 5'-end of most eucaryotic mRNA molecules.
CAP binding complex: The complex, also called eIF-4F and comprising the initiation factors
eIF-4A, eIF-4E and eIF-4G, which makes the initial attachment to the CAP structure at the
beginning of the scanning phase of eucaryotic translation.
Chimera: An organism composed of two or more genetically different cell types.
Chromosome walking: A technique that can be used to construct a clone contig by identifying
overlapping fragments of cloned DNA.
Clone contig: A collection of clones whose DNA fragments overlap.
Clone contig approach: A genome sequencing strategy in which the molecules to be sequenced
are broken into manageable segments, each a few hundred kb or few Mb in length, which are
sequenced individually.
Codon: A triplet of nucleotides coding for a single amino acid.
Codon bias: Refers to the fact that not all codons are used equally frequently in the genes of a
particular organism.
Concatemer: A DNA molecule made up of linear genomes linked head-to-tail.
Consensus sequence: A nucleotide sequence that represents an "average" of a number of related
but nonidentical sequences.
Contig: A contiguous set of overlapping DNA sequences.
Contour clamped homogeneous electric field (CHEF): An electrophoresis method used to
separate large DNA molecules.
Convergent evolution: The situation that occurs when the same character state evolves
independently in two lineages.
Degenerate: Refers to the fact that the genetic code has more than one codon for most amino
acids.
Derived character state: A character state that evolved in a recent ancestor of a subset of
organisms in a group being studied.
Directed evolution: A set of experimental techniques that is used to obtain novel genes with
improved products.
Discontinuous gene: A gene that is split into exons and introns.
Distance matrix: A table showing the evolutionary distances between all pairs of nucleotide
sequences in a dataset.
Distance method: A rigorous mathematical approach to aligning nucleotide sequences (by
minimising the number of mismatches).
Domain shuffling: Rearrangements of segments of one or more genes, each segment coding for
a structural domain in the gene product, to create a new gene.
Dominant: The allele that is expressed in a heterozygote.
Dot matrix: A method for aligning nucleotide sequences.
Exon: A coding region within a discontinuous gene.
Exon theory of genes: An "introns early" hypothesis which states that introns were formed when
the first DNA genomes were being formed.
Expressed Sequence Tag (EST): A cDNA that is sequenced in order to gain rapid access to
the genes in a genome.
External node: The end of a branch in a phylogenetic tree, representing one of the organisms or
DNA sequences being studied.
Field Inversion gel electrophoresis (FIGE):
Molecular Clock: A device based on the inferred mutation rate that enables times to be
assigned to the branch points in a gene tree.
Molecular evolution: The gradual changes that occur in genomes over time due to the
accumulation of mutations and structural rearrangements resulting from recombination and
transposition.
Molecular phylogenetics: A set of techniques that enable the evolutionary relationships
between DNA sequences to be inferred by making comparisons between those sequences.
Multigene family: A group of genes, clustered or dispersed, with related nucleotide sequences.
Multiple alignment: An alignment of three or more nucleotide sequences.
Multiple hit or multiple substitution: The situation that occurs when a single nucleotide in a
DNA sequence undergoes two mutational changes, giving rise to two new alleles, both of
which differ from each other and from the parent at that nucleotide position.
Multiregional evolution: A hypothesis that states that modern humans in the Old World are
descended from Homo erectus populations that left Africa over 1 million years ago.
Natural selection: The preservation of favourable alleles and the rejection of injurious ones.
Neighbor-Joining method: A method for the construction of phylogenetic trees.
Nuclear genome: The DNA molecules present in the nucleus of eucaryotic cells.
Open Reading Frame (ORF): A series of codons starting with an initiation codon and ending
with a termination codon. The part of the protein-coding region that is translated into protein.
Orphan family: A group of homologous genes whose functions are unknown.
OFAGE:
Orthologous: Refers to homologous genes located in the genomes of different organisms.
Outgroup: An organism or DNA sequence that is used to root a phylogenetic tree.
Overlapping genes: Two genes whose coding regions overlap.
Paralogous: Refers to two or more homologous genes located in the same genome.
Parsimony: An approach that decides between different phylogenetic tree topologies by
identifying the one that involves the shortest evolutionary pathway.
Phylogeny: A classification scheme that indicates evolutionary relationships between
organisms.
Proteome: The complete protein content of a cell.
Protogenome: An RNA genome that existed during the RNA world.
Selfish DNA: DNA that appears to have no function and apparently contributes nothing to the
cell in which it is found.
Sequence tagged site (STS): A DNA sequence that is unique in the genome.