GST II: ---Title - University of Missouri

advertisement
Phylogenetic Tree
Construction
Dong Xu
Computer Science Department
271C Life Sciences Center
1201 East Rollins Road
University of Missouri-Columbia
Columbia, MO 65211-2060
E-mail: xudong@missouri.edu
573-882-7064 (O)
http://digbio.missouri.edu
Outline
Evolution theory
 Concept of phylogeny
 Molecular clock
 Types of trees
 UPGMA
 Parsimony
 Maximum likelihood
 An example for bird flu

Evolution
 Many
theories of evolution
 Basic
idea:
speciation events lead to creation of
different species
Any two species share a (possibly
distant) common ancestor
Evolutionary Events

Extinction: A new node u is created at the end of a
lineage, no new lineage is started from u

Speciation: A new node u is created at the end of a
lineage, and two new lineages are started from u

Hybridization: A new node u is created
 when two lineages combine (diploid or polyploid)
 when one lineage creates u and the new lineage from
u has double the number of homologs (autopolyploid)
Tree of Life
http://tolweb.org/
Toxonomy
Glycine max
Taxonomy ID: 3847
Genbank common name: soybean
Rank: species
Genetic code: Translation table 1 (Standard)
Mitochondrial genetic code: Translation table
1 (Standard)
Other names:
common name:soybeans
Lineage( full )
cellular organisms; Eukaryota;
Viridiplantae; Streptophyta;
Streptophytina; Embryophyta; Tracheophyta;
Euphyllophyta; Spermatophyta;
Magnoliophyta; eudicotyledons; core
eudicotyledons; rosids; eurosids I;
Fabales; Fabaceae; Papilionoideae;
Kingdom Plantae
 Evolutionary
 From
tree of plants
primitive more advanced
traits
moncot
Gymnosperms
Non-vascular _______
Dicot 
Green
alga
ancestor
Flowers
Vascular 
__________
Monocot vs. dicot plants (1)
FEATURE
MONOCOTS
DICOTS
1
2
Leaf venation
parallel
broad
Root system
Fibrous
Tap
In 3’s
In 4’s or 5’s
Scattered
Arranged in a
circle
Herbaceous
Either
Cotyledons
Number of floral
parts
Vascular bundle
position
Woody or
herbaceous
Monocot vs. dicot plants (2)
 Number
of cotyledons: one vs. two
Monocot vs. dicot plants (3)
 Leaf
venation pattern:
 Monocot
 Dicot
is parallel
is net pattern
Monocot vs. dicot plants (4)

Flower parts:

Monocot: in groups of three

Dicot: in groups of four or five
Outline
Evolution theory
 Concept of phylogeny
 Molecular clock
 Types of trees
 UPGMA
 Parsimony
 Maximum likelihood
 An example for bird flu

Phylogenies (1)

A phylogeny is a tree that describes the
sequence of speciation events that lead to
the forming of a set of current day species
Aardvark Bison Chimp Dog
Elephant
Phylogenies (2)

Leafs - current day species

Nodes - hypothetical most recent common
ancestors

Edges length - “time” from one speciation
to the next
Primate Evolution
Tree Terminology
a
b
c
d
leaf
{a,b}
cluster
edge
{a,b,c}
{a,b,c,d}
internal node
root
Rooted/Unrooted Tree


Rooted trees

Single common ancestor

Requires more information
Unrooted trees

Objects are leaves

Internal nodes are some common ancestors

Insufficient information to tell whether not not a given
internal node is a common ancestor of any 2 leaves
Motivation

Understand the lineage of different species

Organizing principle to sort species into a
taxonomy

Understand how various functions evolved

Understand forces and constraints on evolution

Perform multiple sequence alignment

Predict gene function (phylogenetic footprint)
Tree Basis

Phylogenies are reconstructed based on
comparisons between present-day objects

Two main aspects
Topology

How its interior nodes connect to one another and
to the leaves
Distance

An estimate of the evolutionary distance between
the nodes
Assumptions
 homology
reflects common ancestry
 single common ancestor
 treelike relationship exists
 positional homology
 independent processes
 no reversals or convergence
 molecular clock
Outline
Evolution theory
 Concept of phylogeny
 Molecular clock
 Types of trees
 UPGMA
 Parsimony
 Maximum likelihood
 An example for bird flu

Molecular Clock Theory (1)

For any given protein, accepted mutations in the
amino acid sequence for the protein occur at
constant rate

Accepted = mutations that allow protein to
function without death

Implication
# of accepted mutations proportional to length of
time interval
i.e. relatively constant rate of accepted mutations
within a protein
Molecular Clock Theory (2)

Rate of accepted mutations maybe different
for different proteins (depending on their
tolerance for mutations)

Different parts of a protein may evolve at
different rates

Thus, if A and B differ by k accepted
mutations, then roughly k/2 mutations have
occurred since divergence
Science vol. 289
Outline
Evolution theory
 Concept of phylogeny
 Molecular clock
 Types of trees
 UPGMA
 Parsimony
 Maximum likelihood
 An example for bird flu

Species/Gene Trees (1)

Species tree (how are my species
related?)
 contains only one representative from each species
 when did speciation take place?
 all nodes indicate speciation events

Gene tree (how are my genes related?)
 normally contains a number of genes from a single
species
 nodes relate either to speciation or gene duplication
events
Species/Gene Trees (2)
• Your sequence data may not have the same
phylogenetic history as the species from
which they were isolated
•Different genes evolve at different speeds,
and there is always the possibility of
horizontal gene transfer (hybridization,
vector mediated DNA movement, or direct
uptake of DNA).
Morphological vs. Molecular
 Classical
phylogenetic analysis:
morphological features
number of legs, lengths of legs, etc.
 Modern
biological methods allow to
use molecular features
Gene sequences
Protein sequences
Dangers in Molecular
Phylogenies
Gene/protein sequence can be
homologous for different reasons:

Orthologs -- sequences diverged after a
speciation event

Paralogs -- sequences diverged after a
duplication event

Xenologs -- sequences diverged after a
horizontal transfer (e.g., by virus)
Ultrametric trees (1)
A metric on a set of objects O given by the
assignment of a real number d(x,y) to every pair
x,y in O
Ultrametric trees (2)
An ultrametric has to fulfill the additional
requirement
An ultrametric tree is characterized by the three
point condition
Additive Trees

Generalization of ultrametric trees
 # of mutations were assumed to be proportional to
temporal distance of a node to ancestor
 Also assumed, mutations took place at same rate in
all branches

Additive trees model different rates of
mutation along different branches
Additivity
In “real” tree, distances between
species are the sum of distances
between intermediate nodes
k
d (i , j )  a  b
c
a
i
m
b
d (i , k )  a  c
j
d ( j ,k )  b  c
1
c = d (m, k )  (d (i , k )  d ( j , k )  d (i , j ))
2
Phylogeny Construction
 parsimony
methods: fewest changes
 likelihood
methods: maximize the
probability
 distance
methods: based on pairwise
evolutionary distances (sequence
similarity, nucleotide composition,
etc.)
Outline
Evolution theory
 Concept of phylogeny
 Molecular clock
 Types of trees
 UPGMA
 Parsimony
 Maximum likelihood
 An example for bird flu

UPGMA
UPGMA is the unweighted pair group
method with arithmetic mean
 Distance matrix can come from (e.g) DNADNA hybridization, or be constructed from
sequence data etc.
 Iteratively group the most closely related
groups. The average distance between
elements in two groups is the distance
between the groups.

UPGMA Procedure
1.
find closest pair of units (species, to start
with)
2.
connect this pair, defining an
evolutionary unit (branch)
3.
compute distances from the ancestor of
this unit to all other ungrouped units -Branch length is distance/2
4.
go back to #1 and repeat
Evolutionary distances
among primates (1)
nucleotide substitutions per 100 sites
Chimp
Gorilla
Orang
Rhesus
Human
1.45
1.51
2.98
7.51
Chimp
Gorilla
Orang
1.57
2.98
7.55
3.04
7.39
7.10
Humans and chimps are closest:
lump them and recompute distances
H
C
Evolutionary distances
among primates (2)
Gorilla
Orang
Rhesus
H-C
1.54
2.98
7.53
Gorilla
Orang
3.04
7.39
7.10
e.g., (H-C) to gorilla distance
G
= (H-G+C-G)/2
= (1.51+1.57)/2 = 1.54
Gorilla is closest to H-C clade
(((H, C), 1.45), G, 1.54)
H
C
Evolutionary distances
among primates (3)
Orang
Rhesus
H-C-G
3
7.46
Orang
7.10
R O
 Human-ChimpGorilla is closer
to Orang than to Rhesus
G
H
C
UPGMA Clustering

Let Ci and Cj be clusters, define distance
between them to be
1
d (Ci ,C j ) 
d ( p, q )


| Ci || C j | pCi q C j

When we combine two cluster, Ci and Cj,
to form a new cluster Ck, then
d (C k , Cl ) 
| Ci | d (Ci , Cl ) | C j | d (C j , Cl )
| Ci |  | C j |
UPGMA: conclusions
 UPGMA gives
branch lengths or
evolutionary distances as well as
branching order
 if (a big if) mutations occur at a
constant rate, we can estimate dates
of divergence from sequence
differences
Outline
Evolution theory
 Concept of phylogeny
 Molecular clock
 Types of trees
 UPGMA
 Parsimony
 Maximum likelihood
 An example for bird flu

Possible Evolutionary Tree (1)
t1
1 three-taxa tree
t2
1*(2*3-3) = 3 four-taxa trees
t3
t1
t1
t3
t2
t2
t4
t3
t4
t2
t3
t1
t4
Possible Evolutionary Tree (2)
Taxa (n)
rooted
unrooted
(2n-3)!/(2n-2(n-2)!) (2n-5)!/(2n-3(n-3)!)
2
1
1
3
3
1
4
15
3
5
105
15
6
954
105
7
10,395
954
8
135,135
10,395
9
2,027,025
135,135
10
34,459,425
2,027,025
Possible Evolutionary Tree (3)
Taxa (n): 2
Taxa (n)
3
Unrooted/rooted
2
1/1
3
1/3
4
3/15
4
Maximum parsimony (1)
Minimizes the number of steps required to
generate the observed variation in the
sequences
 Guaranteed to find the "best" tree - danger
of over-fitting the data
 Columns representing greater variation
dominate
 Works best for small, highly conserved
sequences

Maximum parsimony (2)
Begin with a multiple sequence alignment
 Identify informative sites within the
sequences
 Tree requiring smallest number of changes
identified
 Repeat over all informative sites
 Length = sum of the # of steps in each
branch
 Choose tree with smallest length

Maximum parsimony (3)
Sequence position and character
Taxa
1
2
3
4
5
6
7
8
9
1
A
A
G
A
G
T
G
C
A
2
A
G
C
C
G
T
G
C
G
3
A
G
A
T
A
T
C
C
A
4
A
G
A
G
A
T
C
C
G
Maximum parsimony (4)
A
A ACGA
C
Tree 1
B
A
B ATGC
C GTGC
D
B
D GCAA
Tree 2
C
D
A
B
1 2 3 4 Total
Tree 3
D
C
Tree 1
1 2 1 2
6
Tree 2
2 2 1 2
7
Tree 3
2 1 1 1
5
Parsimony on
genomic sequence
Site
Human
Chimp
Gorilla
Orang
Recent branch
34
A
G
A
G
human-gorilla
560
C
C
A
A
human-chimp
1287
*
*
T
T
human-chimp
3057-
****
****
TAAT
TAAT
human-chimp
A
C
C
A
chimp-gorilla
3060
5153
human-chimp chimp-gorilla
12
3
human-gorilla
4
Outline
Evolution theory
 Concept of phylogeny
 Molecular clock
 Types of trees
 UPGMA
 Parsimony
 Maximum likelihood
 An example for bird flu

Probabilistic Approaches
to Phylogeny (1)
Notation and definitions:
Let P(x•|T,t•) denote the probability of a set
of data given a tree, where:
 x• denotes n sequences
 T denotes a tree with n leaves with sequence j at
leaf j
 t• denotes the edge lengths of the tree
The definition of P(x•|T,t•) depends on our
choice of model of evolution.
Probabilistic Approaches
to Phylogeny (2)
Let P(x|y,t) denote the probability that
sequence y evolves into x along an edge
of length t.
Assume that we can define P(x|y,t).
If we can do this for each edge of T  we
can calculate the probability of T.
Probabilistic Approaches
to Phylogeny (3)
Ridiculously simplistic model of evolution:
1. Every site is independent
2. Deletions and insertions do not occur
3. Substitution accounts for all evolution
Let P(b|a, t) denote the probability of the
substitution of residue b for residue a over an
edge length of t.
Extending to aligned gapless sequences x and y,
P(x | y, t) = PuP(xu|yu, t), where u indexes over
sites
Probabilistic Approaches
to Phylogeny (4)
P(x1,.., x5|T, t•) =
P(x1|x4,t1)P(x2|x4,t2)P(x3|x5,t3)P(x4|x5,t4)P(x5)
root
x5
t4
x4
t2
t3
t1
x2
x1
x3
Divergence time estimates for major groups. Thick bars on branches denote fossil record of
fungi; solid circles are calibration points. From Heckman et al. 2001. Science 293: 1132
Confidence Assessment
Bootstrap values
Bootstrapping is a statistical
technique that can use random
resampling of data to determine
sampling error for tree topologies
Bootstrapping phylogenies

Characters are resampled with replacement to
create many bootstrap replicate data sets

Each bootstrap replicate data set is analysed (e.g.
with parsimony, distance, ML etc.)

Agreement among the resulting trees is
summarized with a majority-rule consensus tree

Frequencies of occurrence of groups, bootstrap
proportions (BPs), are a measure of support for
those groups
Outline
Evolution theory
 Concept of phylogeny
 Molecular clock
 Types of trees
 UPGMA
 Parsimony
 Maximum likelihood
 An example for bird flu

Avian Influenza Viruses

Single strand

Negative RNA

Fragmented

Polymorphic
2003/2004 H5N1 Pandemic
 Highly pathogenic; can be transmitted to
people and some cases are fatal
 Virus: 8 genomic segments (PB1, PB2, PA,
HA, NP, NA, M, and NS) and genetic
reassortment
 DNA sequences
o South China Agricultural University, China
o Genbank
 Sources and evolution of flu viruses?
Outbreak History
HA
HA gene
Other 6 segments
(excluding PA) have a
similar tree structure
PA gene
Our analyses suggest that
2003-04 H5N1 pandemic
be caused by multiple
independent
transmissions with
multiple genotypes from
genetic reassortments.
(2001)
Reading Assignments

Suggested reading:
 Chapter 14 in “Warren J. Ewens and
Gregory R. Grant: Statistical Methods in
Bioinformatics – An Introduction. Springer.
2001”.

Optional reading:
 Chapter 17 in “Dan Gusfield: Algorithms on
Strings, Trees, and Sequences. Cambridge
University Press. 1997”.
Project Assignment
Develop a program that implement the
UPGMA algorithm
1. Modify your code in the assignment for global
alignment.
2. Use edit distance (match 1; otherwise 0) with
gap penalty –1 – k (k is gap size) for pairwise
sequence alignment.
3. Use the sequence identity as the tree
distance between leaves.
4. Input format: FASTA in one file.
5. Output format: (((a, b), d1), c, d2))…
Download