Phylogenetics workshop 2

Phylogenetics workshop:
Protein sequence phylogeny
week 2
Darren Soanes
• Species trees
• Interpretation of trees
• Taxon sampling
• Tools
• Lateral (horizontal) gene transfer
• Fast evolving genes
Using DNA sequence to construct trees
TGCTATT
TGCTTTT
TGCTTTT
TGCTTTT – sequence change due to mutation
TGCTATT – ancestral DNA sequence
Reversals can confuse phylogenies
TGCTATT
TGCTTTT
TGCTTTT
TGCTTTT
TGCTATT
TGCTATT
reversal
TGCTTTT – sequence change
TGCTATT – ancestral DNA sequence
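The effect of a reversal can be illustrated with the sequences above. This is a minimal sketch, assuming a simple count of differing positions as the measure of divergence; the function name `hamming` is chosen here for illustration only.

```python
def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


ancestral = "TGCTATT"
mutated = "TGCTTTT"   # A -> T at the fifth position
reverted = "TGCTATT"  # T -> A at the same position (the reversal)

# One visible difference after the first mutation.
print(hamming(ancestral, mutated))
# No visible difference after the reversal, although two mutations occurred.
print(hamming(ancestral, reverted))
```

The lineage that reverted looks identical to the ancestor, so a tree built from these sequences alone would not record either mutation on that branch.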
To minimise the effect of reversals
• Use DNA sequences that are evolving slowly –
mutations happen rarely.
• Use long stretches of DNA.
• Align sequences, use the parts of the
alignment that show a high degree of
conservation.
• rDNA sequences (genes that encode
ribosomal RNA) are often used.
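Selecting the conserved parts of an alignment can be sketched as follows. This is an illustrative filter only, assuming the alignment is a list of equal-length strings with `-` for gaps; the 0.8 identity threshold and the function names are assumptions, not values from the workshop.

```python
from collections import Counter


def conserved_columns(alignment, min_identity=0.8):
    """Return indices of gap-free columns where the most common
    residue reaches at least `min_identity` of the sequences."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")
    keep = []
    for i in range(length):
        column = [seq[i] for seq in alignment]
        if "-" in column:
            continue  # skip columns containing gaps
        most_common_count = Counter(column).most_common(1)[0][1]
        if most_common_count / len(column) >= min_identity:
            keep.append(i)
    return keep


def trim(alignment, columns):
    """Keep only the selected columns in every sequence."""
    return ["".join(seq[i] for i in columns) for seq in alignment]


alignment = ["TGCTATT", "TGCTTTT", "TGCTTTT", "TGC-TTT"]
columns = conserved_columns(alignment)
print(columns)
print(trim(alignment, columns))
```

Dedicated trimming programs apply more careful criteria; the point here is only that variable and gapped columns are dropped before tree building.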
Species tree constructed using ribosomal
DNA (rDNA) sequence
Using protein sequences to create species
trees
• Advantages
– protein sequences evolve more slowly than DNA
sequences (many DNA mutations are neutral – they do not
change amino acid sequences)
– reversals are less common than in DNA
• Single copy protein encoding genes identified
• Protein sequences joined together to create a
multiple protein sequence for each species
• Sequences aligned
• Disadvantage – need sequenced genomes
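Joining the proteins can be sketched as below. This is a minimal example, assuming each single-copy gene is stored as a mapping from species name to protein sequence; the gene names, species names and sequences are made up for illustration. In practice each protein is often aligned separately before concatenation, which this sketch does not do.

```python
def concatenate(per_gene, species):
    """Join the proteins of each species in a fixed gene order.

    per_gene: dict mapping gene name -> {species: protein sequence}
    species:  list of species that must be present for every gene
    """
    joined = {sp: [] for sp in species}
    for gene in sorted(per_gene):  # fixed order so all species match
        sequences = per_gene[gene]
        missing = [sp for sp in species if sp not in sequences]
        if missing:
            raise ValueError(f"{gene} is missing from: {missing}")
        for sp in species:
            joined[sp].append(sequences[sp])
    return {sp: "".join(parts) for sp, parts in joined.items()}


# Hypothetical toy data: two genes, two species.
per_gene = {
    "geneA": {"sp1": "MKV", "sp2": "MKI"},
    "geneB": {"sp1": "LLP", "sp2": "LMP"},
}
print(concatenate(per_gene, ["sp1", "sp2"]))
```

Requiring every gene in every species reflects the single-copy condition above, and is also why sequenced genomes are needed.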
Fungal species trees – more proteins = better resolution
oomycete (not fungi)
microsporidia
30 proteins
plant
zygomycete
basidiomycetes
ascomycetes
yeasts
60 proteins
filamentous ascomycetes
Fungal Species Tree (based on 153 concatenated
protein sequences)
Clades
A clade consists of an ancestor
organism and all its descendants.
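Whether a group of taxa forms a clade can be checked mechanically. The sketch below assumes a rooted tree written as nested tuples with leaf names as strings; the taxa A to D are placeholders, not a real tree.

```python
def leaves(node):
    """Return the set of leaf names below a node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def is_clade(tree, taxa):
    """True if some node has exactly `taxa` as its descendants."""
    target = frozenset(taxa)
    if leaves(tree) == target:
        return True
    if isinstance(tree, str):
        return False
    return any(is_clade(child, target) for child in tree)


# Hypothetical rooted tree: ((A, B), C) with D as the outgroup.
tree = ((("A", "B"), "C"), "D")
print(is_clade(tree, {"A", "B"}))       # A and B share an ancestor with no other descendants
print(is_clade(tree, {"A", "B", "C"}))
print(is_clade(tree, {"B", "C"}))       # their common ancestor also leads to A
```

A group such as B and C fails the test because their common ancestor has another descendant, A, that the group leaves out.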
Gene trees
• The evolutionary history of genes can be
represented as phylogenetic trees based on
alignment of protein sequences.
• Gene duplication and loss can be inferred
from phylogenetic trees.
• Protein sequences evolve more slowly than
DNA sequences (due to redundancy in the genetic
code)
Gene duplication
• Gene duplication due to unequal crossing over
during meiosis can create gene families.
• Sequence and function of different members
of a gene family can diverge.
Gene duplication
Sequence homology (1)
• Genes are said to be homologous if they share
a common evolutionary ancestor.
• Orthologues are genes in different species
that evolved from a common ancestral gene
by speciation. Normally, orthologues retain
the same function in the course of evolution
(e.g. myoglobin in mammals).
Sequence homology (2)
• Paralogous genes are related by duplication within a
genome. Paralogues often evolve new functions,
even if these are related to the original one.
• In-paralogues: paralogues that arose by duplication
after a speciation event and are therefore in the same species.
• Out-paralogues: paralogues that arose by duplication
before a speciation event. They are not necessarily in the same
species.
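One common way to tell orthologues from paralogues is to look at the event at the most recent common ancestor of two genes in a gene tree: speciation gives orthologues, duplication gives paralogues. The sketch below assumes a binary gene tree written as `(event, left, right)` tuples with the events already labelled; the gene names are hypothetical.

```python
def leaf_set(node):
    """Return the set of gene names below a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaf_set(left) | leaf_set(right)


def relationship(tree, gene_a, gene_b):
    """Classify two genes by the event at their most recent common ancestor."""
    if gene_a == gene_b:
        raise ValueError("need two different genes")
    node = tree
    while not isinstance(node, str):
        event, left, right = node
        found_left = {gene_a, gene_b} & leaf_set(left)
        if len(found_left) == 2:
            node = left    # both genes are below the left child
        elif len(found_left) == 0:
            node = right   # both genes (if present) are below the right child
        else:
            # The genes split here, so this node is their common ancestor.
            return "orthologues" if event == "speciation" else "paralogues"
    raise ValueError("genes not found in tree")


# Hypothetical tree: a duplication produced alpha and beta copies,
# then species A and B separated.
tree = (
    "duplication",
    ("speciation", "A_alpha", "B_alpha"),
    ("speciation", "A_beta", "B_beta"),
)
print(relationship(tree, "A_alpha", "B_alpha"))
print(relationship(tree, "A_alpha", "B_beta"))
```

In this toy tree the duplication predates the speciation, so the alpha and beta copies are out-paralogues of each other, whether they are compared within one species or between species.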
Orthology and paralogy
Paralogues
A, B and C are different species
α and β are different paralogues of
the same gene
Out-paralogues
In-paralogues
Evolution of globin superfamily in human lineage
TOR gene duplication events in fungi
TOR: protein kinase,
subunit of a complex
that regulates cell growth
in response to nutrient
availability and cellular
stresses
Taxon sampling methods
• BLAST easiest – though subjective
• Occurrence of Pfam (protein family) motif
• Clustering e.g.
– INPARANOID http://inparanoid.sbc.su.se/cgibin/index.cgi
– orthoMCL http://www.orthomcl.org/cgibin/OrthoMclWeb.cgi
Minimum bootstrap
• A bootstrap value of 70% is thought to be broadly
similar to a P-value of 0.05
• Minimum bootstrap used depends on study
• To improve bootstrap support
– remove poorly aligned sequences if possible; poor
alignment can be due to mis-annotation of genomes.
– change taxon sampling
Collapse branches with bootstrap less
than defined value
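Collapsing weakly supported branches can be sketched as follows. The example assumes a tree stored as `(support, [children])` tuples, with leaf names as strings and `None` as the support of the root; this representation and the 70% threshold in the example are assumptions for illustration.

```python
def collapse(node, min_support):
    """Remove internal nodes whose bootstrap support is below
    `min_support`, attaching their children to the parent node."""
    if isinstance(node, str):
        return node
    support, children = node
    new_children = []
    for child in children:
        child = collapse(child, min_support)
        is_internal = not isinstance(child, str)
        if is_internal and child[0] is not None and child[0] < min_support:
            new_children.extend(child[1])  # promote grandchildren
        else:
            new_children.append(child)
    return (support, new_children)


# Hypothetical tree with bootstrap values on internal nodes.
tree = (None, [(95, ["A", "B"]), (40, ["C", (80, ["D", "E"])])])
print(collapse(tree, 70))
```

The node with 40% support disappears, leaving C and the well-supported (D, E) group attached directly to the root as an unresolved branching.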
Lateral gene transfer (purine-cytosine permease)
oomycete
fungi
Eukaryotic Tree of Life
Phytophthora sojae
Aspergillus oryzae
Genes that evolve quickly (1)
• Synonymous substitution – change in DNA
sequence that does not affect the amino acid
sequence, often in the third position of a
codon, e.g. CCG (Pro)→CCA (Pro).
• Non-synonymous substitution - change in DNA
sequence that does affect the amino acid
sequence, often in the first or second position
of a codon, e.g. CCG (Pro)→CAG (Gln).
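The two examples above can be checked by translating each codon with the standard genetic code. This is a minimal sketch; the helper names are chosen for illustration, and the table is built from the usual TCAG ordering of the standard code.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard genetic code, in TCAG order for
# first, second and third codon position ('*' marks a stop codon).
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify(codon_before, codon_after):
    """Label a codon change as synonymous or non-synonymous."""
    before = CODON_TABLE[codon_before.upper()]
    after = CODON_TABLE[codon_after.upper()]
    return "synonymous" if before == after else "non-synonymous"


print(classify("CCG", "CCA"))  # Pro -> Pro
print(classify("CCG", "CAG"))  # Pro -> Gln
```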
Genes that evolve quickly (2)
• For a given protein encoding gene (comparison
between orthologues in more than one species)
• dN = number of non-synonymous substitutions per non-synonymous site
• dS = number of synonymous substitutions per synonymous site
• We can calculate the ratio dN/dS.
• For most genes this is < 1
• Genes under evolutionary pressure to change
protein sequence (diversify), dN/dS > 1
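The arithmetic behind the ratio can be sketched as below. dN and dS are usually expressed per site, so the counts are divided by the number of non-synonymous and synonymous sites. The numbers in the example are invented, and the sketch applies no correction for multiple substitutions at the same site, which real methods do.

```python
def dn_ds(nonsyn_subs, nonsyn_sites, syn_subs, syn_sites):
    """Return dN/dS from substitution counts and site counts."""
    if nonsyn_sites <= 0 or syn_sites <= 0:
        raise ValueError("site counts must be positive")
    dn = nonsyn_subs / nonsyn_sites  # non-synonymous changes per site
    ds = syn_subs / syn_sites        # synonymous changes per site
    if ds == 0:
        raise ValueError("dS is zero, so the ratio is undefined")
    return dn / ds


def interpret(ratio):
    """Give the usual reading of a dN/dS value."""
    if ratio > 1:
        return "diversifying selection"
    if ratio < 1:
        return "purifying selection"
    return "neutral"


# Hypothetical gene: 10 non-synonymous changes over 300 sites,
# 20 synonymous changes over 100 sites.
ratio = dn_ds(10, 300, 20, 100)
print(round(ratio, 3), interpret(ratio))
```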
Genes that evolve quickly (3)
• CodeML (part of the PAML package) will calculate
dN/dS for a set of orthologues from different (closely
related) species.
• Human vs Chimpanzee – rapidly evolving genes
involved in immunity, reproduction and olfaction
(smell).
• Genes with very low dN/dS (under purifying
selection) involved in metabolism, intracellular
signalling, nerve / brain function.