Whole-Genome Prokaryote
Phylogeny without
Sequence Alignment
Bailin HAO
and
Ji QI
T-Life Research Center, Fudan University
Shanghai 200433, China
Institute of Theoretical Physics, Academia Sinica
Beijing 100080, China
http://www.itp.ac.cn/~hao/
Classification of Prokaryotes:
A Long-Standing Problem
• Traditional taxonomy: too few features
• Morphology: spherical, helical, rod-shaped…
• Metabolism: photosynthesis, N-fixing, desulfurization…
• Gram staining: positive and negative
• SSU rRNA Tree (Carl Woese et al., 1977):
– 16S rRNA: ancient conserved sequences of about
1500 bases
– Discovery of the three domains of life: Archaea,
Bacteria and Eucarya
– Endosymbiont origin of mitochondria and
chloroplasts
The SSU rRNA Tree of Life:
Major progress in molecular phylogeny
of prokaryotes as evidenced by the
history of the
Bergey’s Manual
Bergey’s Manual Trust:
Bergey’s Manual
• 1st Ed. “Determinative Bacteriology”: 1923
• 8th Ed. “Determinative Bacteriology”: 1974
• 1st Ed. “Systematic Bacteriology”: 1984-1989, 4
volumes
• 9th Ed. “Determinative Bacteriology”: 1994
• 2nd Ed. “Systematic Bacteriology”: 2001-200?, 5
volumes planned; On-Line “Taxonomic Outline of
Procaryotes” by Garrity et al. (October 2003)
(26 phyla: A1-A2, B1-B24)
Our Final Result
• 132 organisms (16A + 110B + 6E)
• Input: genome data
• Output: phylogenetic tree
• No selection of genes, no alignment of sequences, no fine adjustment whatsoever
• See the tree first. Story follows.
Protein Tree for 145 Organisms
From 82 Genera
(K=5)
16 Archaea (11 genera, 16 species)
123 Bacteria (65 genera, 98 species)
6 Eukaryotes
Complete Bacterial Genomes Appeared
since 1995
Early Expectations:
• More support to the SSU rRNA Tree of Life
• Add details to the classification (branchings
and groupings)
• More hints on taxonomic revisions
Confusion brought by the
hyperthermophiles
– Aquifex aeolicus (Aquae), 1998: 1551335 bp
– Thermotoga maritima (Thema), 1999: 1860725 bp
– “Genome Data Shake tree of life”
Science 280 (1 May 1998) 672
– “Is it time to uproot the tree of life?”
Science 284 (21 May 1999) 130
– “Uprooting the tree of life”
W. Ford Doolittle, Scientific American (February 2000) 90
Debate on Lateral Gene Transfer
• Extreme estimate: 17% in E. coli
• Limitations of the above approach: B. Wang, J. Mol. Evol. 53 (2001) 244
• “Phase transition” and “crystallization” of species
(C. Woese 1998)
• Lateral transfer within smaller gene pools as an
innovative agent
• Composition vector may incorporate LGT within
small gene pools
Alignment-Based Molecular Phylogeny
• TCAGACGC
• TCGGAGT
Aligned:
TCAGACGC
TCGGA-GT
Requires a scoring scheme and a gap penalty
16S rRNA tree was based on sequence alignment
– Problem: sequence alignment cannot be
readily applied to complete genomes
– Homology -> alignment
– Different genome size, gene content and
gene order
[Figure: gene order A, B, C in the 1st species versus B’, A’, ? in the 2nd species]
Our Motivations:
• Develop a molecular phylogeny method that makes
use of complete genomes – no selection of
particular genes
• Avoid sequence alignment
• Try to reach higher resolution to provide an
independent comparison with other approaches
such as SSU rRNA trees
• Make comparison with bacteriologists’ systematics
as reflected in Bergey’s Manual (2001, 2002)
• Our paper accepted by J. Molecular Evolution
Other Whole-Genome Approaches
• Gene content
• Presence or absence of COGs
• Conserved Gene Pairs
• “Information” distances
• Domain order in proteins (Ken Nishikawa’s talk at InCoB2003)
• …
Comparison of Complete
Genomes/Proteomes
• Compositional vectors
Nucleotides: a, t, c, g
aatcgcgcttaagtc
Di-nucleotide (K=2) distribution:
{aa at ac ag ta tt tc tg ca ct cc cg ga gt gc gg}
{ 2, 1, 0, 1, 1, 1, 2, 0, 0, 1, 0, 2, 0, 1, 2, 0}
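The di-nucleotide count for the example sequence can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `kstring_counts` and the alphabet order "atcg" (chosen to match the order of the strings listed on the slide) are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

def kstring_counts(seq, k, alphabet="atcg"):
    """Count overlapping K-strings; one entry per possible K-string."""
    # Slide a window of width k along the sequence (L - k + 1 windows).
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fixed ordering over all len(alphabet)**k strings, zeros included.
    return {"".join(p): counts["".join(p)] for p in product(alphabet, repeat=k)}

vec = kstring_counts("aatcgcgcttaagtc", 2)
for name, value in vec.items():
    print(name, value)
```

A sequence of length 15 contains 14 overlapping di-nucleotides, so the 16 components must sum to 14.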
K-strings make a composition vector
• DNA sequence → vector of dimension 4^K
• Protein sequence → vector of dimension 20^K
• Given a genomic or protein sequence → a unique composition vector
• The converse: a vector → one or more sequences?
• K big enough → uniqueness
• Connection with the number of Eulerian loops in a
graph (a separate study available as a preprint at
ArXiv:physics/0103028 and from Hao’s webpage)
A Key Improvement:
Subtraction of Random Background
• Mutations took place randomly at molecular
level
• Selection shaped the direction of evolution
• Many neutral mutations remain as random
background
• At single amino acid level protein sequences are
quite close to random
• Highlighting the role of selection by subtracting
a random background
Frequency and Probability
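• A sequence of length $L$
• A K-string $\alpha_1 \alpha_2 \cdots \alpha_K$
• Frequency of appearance: $f(\alpha_1 \alpha_2 \cdots \alpha_K)$
• Probability: $p(\alpha_1 \alpha_2 \cdots \alpha_K) = \dfrac{f(\alpha_1 \alpha_2 \cdots \alpha_K)}{L - K + 1}$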
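Predicting #(K-strings) from the counts of
(K-1)- and (K-2)-strings
Joint probability vs. conditional probability:
$p(\alpha_1 \alpha_2 \cdots \alpha_K) = p(\alpha_K \mid \alpha_1 \alpha_2 \cdots \alpha_{K-1})\, p(\alpha_1 \alpha_2 \cdots \alpha_{K-1})$
Making the weakest Markov assumption:
$p(\alpha_1 \alpha_2 \cdots \alpha_K) \approx p(\alpha_K \mid \alpha_2 \cdots \alpha_{K-1})\, p(\alpha_1 \alpha_2 \cdots \alpha_{K-1})$
Another joint probability:
$p(\alpha_2 \cdots \alpha_{K-1} \alpha_K) = p(\alpha_K \mid \alpha_2 \cdots \alpha_{K-1})\, p(\alpha_2 \cdots \alpha_{K-1})$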
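(K-2)-th Order Markov Model
$p^0(\alpha_1 \alpha_2 \cdots \alpha_{K-1} \alpha_K) = \dfrac{p(\alpha_1 \alpha_2 \cdots \alpha_{K-1})\, p(\alpha_2 \cdots \alpha_{K-1} \alpha_K)}{p(\alpha_2 \cdots \alpha_{K-1})}$
Change to frequencies:
$f^0(\alpha_1 \cdots \alpha_K) = \dfrac{f(\alpha_1 \cdots \alpha_{K-1})\, f(\alpha_2 \cdots \alpha_K)}{f(\alpha_2 \cdots \alpha_{K-1})} \cdot \dfrac{(L-K+1)(L-K+3)}{(L-K+2)^2}$
Normalization factor may be ignored when L>>K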
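The frequency form of the Markov prediction can be sketched in Python as follows. The helper names `count_strings` and `predicted_count` are hypothetical, and returning 0 when the middle (K-2)-string never occurs is an assumption of this sketch, not something stated on the slide.

```python
from collections import Counter

def count_strings(seq, k):
    """Counts of all overlapping strings of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def predicted_count(seq, kstring):
    """Predicted count of one K-string (K >= 3) from its (K-1)- and (K-2)-substrings."""
    k, L = len(kstring), len(seq)
    f_head = count_strings(seq, k - 1)[kstring[:-1]]   # first K-1 letters
    f_tail = count_strings(seq, k - 1)[kstring[1:]]    # last K-1 letters
    f_mid = count_strings(seq, k - 2)[kstring[1:-1]]   # middle K-2 letters
    if f_mid == 0:
        return 0.0  # assumption: no prediction possible, treat as zero
    # Normalization factor from converting probabilities to counts.
    norm = (L - k + 1) * (L - k + 3) / (L - k + 2) ** 2
    return f_head * f_tail / f_mid * norm

print(predicted_count("aatcgcgcttaagtc", "cgc"))
```

For a genome-scale sequence the factor `norm` is very close to 1, which is why it may be dropped when L>>K.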
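Construct compositional vectors using these modified string counts:
For the i-th string type of species A we use
$\dfrac{a_i - a_i^0}{a_i^0} \Rightarrow a_i$
where $a_i$ on the left is the observed count and $a_i^0$ its Markov prediction.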
Composition Distance
• Define correlation between two compositional
vectors by the cosine of the angle
– From two complete proteomes:
A: {a_1, a_2, …, a_n}
B: {b_1, b_2, …, b_n}
n = 20^5 = 3 200 000
$C(A,B) = \dfrac{\sum_i a_i b_i}{\left( \sum_j a_j^2 \times \sum_j b_j^2 \right)^{1/2}}$
$C(A,B) \in [-1, 1]$
• Distance
– $D(A,B) = \dfrac{1 - C(A,B)}{2}$, $D(A,B) \in [0, 1]$
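The correlation and distance between two composition vectors can be sketched as below. Storing each vector as a Python dict keyed by K-string is an assumption made here for sparsity (most of the 20^5 components are zero); the names `correlation` and `distance` are illustrative.

```python
import math

def correlation(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    # Only strings present in both vectors contribute to the dot product.
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm  # raises ZeroDivisionError if either vector is all zeros

def distance(a, b):
    """Map the correlation from [-1, 1] onto a distance in [0, 1]."""
    return (1.0 - correlation(a, b)) / 2.0

A = {"GKSTL": 0.646, "HAMSC": 197.0}
B = {"GKSTL": 0.5, "HAMSC": -1.0}
print(correlation(A, B), distance(A, B))
```

Identical vectors give distance 0, orthogonal vectors 0.5, and exactly opposite vectors 1. The resulting pairwise distances are what a distance-matrix method such as Neighbor-Joining takes as input.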
Materials: Genomes from NCBI
(ftp.ncbi.nih.gov/genomes/Bacteria/)
Not the original GenBank files
            Phyla  Classes  Orders  Families  Genera  Species  Strains
Archaea         2        8      11        11      11       16       16
Bacteria       13       18      37        46      58       88      110
Total          15       26      48        57      69      104      126
6 Eucaryote genomes were included for reference
Tree construction: Neighbor-Joining in Phylip
Protein Tree for 132 species
(K=5)
16 Archaea (11 genera, 16 species)
110 Bacteria (57 genera, 88 species)
6 Eukaryotes
Protein Tree for 132 species
K=6
16 Archaea (11 genera, 16 species)
110 Bacteria (57 genera, 88 species)
6 Eukaryotes
Protein Class vs. Whole Proteome
• Trees based on the collection of ribosomal proteins
(SSU + LSU): ribosomal proteins are interwoven
with rRNA to form a functioning complex; results
consistent with SSU rRNA trees
• Trees based on the collection of aminoacyl-tRNA
synthetases (AARS): trees based on a single
AARS were not good; trees based on all 20
AARSs were much better, but not as good as those
based on rProteins.
Genus Tree based on Ribosomal Proteins
A Genus Tree based on Aminoacyl-tRNA Synthetases
Chloroplast Tree
• Sequences of about 100 000 bp
• Tree of the endosymbiont partners
• Paper accepted by Molecular Biology and
Evolution on 12 August 2003
[Figure: chloroplast tree]
Coronaviruses including
Human SARS-CoV
• Sequences of tens of kilobases
• SARS sequence: about 29730 bases
• Paper published in Chinese Science Bulletin
on 26 June 2003
[Figure: coronavirus tree]
Understanding the Subtraction Procedure:
Analysis of Extreme Cases in E. coli
• There are 1 343 887 5-strings belonging to
841832 different types.
• Maximal count before subtraction: 58 for the
5-peptide GKSTL. 58 reduces to 0.646 after
subtraction.
• Maximal component after subtraction: 197 for the
5-peptide HAMSC. The number 197 came from a
single count 1 before the subtraction.
GKSTL: how 58 reduces to 0.646?
• #(GKST)=113
• #(KSTL)=77
• #(KST)=247
• Markov prediction: 113*77/247=35.23
• Final result: (58-35.23)/35.23=0.646
HAMSC: how 1 grows to 197?
• #(HAMS)=1
• #(AMSC)=1
• #(AMS)=198
• Markov prediction: 1*1/198=1/198
• Final result: (1-1/198)/(1/198)=197
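Both extreme cases can be checked with a short calculation. The sketch below uses only the counts quoted on the slides; the function name `subtracted_component` is hypothetical, and the normalization factor is dropped on the assumption that L>>K.

```python
def subtracted_component(f, f_head, f_tail, f_mid):
    """(f - f0) / f0, with f0 the Markov prediction f_head * f_tail / f_mid."""
    f0 = f_head * f_tail / f_mid  # normalization factor ignored (L >> K)
    return (f - f0) / f0

# GKSTL: #(GKSTL)=58, #(GKST)=113, #(KSTL)=77, #(KST)=247
gkstl = subtracted_component(58, 113, 77, 247)
# HAMSC: #(HAMSC)=1, #(HAMS)=1, #(AMSC)=1, #(AMS)=198
hamsc = subtracted_component(1, 1, 1, 198)

print(round(gkstl, 3), round(hamsc, 3))
```

A frequent string built from frequent substrings is pushed towards zero, while a rare string whose middle part is common but whose two overlapping (K-1)-strings are rare is strongly amplified.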
6121 Exact Matches of GKSTL
In PIR Rel.1.26 with >1.2 Mil Proteins
• These 6121 matches came from a diverse
taxonomic assortment, from viruses to bacteria to
fungi to plants and animals, including humans
• In the parlance of classic cladistics GKSTL
contributes to plesiomorphic characters that
should be eliminated in a strict phylogeny
• The subtraction procedure did the job.
15 Exact Matches of HAMSC:
In PIR Rel.1.26 with >1.2 Mil Proteins
• 1 match from Eukaryotic protein
• 4 matches (the same protein) from virus
• 10 matches from prokaryotes, among which
3 from Shigella and E. coli (HAMSCAPDKE)
3 from Salmonella (HAMSCAPERD)
HAMSC is characteristic for prokaryotes
HAMSCA is specific for enterobacteria
Stable Topology of the Tree
• K=1: makes some sense!
• K=2,3,4: topology gradually converges
• K=5 and K=6: present calculation
• K=7 and more: too high resolution; star tree or bush expected
Statistical Test of the Tree
• Bootstrap versus jackknife
• Bootstrap in sequence alignments
• “Bootstrap” by random selections
from the AA-sequence pool
• A time-consuming job
• 180 bootstraps for 72 species
• About 70% of the genes of every species were selected in one bootstrap
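One resampling step of this “bootstrap” might look like the sketch below. The slides do not say how the selection was drawn, so sampling without replacement, the fixed fraction 0.7, and the name `resample_proteome` are all assumptions made for illustration.

```python
import random

def resample_proteome(proteins, fraction=0.7, rng=random):
    """Pick about `fraction` of the protein sequences of one species."""
    size = max(1, round(fraction * len(proteins)))
    # Assumption: each protein is chosen at most once per replicate.
    return rng.sample(proteins, size)

rng = random.Random(2003)  # fixed seed so the replicate is reproducible
proteome = ["MKTAYIAK", "MSGKSTLL", "MHAMSCAP", "MLLPQRST", "MVNDEFGH",
            "MAAAKKLL", "MPPQRSTV", "MGGHHIIK", "MTTWWYYC", "MEEDDNNQ"]
replicate = resample_proteome(proteome, rng=rng)
print(len(replicate), "of", len(proteome), "proteins selected")
```

Each replicate would then go through the same steps as the full data: composition vectors, distance matrix, tree. Repeating this for every species and every replicate is what makes the test time-consuming.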
“K-string Picture” of Evolution
• K=5 ->3 200 000 points in space of
5-strings
• K=6 ->64 000 000 points
• In the primordial soup: short polypeptides
of a limited assortment
• Evolution by growth, fusion, mutation leads
to diffusion in the string space
• String space not saturated yet
The Problem of Higher Taxa
• 1974: Bacteria as a separate kingdom
• 1994: Archaea and Bacteria as two domains
• The relation of higher taxa?
Summary
• Composition vectors do not depend on genome size
and gene content, so the use of whole-genome data is
straightforward
• Data independent of 16S rRNA
• Method different from that based on SSU rRNA
• Results agree with SSU rRNA trees and the Bergey’s
Manual
• Hint on groupings of higher taxa
• A method without “free parameters”: data in, tree out
• Possibility of an automatic and objective classification
tool for prokaryotes
Conclusion:
The Tree of Life is saved!
There is phylogenetic information in the
prokaryotic proteomes.
Time to work on molecular definition of taxa.
Thank you!
A Failed Attempt Using
Avoidance Signatures
Comparison with the
Bergey’s Manual
• Tree Construction
– phylip package of J. Felsenstein (Neighbor-Joining)
– The Fitch method is not feasible here: the number of unrooted trees is $N_n = \dfrac{(2n-5)!}{2^{n-3}\,(n-3)!}$, $N_{72} \approx 10^{120}$
– Non-distance-matrix methods (MP, ML, etc.)
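The size of the tree space quoted above can be checked with exact integer arithmetic. This is a small sketch; the function name `unrooted_tree_count` is hypothetical, and the formula is the standard count of unrooted bifurcating trees on n labelled leaves.

```python
from math import factorial

def unrooted_tree_count(n):
    """Number of unrooted bifurcating trees on n labelled leaves (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    # (2n-5)! / (2^(n-3) * (n-3)!), always an exact integer.
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

n72 = unrooted_tree_count(72)
print(len(str(n72)), "decimal digits")
```

With a number of this magnitude, any method that has to search over tree topologies is out of reach for 72 species, which is the argument for a constructive method such as Neighbor-Joining.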
• Material
            Phyla  Classes  Orders  Families  Genera  Species  Strains
Archaea         2        7       9         9       9       13       13
Bacteria        9       14      23        28      37       46       57
Total          11       21      32        37      46       59       70
– ftp://ncbi.nlm.nih.gov/genomes/Bacteria/
Early expectation from
genome data
• Was there intensive lateral gene transfer?
• Gene tree cannot be equated to the real tree
of life
• Genome data: 10^6 to 10^7
• Difficult to align whole genome data
• Prokaryote and Eukaryote
• Three Kingdoms (Carl Woese, 16S rRNA)
– Archaea
– Eubacteria
– Eukarya
• Five Kingdoms (Lynn Margulis)
– Bacteria (Archaea, Eubacteria)
– Protoctista
– Animalia
– Fungi
– Plantae
• Common features of Archaea and Eubacteria:
small cells, no nuclear membrane, circular DNA,
no cap at the 5’ end of mRNA, presence of S-D
(Shine-Dalgarno) sequences
• Many proteins associated with replication,
transcription, and translation are common to
Archaea and Eukaryotes
• Features of Archaea: lack of some enzymes,
insensitivity to some antibiotics
“Compositional Representation of Protein
Sequences and the Number of Eulerian Loops”
by Bailin Hao, Huimin Xie, Shuyu Zhang
K=5: 76.7% of proteins have unique reconstruction
K=6: 94.0%
K=10: >99%
Checked 2820 AA-seqs from pdb.seq, a special
selection of SWISS-PROT
See Los Alamos National Lab E-Archive:
physics/0103028
Subtraction of Random Background
• Using a (K-2)-order Markov Model
$P(a_1, a_2, \ldots, a_K) \approx \dfrac{P(a_1, a_2, \ldots, a_{K-1})\, P(a_2, a_3, \ldots, a_K)}{P(a_2, a_3, \ldots, a_{K-1})}$
• K=2: genomic signature by Karlin and Burge
• May be justified by using Maximal Entropy
Principle with appropriate constraints (Hu &
Wang, 2001)
What to do next
• Detailed comparison with traditional
taxonomy
• Add more eukaryotes
• Elucidation of the foundation and
limitations of the compositional approach
• Software and web interface
• Problem of lateral gene transfer
• Viruses?
• Confusion brought by the hyperthermophiles
– Aquifex aeolicus (Aqua), 1998: 1551335 bp
– Thermotoga maritima (Tmar), 1999: 1860725 bp
– “Genome Data Shake tree of life”
Science 280 (1 May 1998) 672
– “Is it time to uproot the tree of life?”
Science 284 (21 May 1999) 130
– “Uprooting the tree of life”
Sci. Amer. (February 2000) 90
• Problem of Lateral Gene Transfer (LGT): tree
or network
• Problem of higher taxa