GENOME EVOLUTION AND GENE
DUPLICATIONS IN EUKARYOTES
Shin-Han Shiu
Plant Biology / QBMI
Michigan State University
Genomes and gene contents
[Figure: gene counts of six genomes: 17,000; 45,000; 6,000; 10,000; 30,000; 25,000]
Duplicate genes in the genome

Arabidopsis gene families*
*: Families are clusters from Markov clustering, using all-against-all BLAST E values as distance measures
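The footnote names Markov clustering as the way families were defined. The sketch below is a minimal pure-Python illustration of the two alternating steps of that algorithm (expansion and inflation); it is not the program used for the slides. The function names `mcl` and `normalize`, the inflation value, and the toy similarity matrix are assumptions for illustration. In practice the edge weights would be derived from BLAST E values (for example as -log10 E, which is also an assumption here).

```python
def normalize(m):
    """Scale each column of a square matrix so that it sums to 1."""
    n = len(m)
    col = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / col[j] for j in range(n)] for i in range(n)]


def mcl(sim, inflation=2.0, iterations=50, tol=1e-9):
    """Toy Markov clustering on a symmetric matrix of non-negative similarities.

    Returns a sorted list of clusters, each a sorted list of node indices.
    """
    n = len(sim)
    # Add self-loops so that every column has a non-zero sum.
    m = normalize([[sim[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                   for i in range(n)])
    for _ in range(iterations):
        # Expansion: matrix squaring spreads flow along longer paths.
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: element-wise power, then re-normalize; strengthens strong flows.
        inflated = normalize([[x ** inflation for x in row] for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Rows with a non-zero diagonal are attractors; their non-zero columns form a cluster.
    clusters = set()
    for i in range(n):
        if m[i][i] > 1e-6:
            clusters.add(frozenset(j for j in range(n) if m[i][j] > 1e-6))
    return sorted(sorted(c) for c in clusters)


# Hypothetical example: genes 0-2 hit each other, genes 3-5 hit each other.
toy = [[0, 1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 1],
       [0, 0, 0, 1, 0, 1],
       [0, 0, 0, 1, 1, 0]]
families = mcl(toy)  # [[0, 1, 2], [3, 4, 5]]
```

A higher inflation value tends to give smaller, tighter clusters, so family counts depend on that setting.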
Gene function and duplication

What’s the consequence?
Focus I: Duplication Mechanism and Loss Rate
Gene duplications: mechanisms, preferential retention, consequences
Duplication mechanisms

Whole genome duplication
Tandem duplication
Segmental duplication
Replicative transposition
Lineage-specific gains in plants and animals


Substantially more recent duplicates in plants than in animals
Mostly due to frequent whole genome duplications in plants
Organism      Lineage-specific gains   Normalized gain*   # of genes in families analyzed   % total
Rice          10115                    6743               28467                             35.5 (23.7)**
Arabidopsis   5984                     3990               21936                             27.3 (18.2)**
Human         811                      811                21954                             3.7
Mouse         1265                     1265               24041                             5.3
*: The gain counts are normalized against the ratio between the Arabidopsis-rice and human-mouse divergence times (150 and 100 Mya, respectively).
**: Numbers in parentheses refer to percentage total based on normalized gains.
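The normalization in the first footnote can be reproduced with a few lines of arithmetic. The sketch below assumes the normalized gain is the raw gain divided by the ratio of divergence times (150 / 100 = 1.5 for the plant pair); the function names are hypothetical, and small rounding differences from the slide values are possible.

```python
def normalized_gain(gain, divergence_mya, reference_mya=100):
    """Scale a lineage-specific gain count to the reference divergence time."""
    return gain / (divergence_mya / reference_mya)


def percent_of_total(gain, genes_analyzed):
    """Gain as a percentage of the genes in the families analyzed."""
    return 100.0 * gain / genes_analyzed


# Rice row of the table: 10115 gains, 28467 genes analyzed, 150 Mya divergence.
rice_norm = round(normalized_gain(10115, 150))            # 6743
rice_pct = round(percent_of_total(10115, 28467), 1)       # 35.5
rice_pct_norm = round(percent_of_total(rice_norm, 28467), 1)  # 23.7
```

Even after normalization, the rice percentage (23.7) remains several times the human value (3.7), which is the point of the comparison.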
Gain vs. Loss


3 rounds of whole-genome duplication in the Arabidopsis lineage
~82% of duplicates from the last round were lost in the past 40 million years
[Figure, Arabidopsis: 15,000* -> 30,000 -> 60,000 -> 120,000]
Genome duplications + tandem duplications – gene losses = gene content: 21,000**
*: Number of orthologous groups in shared families between Arabidopsis and rice.
**: Number of genes in shared families.
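The gap between the doubling series and the observed gene content can be made explicit. This is a rough sketch that assumes each whole-genome duplication exactly doubles the gene count and that tandem duplications are ignored, so the retained fraction is only an order-of-magnitude illustration. The function names are hypothetical.

```python
def expected_after_wgd(ancestral_genes, rounds):
    """Gene count if every gene survived each whole-genome duplication."""
    return ancestral_genes * 2 ** rounds


def retained_fraction(observed_genes, expected_genes):
    """Share of the no-loss expectation that is actually observed."""
    return observed_genes / expected_genes


expected = expected_after_wgd(15_000, 3)        # 120000
fraction = retained_fraction(21_000, expected)  # 0.175
```

Under these assumptions, the 21,000 genes in shared families are about 17.5% of what three loss-free doublings of 15,000 ancestral groups would produce.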
“Age” distribution of animal duplicates


Steady decay in the number of duplicates
Frequent tandem duplication (TD), segmental duplication (SD), and replicative transposition (RT)
Ks: rate of nucleotide substitutions in codon sites that do not affect amino acid identity
Shiu et al., 2006
Plant duplicate “age” distribution


Apparent peak at ~0.18 instead of zero Ks
Frequent whole genome duplication (WGD), TD, SD (maybe), and RT (in some plants)
Shiu et al., 2004
Genome remodeling in polyploids

Natural and synthetic polyploids
[Figure: natural and synthetic polyploids. Genome sizes shown: ~314 Mb, ~257 Mb, ~348 Mb, ~203 Mb; time label: 20,000 yr]
Experimental approaches

Genome-wide polymorphism monitored by tiling array
[Figure: tiling array design. Probes tiled along the genome, with gap and resolution marked; array of ~6 million features; time label: 20,000 yr]
Genome-wide Single Feature Polymorphism

Mid-parent (MP) vs. Arabidopsis suecica (As)
Polyploid    SFP
Natural      58,517
Synthetic    503
Genome-wide Single Feature Polymorphism

Genome-wide polymorphism monitored by tiling array
Gene
Pseudogene
Transposon
Genome-wide Single Feature Polymorphism

Duplication or deletion
MP duplication or
As deletion
Genome Survey Sequencing

Sequence ~40-60 Mb of the Arabidopsis suecica genome
  - 0.15-0.2X coverage, will be done next week!

Ultra-high throughput sequencer (GS20) funded by the Strategic Partnership Grant
  - Ultra-high throughput
      - 20-30 Mb per run, each run 5 hours
      - Will be 100 Mb per run early 2007
  - Cost efficient
      - ~$0.3/kb
  - Read length rather limited
      - ~100 bp per read now
      - Will be ~200 bp early 2007
  - For more information contact:
      - Andreas Weber (aweber@msu.edu)
      - David DeWitt (dewittd@msu.edu)
      - Or Shin-Han Shiu (shius@msu.edu)
  - Seminar on instrumentation:
      - 9/29, Friday, 1pm, 1415 BPS
Summary: Gene duplication and polyploidy

Gene duplication has occurred frequently in eukaryotes, but most duplicates are lost.

In plants, whole genome duplication is common, but gene loss has also occurred frequently.

After 4 generations, only a very small number of SFPs are identified in synthetic polyploids.

After 20,000 generations, most coding genes do not have clustered sequence polymorphisms indicative of deletion.

Clustered polymorphisms are mostly located in pseudogenes and transposons.

Survey sequencing is necessary to determine if some coding genes have become pseudogenes without being deleted.
Focus II: Differential Retention of Duplicates
Large gene families in plants

One of the largest gene families
Normalized gain: % expanded orthologous groups (OGs)

Large family size does not necessarily indicate a higher expansion rate
Ancestral family sizes and gene gains

Large ancestral families tend to have more lineage-specific gains, but with many exceptions
Differential expansion of functional categories

GO: Gene Ontology






Protein ubiquitination
Polysaccharide biosynthesis
Cell wall modification
Transcriptional regulation
Biotic stress response
Secondary metabolism
Differences in Duplicability

Duplicability
  - The propensity for the retention of a duplicate gene
  - Computational analysis of genome-wide trend
[Figure: duplicability by category in Arabidopsis and human. Categories: Defense response, Proteolysis, Transport, Ion channel activity, Metabolism, Development, Protein kinase activity, Transcription factor activity]
Kinase superfamily sizes among eukaryotes
Organism                      Number of genes   Kinase superfamily   Percent of total genes
Arabidopsis thaliana          25,814            1041                 4.0
Oryza sativa subsp. indica    ~35,000           1607                 3.6
Chlamydomonas reinhardtii     ~12,200           414                  3.4
Plasmodium falciparum         5,334             94                   1.8
Plasmodium yoelii             7,681             70                   0.9
Caenorhabditis elegans        19,484            417                  2.1
Drosophila melanogaster       13,808            262                  1.9
Anopheles gambiae             15,088            216                  1.4
Ciona intestinalis            15,852            316                  2.0
Fugu rubripes                 33,609            632                  1.9
Mus musculus                  22,444            495                  2.2
Homo sapiens                  22,980            472                  2.1
Saccharomyces cerevisiae      6449              113                  1.8
Candida albicans              6,164             95                   1.5
Neurospora crassa             10082             104                  1.9
Schizosaccharomyces pombe     4945              109                  2.2
Shiu & Bleecker, 2003
Kinase families in rice and Arabidopsis

Gene count differences among families indicate differential expansion
Shiu et al., 2004
Estimation of ancestral RLK family size

Kinase phylogeny of Arabidopsis and rice RLKs
440 speciation points
[Figure panels: A. WAK; B. LRR VIII, X, XII; lineages labeled rice and Arabidopsis]
Shiu et al., 2004
Development vs. resistance/defense RLKs
Shiu et al., 2004
Contradiction

Plant genes involved in development tend to have high duplicability
Gene group                  Duplicability
Resistance/Defense RLKs     High
Developmental RLKs          Low
Animal tyrosine kinases     Low
Transcription factors       High
Selection for expansion

Depends on the level of variation in the signals
Summary: differential retention

Longevity and duplicability of plant genes
Duplicability   Longevity   Examples
High            High        Transcription factors
High            Low         Resistance genes
Low             High        Enzymes in central metabolic pathways
Low             Low         ??
Focus III: Functional Consequences
Functional Consequences of Duplication

Functional divergence and conservation
  - Is it because of changes in cis-regulatory elements or coding sequences?
  - How are duplicates retained: subfunctionalization or neofunctionalization?
Divergence in gene expression

Develop pipelines for cis-element prediction and
[Pipeline diagram boxes: Expression data; Clusters of genes with similar expression profiles; Over-represented sequence motifs in 5’ regions; Motif functional prediction; Cis-regulatory logic; Machine learning; Experimental validations]
Divergence in post-translational modification

Conservation of phosphorylation sites across species






SACE: Saccharomyces cerevisiae (budding yeast)
CAGL: Candida glabrata
CAAL: Candida albicans
CATR: Candida tropicalis
NECR: Neurospora crassa
DEHA: Debaryomyces hansenii
Detailed Functional Studies of Duplicate Genes

Functional analyses of DDF1 and DDF2 transcription factors
  - Derived from recent whole genome duplication in Arabidopsis
  - Related to the well-known CBF factors involved in cold and drought stress
[Diagram: studies of the DDFs in both Arabidopsis thaliana and Arabidopsis lyrata: promoter GFP, knockouts, binding targets, overexpression studies, interacting proteins]
Focus IV: Protein space
Tiling array analysis of transcriptome
  - Human Chr 21, 22
Kapranov et al., 2002
Posterior probability p(F|coding)
Performance of the CI measure

Known Arabidopsis exons and introns, 90-300 bp

Arabidopsis small proteins that are not annotated
  - Correctly predicted 19 out of 20 (95%)

Yeast sORFs with translation evidence
  - Correctly predicted 98 out of 114 (86%)

In “intergenic” sequences of the Arabidopsis genome
  - 3,274 sORFs identified
Coupling with tiling array expression

Hybridization intensities for feature types
Summary: Novel coding genes

Many unannotated regions in the genomes are expressed.

Using the CI measure, many proteins from yeast and Arabidopsis that were not annotated but have evidence of expression are identified correctly.

Using the CI measure, we estimated that ~3000 novel coding regions are present in the unannotated regions of the Arabidopsis thaliana genome.

Using tiling array data, we found that many of these novel coding regions are expressed.
Acknowledgement

Lab members
Kousuke Hanada
Melissa Lehti-Shiu
Cheng Zou
Emily Eckenrode

University of Chicago
  - Justin Borevitz
  - Xu Zhang

University of Wisconsin
  - Sara Patterson
  - Rick Vierstra

University of Missouri
  - Scott Peck

Michigan State University
  - Many…
  - Rong Jin, Comp Sci & Eng
  - Yue-Hua Cui, Stat & Prob
  - Startup fund
Recent completion …
Genome remodeling in polyploids


Genome duplications occur frequently in plants
What is the fate of duplicates?
  - How fast do gene losses occur?
  - Is there any preference in genes retained?
[Figure: gene number (Ng) through duplication and loss. Genes A-E (Ng = 5) duplicate into A1/A2 through E1/E2 (Ng = 10); at t1, Ng = 8; at t2, Ng = 5]
Comparing degrees of expansion
Arabidopsis: ~25,000 proteins
Rice prediction: ~66,000 genes
[Workflow diagram: combined set -> gene/domain families (unique or shared; e.g. GO:0001) -> pairwise distance -> putative orthologous groups (example: u_i = 1, e_i = 4) -> all orthologous groups]
Total unexpanded = Σ u_i
Total expanded = Σ e_i
Major questions on gene duplication

When: timing of gene duplications, e.g. N = 10
Domain gains in rice and Arabidopsis

Gain in one lineage does not necessarily predict gain in the other
Identify novel small coding genes

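Determine base composition probabilities
Coding sequences -> CDS parameters
Non-coding sequences -> NCDS parameters
$$P_c(\mathrm{AAA}) = \frac{\#\ \text{of AAA}}{\#\ \text{of all NNN}}, \qquad P_c(\mathrm{T} \mid \mathrm{AAA}) = \frac{P_c(\mathrm{AAAT})}{P_c(\mathrm{AAA})}$$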
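The two parameter definitions above amount to counting trinucleotides and the base that follows each of them. The sketch below is one way to do that counting, not the lab's pipeline. The function name `estimate_tables` and the `phase` argument are assumptions; the conditional probability is computed as count(AAAT) divided by the count of AAA followed by any base, which matches the ratio P(AAAT) / P(AAA) up to sequence-end effects.

```python
from collections import Counter


def estimate_tables(seqs, phase=None):
    """Estimate trinucleotide and next-base probabilities from training sequences.

    tri_prob[w]     = count(w) / count(all trinucleotides)
    cond_prob[w][b] = count(w + b) / count(w followed by any base)
    If phase is 0, 1 or 2, only windows starting at i with i % 3 == phase are used
    (a hypothetical way to build position-specific tables).
    """
    tri, tetra = Counter(), Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - 2):
            if phase is not None and i % 3 != phase:
                continue
            tri[s[i:i + 3]] += 1
            if i + 3 < len(s):  # a following base exists
                tetra[s[i:i + 4]] += 1
    total = sum(tri.values())
    if total == 0:
        raise ValueError("need at least one trinucleotide in the training set")
    tri_prob = {w: c / total for w, c in tri.items()}
    followed = Counter()
    for w4, c in tetra.items():
        followed[w4[:3]] += c
    cond_prob = {}
    for w4, c in tetra.items():
        cond_prob.setdefault(w4[:3], {})[w4[3]] = c / followed[w4[:3]]
    return tri_prob, cond_prob


# Toy training set (hypothetical): AAA occurs twice, once before T and once before C.
tri_prob, cond_prob = estimate_tables(["AAATAAAC"])
```

Running the same function on coding and on non-coding training sequences would give the CDS and NCDS parameter sets, respectively.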
Feature tables: c1, c2, c3, c4, c5, c6 (coding) and n (non-coding)
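Calculate posterior probability
$$P(\mathrm{CDS} \mid S) = \frac{P(S \mid \mathrm{CDS})\,P(\mathrm{CDS})}{P(S \mid \mathrm{CDS})\,P(\mathrm{CDS}) + P(S \mid \mathrm{NCDS})\,P(\mathrm{NCDS})}$$
Setting up the Bayes’

Priors
$$P(\mathrm{CDS}) = P(\mathrm{NCDS}) = \frac{1}{2}$$
$$P(\mathrm{CDS}_1) = P(\mathrm{CDS}_2) = \dots = P(\mathrm{CDS}_6) = \frac{1}{2} \cdot \frac{1}{6}$$
$$P(S \mid \mathrm{CDS})\,P(\mathrm{CDS}) = \sum_{m=1}^{6} P(S \mid \mathrm{CDS}_m)\,P(\mathrm{CDS}_m)$$

S = ATG TTC TAC TTT G…
$$\begin{aligned}
P(S \mid \mathrm{CDS}_1) &= P_{c1}(\mathrm{ATG})\,P_{c1}(\mathrm{T} \mid \mathrm{ATG})\,P_{c2}(\mathrm{T} \mid \mathrm{TGT})\,P_{c3}(\mathrm{C} \mid \mathrm{GTT})\,P_{c1}(\mathrm{T} \mid \mathrm{TTC}) \cdots \\
P(S \mid \mathrm{CDS}_2) &= P_{c2}(\mathrm{ATG})\,P_{c2}(\mathrm{T} \mid \mathrm{ATG})\,P_{c3}(\mathrm{T} \mid \mathrm{TGT})\,P_{c1}(\mathrm{C} \mid \mathrm{GTT})\,P_{c2}(\mathrm{T} \mid \mathrm{TTC}) \cdots \\
&\;\;\vdots \\
P(S \mid \mathrm{CDS}_6) &= P_{c6}(\mathrm{ATG})\,P_{c6}(\mathrm{T} \mid \mathrm{ATG})\,P_{c4}(\mathrm{T} \mid \mathrm{TGT})\,P_{c5}(\mathrm{C} \mid \mathrm{GTT})\,P_{c6}(\mathrm{T} \mid \mathrm{TTC}) \cdots \\
P(S \mid \mathrm{NCDS}) &= P_{n}(\mathrm{ATG})\,P_{n}(\mathrm{T} \mid \mathrm{ATG})\,P_{n}(\mathrm{T} \mid \mathrm{TGT})\,P_{n}(\mathrm{C} \mid \mathrm{GTT})\,P_{n}(\mathrm{T} \mid \mathrm{TTC}) \cdots
\end{aligned}$$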
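The likelihood products and the Bayes formula can be sketched as a small third-order Markov chain scorer. In this sketch the tables c1-c3 and c4-c6 are each cycled position by position, as in the products shown, and the priors are 1/2 for non-coding and 1/12 for each of the six coding models. The names `toy_table`, `log_lik`, and `posterior_cds`, and all table values, are hypothetical placeholders rather than trained parameters.

```python
import math
from itertools import product

BASES = "ACGT"
TRIS = ["".join(p) for p in product(BASES, repeat=3)]


def toy_table(favored=None, p=0.25):
    """Hypothetical table: uniform start trinucleotides; next base favors `favored`."""
    tri = {w: 1 / 64 for w in TRIS}
    if favored is None:
        dist = {b: 0.25 for b in BASES}
    else:
        dist = {b: (p if b == favored else (1 - p) / 3) for b in BASES}
    return tri, {w: dict(dist) for w in TRIS}


def log_lik(seq, tables, first):
    """log P(S | model); `tables` is cycled per position, starting at index `first`.

    seq must be upper-case A/C/G/T and at least 3 bases long.
    """
    k = len(tables)
    tri, _ = tables[first % k]
    ll = math.log(tri[seq[:3]])
    for j in range(3, len(seq)):
        _, cond = tables[(first + j - 3) % k]
        ll += math.log(cond[seq[j - 3:j]][seq[j]])
    return ll


def posterior_cds(seq, fwd, rev, noncoding):
    """P(CDS | S) with P(CDS) = P(NCDS) = 1/2 and P(CDS_m) = 1/12 for m = 1..6.

    fwd = [c1, c2, c3], rev = [c4, c5, c6], noncoding = n.
    """
    terms = [math.log(1 / 12) + log_lik(seq, fwd, m) for m in range(3)]
    terms += [math.log(1 / 12) + log_lik(seq, rev, m) for m in range(3)]
    nc = math.log(1 / 2) + log_lik(seq, [noncoding], 0)
    top = max(terms + [nc])  # subtract the maximum for numerical stability
    num = sum(math.exp(t - top) for t in terms)
    return num / (num + math.exp(nc - top))


uniform = toy_table()
g_rich = toy_table("G", 0.7)
# Identical coding and non-coding models cannot discriminate: posterior stays at the prior.
p_flat = posterior_cds("ATGTTCTACTTTG", [uniform] * 3, [uniform] * 3, uniform)
# A sequence that fits the (toy) coding tables much better gets a posterior near 1.
p_high = posterior_cds("ATGGGGGGGG", [g_rich] * 3, [g_rich] * 3, uniform)
```

Working in log space avoids underflow for longer sequences, since the likelihood is a product of many probabilities below 1.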
Coding Likelihood (CL)

Sliding windows of a sequence: 1, 2, 3, 4, …
$$CL = \frac{1}{n} \sum_{i=1}^{n} P(\mathrm{CDS} \mid S_i)$$
where S_i is the i-th window and n is the number of windows
Simulation based on NCDS (introns)
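The coding likelihood is an average of per-window posteriors, which can be sketched independently of how each posterior is computed. Below, `coding_likelihood` takes any scoring function; the window size, step, and the GC-based stand-in scorer are assumptions for illustration only and are not the posterior used in the slides.

```python
def coding_likelihood(seq, window, step, posterior):
    """Average posterior(window) over sliding windows of `seq`."""
    scores = [posterior(seq[i:i + window])
              for i in range(0, len(seq) - window + 1, step)]
    if not scores:
        raise ValueError("sequence is shorter than the window")
    return sum(scores) / len(scores)


def toy_posterior(window_seq):
    """Stand-in scorer (hypothetical): fraction of G in the window."""
    return window_seq.count("G") / len(window_seq)


# Windows of "GGAA" with size 2, step 1: GG (1.0), GA (0.5), AA (0.0).
cl = coding_likelihood("GGAA", window=2, step=1, posterior=toy_posterior)  # 0.5
```

Scoring intron sequences the same way would give a null distribution of CL values from which a cutoff can be chosen, in line with the simulation mentioned on the slide.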