GENOME EVOLUTION AND GENE DUPLICATIONS IN EUKARYOTES
Shin-Han Shiu
Plant Biology / QBMI
Michigan State University

Genomes and gene contents
  [Figure: gene content values shown: 17,000; 45,000; 6,000; 10,000; 30,000; 25,000]

Duplicate genes in the genome
  Arabidopsis gene families*
  *: Families are clusters from Markov clustering, using all-against-all BLAST E values as distance measures

Gene function and duplication
  What’s the consequence?

Focus I: Duplication Mechanism and Loss Rate
  Gene Duplications
    Mechanisms
    Preferential retention
    Consequences

Duplication mechanisms
  Whole genome duplication
  Tandem duplication
  Segmental duplication
  Replicative transposition

Lineage-specific gains in plants and animals
  Substantially more recent duplicates in plants than in animals
  Mostly due to frequent whole genome duplications in plants

  Organism      Lineage-specific gains   Normalized gain*   # of genes in families analyzed   % total
  Rice          10115                    6743               28467                             35.5 (23.7)**
  Arabidopsis   5984                     3990               21936                             27.3 (18.2)**
  Human         811                      811                21954                             3.7
  Mouse         1265                     1265               24041                             5.3

  *: The gain counts are normalized against the ratio between the Arabidopsis-rice and human-mouse divergence times (150 and 100 Mya, respectively).
  **: Numbers in parentheses are the percentages of the total based on normalized gains.

Gain vs. Loss
  3 rounds of whole-genome duplications in the Arabidopsis lineage
  ~82% of the duplicates from the last round were lost in the past 40 million years
  [Figure: Arabidopsis lineage, values shown: 15,000*; 30,000; 60,000; 120,000]
  Genome duplications + tandem duplications - gene losses = gene content: 21,000**
  *: Number of orthologous groups in shared families between Arabidopsis and rice.
  **: Number of genes in shared families.
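
The normalization used in the "Lineage-specific gains in plants and animals" table can be checked with a few lines of arithmetic. The sketch below is in Python and assumes a simple linear scaling of the plant gains by the ratio of divergence times (100/150). The function names are hypothetical, and the Arabidopsis value on the slide differs from this scaling by about one gene, presumably because of rounding.

```python
# Divergence times (Mya) taken from the table footnote.
PLANT_DIVERGENCE_MYA = 150   # Arabidopsis-rice
ANIMAL_DIVERGENCE_MYA = 100  # human-mouse

# organism: (lineage-specific gains, genes in families analyzed, is_plant)
TABLE = {
    "Rice": (10115, 28467, True),
    "Arabidopsis": (5984, 21936, True),
    "Human": (811, 21954, False),
    "Mouse": (1265, 24041, False),
}


def normalized_gain(gain, is_plant):
    """Scale plant gains to the human-mouse time span (assumed linear in time)."""
    if is_plant:
        return gain * ANIMAL_DIVERGENCE_MYA / PLANT_DIVERGENCE_MYA
    return float(gain)


def percent_total(gain, genes_in_families):
    """Gains as a percentage of the genes in the families analyzed."""
    return 100.0 * gain / genes_in_families


for organism, (gain, total, is_plant) in TABLE.items():
    norm = normalized_gain(gain, is_plant)
    print(organism, round(norm),
          round(percent_total(gain, total), 1),
          round(percent_total(norm, total), 1))
```

Even after normalization, the plant percentages stay several times higher than the animal ones, which is the point the slide makes.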
“Age” distribution of animal duplicates
  Steady decay in the number of duplicates
  Frequent tandem duplication (TD), segmental duplication (SD), and replicative transposition (RT)
  Ks: rate of nucleotide substitutions at codon sites that do not affect amino acid identity
  Shiu et al., 2006

Plant duplicate “age” distribution
  Apparent peak at Ks ~0.18 instead of at zero
  Frequent whole genome duplication (WGD), TD, SD (maybe), and RT (in some plants)
  Shiu et al., 2004

Genome remodeling in polyploids
  Natural and synthetic polyploids
  [Figure: genome sizes shown: ~314 Mb, ~257 Mb, ~348 Mb, ~203 Mb; time scale 20,000 yr]

Experimental approaches
  Genome-wide polymorphism monitored by tiling array
  [Figure: tiling array design; labels: Genome, Tiled probes, Gap, Resolution, Array (~6 million features)]

Genome-wide Single Feature Polymorphism
  Mid-parent (MP) vs. Arabidopsis suecica (As)

  Polyploid   SFP
  Natural     58,517
  Synthetic   503

Genome-wide Single Feature Polymorphism
  Genome-wide polymorphism monitored by tiling array
  [Figure: feature types shown: Gene, Pseudogene, Transposon]

Genome-wide Single Feature Polymorphism
  Duplication or deletion
  MP duplication or As deletion

Genome Survey Sequencing
  Sequence ~40-60 Mb of the Arabidopsis suecica genome
  0.15-0.2X coverage; will be done next week!

Ultra-high throughput sequencer (GS20) funded by the Strategic Partnership Grant
  Ultra-high throughput
    20-30 Mb per run, each run 5 hours
    Will be 100 Mb per run in early 2007
  Cost efficient
    ~$0.3/kb
  Read length rather limited
    ~100 bp per read now
    Will be ~200 bp in early 2007
  For more information contact:
    Andreas Weber (aweber@msu.edu)
    David DeWitt (dewittd@msu.edu)
    or Shin-Han Shiu (shius@msu.edu)
  Seminar on instrumentation: 9/29, Friday, 1pm, 1415 BPS

Summary: Gene duplication and polyploidy
  Gene duplication occurred frequently in eukaryotes, but most duplicates are lost.
  In plants, whole genome duplication is common, but gene loss also occurred frequently.
  After 4 generations, only a very small number of SFPs are identified in synthetic polyploids.
  After 20,000 generations, most coding genes do not have clustered sequence polymorphisms that are indicative of deletion.
  Clustered polymorphisms are mostly located in pseudogenes and transposons.
  Survey sequencing is necessary to determine whether some coding genes have become pseudogenes without being deleted.

Focus II: Differential Retention of Duplicates
  Gene Duplications
    Mechanisms
    Preferential retention
    Consequences

Duplicate genes in the genome
  Arabidopsis gene families*
  *: Families are clusters from Markov clustering, using all-against-all BLAST E values as distance measures

Large gene families in plants
  One of the largest gene families
  Normalized gain: % expanded orthologous groups (OGs)
  Large family sizes do not necessarily indicate higher expansion rates

Ancestral family sizes and gene gains
  Large ancestral families tend to have more lineage-specific gains, but with many exceptions

Differential expansion of functional categories
  GO: Gene Ontology
  Protein ubiquitination
  Polysaccharide biosynthesis
  Cell wall modification
  Transcriptional regulation
  Biotic stress response
  Secondary metabolism

Differences in Duplicability
  Duplicability: the propensity for the retention of a duplicate gene
  Computational analysis of genome-wide trend
  Categories (compared for Arabidopsis and Human):
    Defense response
    Proteolysis
    Transport
    Ion channel activity
    Metabolism
    Development
    Protein kinase activity
    Transcription factor activity

Kinase superfamily sizes among eukaryotes

  Organism                      Number of genes   Kinase superfamily   Percent of total genes
  Arabidopsis thaliana          25,814            1,041                4.0
  Oryza sativa subsp. indica    ~35,000           1,607                3.6
  Chlamydomonas reinhardtii     ~12,200           414                  3.4
  Plasmodium falciparum         5,334             94                   1.8
  Plasmodium yoelii             7,681             70                   0.9
  Caenorhabditis elegans        19,484            417                  2.1
  Drosophila melanogaster       13,808            262                  1.9
  Anopheles gambiae             15,088            216                  1.4
  Ciona intestinalis            15,852            316                  2.0
  Fugu rubripes                 33,609            632                  1.9
  Mus musculus                  22,444            495                  2.2
  Homo sapiens                  22,980            472                  2.1
  Saccharomyces cerevisiae      6,449             113                  1.8
  Candida albicans              6,164             95                   1.5
  Neurospora crassa             10,082            104                  1.9
  Schizosaccharomyces pombe     4,945             109                  2.2

  Shiu & Bleecker, 2003

Kinase families in rice and Arabidopsis
  Gene count differences among families indicate differential expansion
  Shiu et al., 2004

Estimation of ancestral RLK family size
  Kinase phylogeny of Arabidopsis and rice RLKs
  440 speciation points
  [Figure: rice and Arabidopsis phylogenies; panel A: WAK; panel B: LRR VIII, X, XII]
  Shiu et al., 2004

Development vs. resistance/defense RLKs
  Shiu et al., 2004

Contradiction
  Plant genes involved in development tend to have high duplicability

  Gene group                 Duplicability
  Resistance/Defense RLKs    High
  Developmental RLKs         Low
  Animal tyrosine kinases    Low
  Transcription factors      High

Selection for expansion
  Depends on the level of variation in the signals

Summary: differential retention
  Longevity and duplicability of plant genes

  Duplicability   Longevity   Examples
  High            High        Transcription factors
  High            Low         Resistance genes
  Low             High        Enzymes in central metabolic pathways
  Low             Low         ??

Focus III: Functional Consequences
  Gene Duplications
    Mechanisms
    Preferential retention
    Consequences

Functional Consequences of Duplication
  Functional divergence and conservation
    Is it because of changes in cis-regulatory elements or in coding sequences?
    How are duplicates retained: subfunctionalization or neofunctionalization?

Divergence in gene expression
  Develop pipelines for cis-element prediction
  [Figure: pipeline elements shown: Expression data; Clusters of genes with similar expression profiles; Cis-regulatory logic; Machine learning; Experimental validations; Motif functional prediction; Over-represented sequence motifs in 5’ regions]

Divergence in post-translational modification
  Conservation of phosphorylation sites across species
  SACE: budding yeast
  CAGL: Candida glabrata
  CAAL: Candida albicans
  CATR: Candida tropicalis
  NECR: Neurospora crassa
  DEHA: Debaryomyces hansenii

Detailed Functional Studies of Duplicate Genes
  Functional analyses of DDF1 and DDF2 transcription factors
    Derived from a recent whole genome duplication in Arabidopsis
    Related to the well-known CBF factors involved in cold and drought stress
  [Figure: analyses of the DDFs in Arabidopsis thaliana and Arabidopsis lyrata: Promoter GFP, Knockouts, Binding targets, Overexpression studies, Interacting proteins]

Focus IV: Protein space
  Gene Duplications
    Mechanisms
    Preferential retention
    Consequences

Tiling array analysis of transcriptome
  Human Chr 21, 22
  Kapranov et al., 2002

Posterior probability p(F|coding)

Performance of the CI measure
  Known Arabidopsis exons and introns, 90-300 bp
  Arabidopsis small proteins that are not annotated
    Correctly predicts 19 out of 20 (95%)
  Yeast sORFs with translation evidence
    Correctly predicts 98 out of 114 (86%)
  In “intergenic” sequences of the Arabidopsis genome
    3,274 sORFs identified

Coupling with tiling array expression
  Hybridization intensities for feature types

Summary: Novel coding genes
  Many unannotated regions in the genomes are expressed.
  Using the CI measure, many proteins from yeast and Arabidopsis that were not annotated but have evidence of expression are identified correctly.
  Using the CI measure, we estimated that ~3,000 novel coding regions are present in the unannotated regions of the Arabidopsis thaliana genome.
  Using tiling array data, we found that many of these novel coding regions are expressed.

Acknowledgement
  Lab members: Kousuke Hanada, Melissa Lehti-Shiu, Cheng Zou, Emily Eckenrode
  University of Chicago: Justin Borevitz, Xu Zhang
  University of Wisconsin: Sara Patterson, Rick Vierstra
  University of Missouri: Scott Peck
  Michigan State University: Many…; Rong Jin, Comp Sci & Eng; Yue-Hua Cui, Stat & Prob
  Startup fund

Recent completion …

Genome remodeling in polyploids
  Genome duplication occurs frequently in plants
  What is the fate of duplicates?
  How fast do gene losses occur?
  Is there any preference in the genes retained?
  [Figure: genes A-E and their duplicates (A1, A2, ..., E1, E2); Ng values shown: 5, 10, 8 at t1, 5 at t2]

Comparing degrees of expansion
  Arabidopsis: ~25,000 proteins
  Rice prediction: ~66,000 genes
  [Figure: workflow labels: Combined set; Gene/domain families (unique, shared; e.g. GO:0001); Pairwise distance; Putative orthologous groups (ui = 1, ei = 4); All orthologous groups]
  Total unexpanded = Σ ui
  Total expanded = Σ ei

Major questions on gene duplication
  When: timing of gene duplications, e.g. N = 10

Domain gains in rice and Arabidopsis
  Gain in one lineage does not necessarily predict gain in the other

Identify novel small coding genes
  Determine base composition probabilities
    Coding sequences: CDS parameters
    Non-coding sequences: NCDS parameters
    $$P_c(\mathrm{AAA}) = \frac{\#\text{ of AAA}}{\#\text{ of all NNN}}, \qquad P_c(\mathrm{T} \mid \mathrm{AAA}) = \frac{P_c(\mathrm{AAAT})}{P_c(\mathrm{AAA})}$$
    Feature tables: c1, c2, c3, c4, c5, c6, n
  Calculate posterior probability
    $$P(\mathrm{CDS} \mid S) = \frac{P(S \mid \mathrm{CDS})\,P(\mathrm{CDS})}{P(S \mid \mathrm{CDS})\,P(\mathrm{CDS}) + P(S \mid \mathrm{NCDS})\,P(\mathrm{NCDS})}$$

Setting up the Bayes’ Priors
  $$P(\mathrm{CDS}) = P(\mathrm{NCDS}) = \frac{1}{2}$$
  $$P(\mathrm{CDS}_1) = P(\mathrm{CDS}_2) = \dots = P(\mathrm{CDS}_6) = \frac{1}{2} \cdot \frac{1}{6}$$
  $$P(S \mid \mathrm{CDS})\,P(\mathrm{CDS}) = \sum_{m=1}^{6} P(S \mid \mathrm{CDS}_m)\,P(\mathrm{CDS}_m)$$
  S = ATG TTC TAC TTT G…
  $$P(S \mid \mathrm{CDS}_1) = P_{c1}(\mathrm{ATG})\,P_{c1}(\mathrm{T} \mid \mathrm{ATG})\,P_{c2}(\mathrm{T} \mid \mathrm{TGT})\,P_{c3}(\mathrm{C} \mid \mathrm{GTT})\,P_{c1}(\mathrm{T} \mid \mathrm{TTC}) \cdots$$
  $$P(S \mid \mathrm{CDS}_2) = P_{c2}(\mathrm{ATG})\,P_{c2}(\mathrm{T} \mid \mathrm{ATG})\,P_{c3}(\mathrm{T} \mid \mathrm{TGT})\,P_{c1}(\mathrm{C} \mid \mathrm{GTT})\,P_{c2}(\mathrm{T} \mid \mathrm{TTC}) \cdots$$
  …
  $$P(S \mid \mathrm{CDS}_6) = P_{c6}(\mathrm{ATG})\,P_{c6}(\mathrm{T} \mid \mathrm{ATG})\,P_{c4}(\mathrm{T} \mid \mathrm{TGT})\,P_{c5}(\mathrm{C} \mid \mathrm{GTT})\,P_{c6}(\mathrm{T} \mid \mathrm{TTC}) \cdots$$
  $$P(S \mid \mathrm{CDS}_n) = P_{n}(\mathrm{ATG})\,P_{n}(\mathrm{T} \mid \mathrm{ATG})\,P_{n}(\mathrm{T} \mid \mathrm{TGT})\,P_{n}(\mathrm{C} \mid \mathrm{GTT})\,P_{n}(\mathrm{T} \mid \mathrm{TTC}) \cdots$$
  (Feature table n holds the non-coding, NCDS, parameters.)

Coding Likelihood (CL)
  Sliding windows of a sequence: 1, 2, 3, 4, …
  CL is computed from the window posteriors P(CDS | S_n), where n indexes the windows
  Simulation based on NCDS (introns)
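
The posterior calculation described in the "Identify novel small coding genes" and "Setting up the Bayes’ Priors" slides can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the pipeline used in the talk: the pseudocounts, the log-space arithmetic, the choice to train tables c4-c6 on reverse-complemented coding sequences, all function and class names, and the toy training sequences are assumptions made for this example. Conditional probabilities are estimated directly from 4-mer counts, which approximates the ratio Pc(AAAT)/Pc(AAA) shown on the slide.

```python
import math
from collections import Counter

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case ACGT sequence."""
    return seq.translate(COMPLEMENT)[::-1]


class FeatureTable:
    """Triplet and 4-mer counts for one feature table (c1..c6 or n)."""

    def __init__(self, pseudo=1.0):
        self.tri = Counter()
        self.quad = Counter()
        self.pseudo = pseudo  # assumed smoothing, not from the slides

    def log_p_triplet(self, t):
        total = sum(self.tri.values())
        return math.log((self.tri[t] + self.pseudo) / (total + 64 * self.pseudo))

    def log_p_next(self, base, t):
        context = sum(self.quad[t + b] for b in BASES)
        return math.log((self.quad[t + base] + self.pseudo)
                        / (context + 4 * self.pseudo))


def train_tables(seqs, n_phases):
    """One table per phase; the phase of a k-mer is its start index mod n_phases."""
    tables = [FeatureTable() for _ in range(n_phases)]
    for seq in seqs:
        for i in range(len(seq) - 2):
            table = tables[i % n_phases]
            table.tri[seq[i:i + 3]] += 1
            if i + 4 <= len(seq):
                table.quad[seq[i:i + 4]] += 1
    return tables


def log_likelihood(seq, tables, start):
    """log P(S | model): table `start` scores position 0, then tables cycle."""
    k = len(tables)
    ll = tables[start].log_p_triplet(seq[:3])
    for i in range(len(seq) - 3):
        ll += tables[(start + i) % k].log_p_next(seq[i + 3], seq[i:i + 3])
    return ll


def posterior_cds(seq, fwd, rev, noncoding):
    """P(CDS | S) with P(CDS) = P(NCDS) = 1/2 and each of six frames at 1/12."""
    logs = [math.log(1 / 12) + log_likelihood(seq, tabs, m)
            for tabs in (fwd, rev) for m in range(3)]
    logs.append(math.log(1 / 2) + log_likelihood(seq, noncoding, 0))
    top = max(logs)  # subtract the maximum before exponentiating
    weights = [math.exp(x - top) for x in logs]
    return sum(weights[:6]) / sum(weights)


# Toy training data, invented for illustration only.
coding = ["ATG" + "GCT" * 30 + "TAA"]
fwd = train_tables(coding, 3)                        # c1, c2, c3
rev = train_tables([revcomp(s) for s in coding], 3)  # c4, c5, c6
noncoding = train_tables(["AT" * 50], 1)             # n

p_coding_like = posterior_cds("GCTGCTGCTGCT", fwd, rev, noncoding)
p_noncoding_like = posterior_cds("ATATATATATAT", fwd, rev, noncoding)
print(p_coding_like, p_noncoding_like)
```

Applying `posterior_cds` to successive sliding windows of a longer sequence would give the per-window values P(CDS | S_n) that the Coding Likelihood slide refers to.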