The gene family play and the chromosomal theater
Todd Vision
Department of Biology
University of North Carolina at Chapel Hill

Outline
Large-scale duplication and loss of genes in the angiosperms
Looking into the future of plant phylogenomics
A case study in gene family demography
Duplication and functional divergence

Paul Franz, University of Amsterdam

Arabidopsis as a hub for plant comparative maps
[Figure: bar chart of genome sizes in angiosperms, in megabases (axis 0 to 1000); bar values 145, 262, 367, 367, 372, 415, 439, 473, 560, 622, 907; species labels garbled in the source]
Data from Arumuganathan & Earle (1991) Plant Mol Biol Rep 9:208-218

Tomato-Arabidopsis synteny
Bancroft (2001) TIG 17, 89, after Ku et al (2000) PNAS 97, 9121

Duplicated genes in Arabidopsis

Modes of gene duplication
Tandem (T)
• unequal crossing-over
• mostly young
Dispersed (D)
• transposition
• all ages
Segmental (S)
• polyploidy
• all old

Paleotetraploidy?
The Arabidopsis Genome Initiative. 2000. Nature 408:796
Vision et al. (2000) Science 290:2114-7.

Microsynteny within blocks
[Figure: distribution of dA for gene pairs in blocks and not in blocks; x axis: amino acid substitution (0.0 to 1.0); y axis: f (0.00 to 0.10)]
Problems
• proteins diverge at different rates
• high dA is difficult to estimate
Solution
• average dA within blocks

Discrete duplication events
[Figure: frequency of blocks (0 to 12) against amino acid substitution (0.0 to 0.9), with age classes labeled A to F and a time axis of 0 to 200 Mya]

The 2-4 complex (one ancestral segment broken up by 4 large inversions)
[Figure: dot plot of chromosome 2 (5.6 Mb) against chromosome 4 (4.6 Mb); blocks labeled 45, 49, 52, 54, 56]
[Figure: histogram of Ka (0 to 1), coefficient of variation = 0.67]
[Figure: histogram of Ks (0 to 5), coefficient of variation = 0.53]

Rice-Arabidopsis microsynteny
Mayer et al. (2001) Genome Res. 11, 1167
Blanc, Hokamp, Wolfe (2003) Genome Res. 13, 137-144.
[Figure: gene trees of rice and Arabidopsis homologs showing the position of the duplication]
• Block 37: duplication after Asterid-Rosid split
• Block 57: duplication before monocot-dicot divergence
Raes, Vandepoele, Saeys, Simillion, Van de Peer (2003) J. Struct. Func. Genomics 3, 117-129

Divergence among duplicated genes in rice
Goff et al. (2002) Science 296: 92

Hidden syntenies
Simillion, Vandepoele, Van Montagu, Zabeau, Van de Peer (2002) PNAS 99, 13627

Interspecies comparison can reveal hidden syntenies
Vandepoele, Simillion, Van de Peer (2002) TIG 18, 606-608

Comparative mapping in a phylogenetic context

Major plant genome datasets
Family        Genus
Aizoaceae     Mesembryanthemum crystallinum
Brassicaceae  Arabidopsis thaliana
              Brassica spp.
Fabaceae      Glycine max
              Medicago truncatula
              Phaseolus spp.
Malvaceae     Gossypium arboreum
Solanaceae    Capsicum annuum
              Lycopersicon esculentum
              Solanum tuberosum
Poaceae       Hordeum vulgare
              Oryza sativa
              Sorghum bicolor/propinquum
              Triticum aestivum
              Zea mays
Other         Beta vulgaris
              Chlamydomonas reinhardtii
              Pinus taeda
              Populus spp.
              Prunus spp.
Data types marked per genus: genome (1 genus marked), EST (15 marked), map (19 marked)

Plant unigene datasets
species         TIGR      PlantGDB
barley          49885     74621
beet            na        13565
chlamydomonas   30296     na
citrus          na        4266
coffee          na        392
cotton          24350     27854
grape           49885     74621
iceplant        8455      8945
lettuce         21960     na
lotus           11025     na
maize           55063     71655
marchantia      na        1059
medicago        36976     43384
oat             na        361
onion           11726     na
pine            26882     24668
poplar          na        20935
potato          24275     24839
rice            60778     52156
rye             5199      5384
sorghum         33273     34363
soybean         67826     73946
sunflower       20520     na
tomato          31012     35725
wheat           109509    95949
+ Arabidopsis: 27170

Wikström et al (2001) Proc R Soc Lond B 268, 2211

Plant phylogenomics: Phytome
The goal is to integrate
• Organismal phylogeny
• Gene family sequence alignment phylogeny
• Genetic and physical maps

Some uses for Phytome
Starting with a chromosome segment
• Identify homologous segments
• Predict unobserved gene content (candidate QTL)
Starting with a gene family
• Resolve orthology/paralogy relationships
• Identify coevolving families
Starting with a species
• Explore lineage-specific diversification
• Guide comparative mapping wet-work

Current pipeline
[Flowchart boxes: Unigene collections; Annotations; Protein sequence prediction; Homolog identification; Protein family clustering; Multiple sequence alignment; Phylogenetic inference; Phytome]

Lineage specific diversification
[Figure: counts for Arabidopsis, Cotton, Medicago, Tomato and Rice; values shown: 1033, 173, 436, 334, 836, 696, 715, 919]
152 genes are “single copy” in all four species

A tale of two sisters: the ARF and the Aux/IAA gene families
Modulate whole plant response to auxin
Interact via dimerization
• ARFs are transcription factors
• Aux/IAAs bind and repress ARFs in the absence of auxin

The chromosomal context

Diversification of ARFs

Diversification of the Aux/IAAs

Why the different patterns of diversification?
12% (ARF) vs 40% (Aux/IAA) segmental duplications
Presumably reflects differential retention
Possible explanations
• Dosage requirements
• Coevolution with other interacting genes
• Regional transcriptional regulation

Divergence of duplicated genes

Age of duplication
Duplicate pairs in yeast and human (Gu et al. 2002, Makova and Li 2003)
Approx. 50% of pairs diverge very rapidly
Proportion of divergent pairs increases with Ks and Ka
• Plateaus at Ka ~0.3 in human
In humans,
• Immune response genes are over-represented among young, divergent pairs
• Distantly related pairs with conserved expression tend to be either ubiquitous or very tissue specific

Retention of duplicated genes
Nonfunctionalization, or loss of one copy
• The fate of most pairs
Neofunctionalization (NF)
• Positive selection on a new mutation can maintain the pair
Subfunctionalization (SF)
• Mutations that increase the specificity of duplicates can fix due to drift provided that, combined, the two copies provide the functionality of the ancestral gene. Once SF happens, both copies are indispensable and are retained.
• One prediction of the model is that SF is more likely for tandem than for dispersed pairs (due to linkage)

Digital expression profiling
Massively Parallel Signature Sequencing (MPSS)
• Count occurrence of 17-20 bp mRNA signatures
• Cloning and sequencing is done on microbeads
• Similar to Serial Analysis of Gene Expression (SAGE)
“Bar-code” counting reduces concerns of
• cross-hybridization
• probe affinity
• background hybridization
Advantages
• Accurate counts of low expression genes
• Can distinguish expression profiles of duplicate genes

MPSS library construction
Brenner et al., PNAS 97:1665-70.
[Figure: pool of poly(A) mRNA molecules]
• Extract mRNA from tissue
• Convert mRNA to cDNA
• Cut w/ Sau3A (GATC)
• Add linker
  • 5’: add standard primer
  • 3’: add unique 32 bp tag and standard primer (added by cloning)
• PCR
• Remove 3’ primer and expose single stranded unique tag (digest, 3’→5’ exonuclease)
• Anneal to beads coated with unique anti-tag (32 bp, complementary to tag on mRNA)
• Sort by FACS to remove ‘empty’ beads

The result of the library construction is a set of microbeads. Each bead contains many DNA molecules, all derived from the 3’ end of a single transcript. Beads are loaded in a monolayer on a microscope slide for the sequencing of 17-20 bp from the 5’ end.

MPSS Sequencing
Brenner et al., Nat. Biotech. 18:630-4.
• Add adaptors
• Sequence by hybridization: 16 cycles for 4 bp
• Digest with Type IIS enzyme to uncover next 4 bases
• Repeat cycle
• Steps of four bases; overhang is shifted by four bases in each round
[Figure: encoded adaptors NNNX, NNXN, NXNN, XNNN (labels RS, CODE X1 to X4) hybridizing to the four-base overhang, positions 4321 and then 8765; decoder probes; spacer labels 13 bp and 9 bp]

Each bead provides a signature of 17-20 bp

Tag #     Signature Sequence     # of Beads (Frequency)
1         GATCAATCGGACTTGTC      2
2         GATCGTGCATCAGCAGT      53
3         GATCCGATACAGCTTTG      212
4         GATCTATGGGTATAGTC      349
5         GATCCATCGTTTGGTGC      417
6         GATCCCAGCAAGATAAC      561
7         GATCCTCCGTCTTCACA      672
8         GATCACTTCTCTCATTA      702
9         GATCTACCAGAACTCGG      814
.
.
30,285    GATCGGACCGATCGACT      2,935
Total # of tags: >1,000,000

Two sets of signatures are generated from each sample in different reading frames staggered by two bases
[Figure: gene model from ATG to TGA with the two staggered signature frames]

Classifying signatures
• Duplicated: expression may be from other site in genome
• Potential alternative splicing or nested gene
• Anti-sense transcript or nested gene?
• Potential alternative termination
• Typical signatures
• Potential anti-sense transcript
• Potential un-annotated ORF

Triangles refer to colors used on our web page:
Class 1 - in an exon, same strand as ORF.
Class 2 - within 500 bp after stop codon, same strand as ORF.
Class 3 - anti-sense of ORF (like Class 1, but on opposite strand).
Class 4 - in genome but NOT class 1, 2, 3, 5 or 6.
Class 5 - entirely within intron, same strand.
Class 6 - entirely within intron, anti-sense.
Grey = potential signature NOT expressed
Class 0 - signatures found in the expression libraries but not the genome.

Core Arabidopsis MPSS libraries
sequenced by Lynx for Blake Meyers, U. of Delaware

Library    Signatures sequenced    Distinct signatures
Root       3,645,414               48,102
Shoot      2,885,229               53,396
Flower     1,791,460               37,754
Callus     1,963,474               40,903
Silique    2,018,785               38,503
TOTAL      12,304,362              133,377

http://www.dbi.udel.edu/mpss
Query by
• Sequence
• Arabidopsis gene identifier
• chromosomal position
• BAC clone ID
• MPSS signature
• Library comparison
Site includes
• Library and tissue information
• FAQs and help pages

Genome-wide MPSS profile in Arabidopsis
[Figure: MPSS signature profile along Chr. I, Chr. II, Chr. III, Chr. IV and Chr. V]
Of the 29,084 gene models, 17,849 match unambiguous, expressed class 1 and/or 2 signatures

Dataset of duplicate pairs
Gene families of size two in Arabidopsis classified as
• Dispersed (280)
• Segmental (149)
• Tandem (63)
For each pair
• Measure similarity/distance in expression profile
• Estimate Ks and Ka

Expression distance
[Figure: expression profiles of the two copies plotted on axes library 1, library 2 and library 3]

The number of genes with >5 ppm expression in a given number of libraries among the 984 genes in pairs analyzed and among all Arabidopsis genes with MPSS profiles.
Libraries    Genes in pairs    All genes
0            153 (15.5%)       4160 (23.3%)
1            124 (12.6%)       2643 (14.8%)
2            73 (7.4%)         1727 (9.6%)
3            93 (9.5%)         1777 (10.0%)
4            109 (11.1%)       1930 (10.8%)
5            432 (43.9%)       5612 (31.4%)

Asymmetry in levels of expression among libraries within pairs

Symmetry of divergence

Type of Pair              A             B              C             D
Young
  Dispersed (Ks≤0.5)      14 (15.7%)    61 (68.5%)     8 (9.0%)      6 (6.7%)
  Tandem (Ks≤0.5)         8 (14.3%)     29 (51.8%)     10 (17.9%)    9 (16.1%)
Old
  Dispersed (Ks>0.5)      35 (18.3%)    111 (58.1%)    24 (12.6%)    21 (11.0%)
  Segmental (All)         31 (20.8%)    104 (69.8%)    7 (4.7%)      7 (4.7%)

A: Each copy has higher expression in at least one library
B: One copy has higher expression in all libraries that differ and at least two libraries differ
C: Copies differ in expression in only one library
D: Copies do not differ in expression in any libraries

[Figure: normalized distance (0 to 0.7) against synonymous substitution (0 to 1) for dispersed (D), segmental (S) and tandem (T) pairs]
[Figure: normalized distance (0 to 0.6) against nonsynonymous substitution (0 to 0.2) for D, S and T pairs; fitted line dN = 0.48 + 0.37 KA, p<0.0001]
[Figure: total expression (0 to 4000) against nonsynonymous substitution (0 to 0.4) for D, S and T pairs]
[Figure: breadth of expression (0 to 5) against nonsynonymous substitution (0 to 0.4) for D, S and T pairs]

Pairs with small Ks but dissimilar expression profiles.
(each library cell lists the values for the first / second gene of the pair)
Ks      Ka       dup   gene pair                callus      flower      leaf       root        silique
0.03    <0.01    D     AT1G80700 AT1G80980      71 / 0      59 / 11     0 / 1      140 / 8     94 / 17
0.17    0.05     T     AT2G46280 AT2G46290      246 / 28    210 / 29    160 / 1    308 / 29    80 / 16
0.20    0.06     T     AT2G15400 AT2G15430      4 / 42      14 / 128    5 / 14     5 / 136     34 / 18
0.22    0.05     D     AT1G36280 AT4G18440      1 / 40      3 / 87      9 / 69     13 / 69     10 / 51
0.26    0.05     T     AT1G71270 AT1G71300      88 / 0      56 / 0      44 / 0     52 / 0      107 / 1
0.27    0.07     T     AT3G13290 AT3G13300      20 / 246    22 / 245    1 / 72     1 / 192     6 / 77
0.27    0.10     T     AT1G29390 AT1G29395      18 / 0      238 / 63    89 / 5     8 / 0       165 / 36
0.27    0.06     T     AT3G26070 AT3G26080      16 / 349    169 / 13    346 / 41   0 / 4       524 / 135
0.28    0.13     D     AT3G56190 AT3G56450      216 / 15    115 / 0     144 / 6    239 / 4     56 / 1

Pairs with large Ks but similar expression profiles.
Ks      Ka       dup   gene pair                callus      flower      leaf       root        silique
0.87    0.28     T     AT3G16220 AT3G16230      16 / 21     10 / 57     12 / 35    3 / 13      19 / 13
0.89    0.13     D     AT3G03660 AT5G17810      14 / 71     0 / 0       0 / 0      0 / 0       0 / 0
0.95    0.29     D     AT2G41180 AT3G56710      57 / 75     14 / 15     78 / 39    4 / 3       29 / 14
0.97    0.28     D     AT1G31814 AT5G16320      2 / 0       39 / 55     4 / 10     3 / 19      0 / 8
0.98    0.23     D     AT5G07230 AT5G62080      0 / 0       344 / 288   0 / 0      0 / 0       0 / 0
0.99    0.26     D     AT3G22160 AT4G15120      86 / 34     6 / 2       10 / 0     4 / 0       4 / 0

A closing thought
1965
• The Ecological Theater and the Evolutionary Play, G. E. Hutchinson
2004
• The Chromosomal Theater and the Gene Family Play
Phylogenetics has a great deal to contribute to understanding the evolutionary interplay of genome structure and function

Dan Brown, Brandon Gaut, Steven Tanksley, Liqing Zhang, Jason Phillips, Dihui Lu, David Remington, Jason Reed, Tom Guilfoyle, Blake Meyers
NSF
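
Supplementary sketch: the symmetry classes A to D
The symmetry-of-divergence classes A to D defined earlier can be written as a small decision rule. The Python below is only an illustrative sketch. The slides do not say how two copies were judged to "differ" in a library, so the `differs` function and its thresholds (`min_ppm`, `fold`) are hypothetical assumptions, as are the function names. The counts are the per-library values of three pairs from the talk, in the order callus, flower, leaf, root, silique.

```python
from typing import Sequence


def differs(a: float, b: float, min_ppm: float = 5.0, fold: float = 2.0) -> bool:
    """Hypothetical rule (not from the talk): the two copies differ in a library
    when the larger count exceeds min_ppm and is at least `fold` times the
    smaller count (the smaller count is floored at 1 ppm)."""
    hi, lo = max(a, b), min(a, b)
    return hi > min_ppm and hi >= fold * max(lo, 1.0)


def symmetry_class(copy1: Sequence[float], copy2: Sequence[float]) -> str:
    """Assign a duplicate pair to class A, B, C or D from per-library counts."""
    if len(copy1) != len(copy2):
        raise ValueError("both copies need one value per library")
    # Libraries in which copy 1 (or copy 2) is the more highly expressed copy.
    higher1 = sum(1 for a, b in zip(copy1, copy2) if a > b and differs(a, b))
    higher2 = sum(1 for a, b in zip(copy1, copy2) if b > a and differs(a, b))
    n_diff = higher1 + higher2
    if n_diff == 0:
        return "D"  # copies do not differ in any library
    if n_diff == 1:
        return "C"  # copies differ in only one library
    if higher1 and higher2:
        return "A"  # each copy is higher in at least one library
    return "B"      # one copy is higher in all (at least two) differing libraries


# Counts per library: callus, flower, leaf, root, silique.
pairs = {
    "AT1G71270/AT1G71300": ([88, 56, 44, 52, 107], [0, 0, 0, 0, 1]),
    "AT3G26070/AT3G26080": ([16, 169, 346, 0, 524], [349, 13, 41, 4, 135]),
    "AT5G07230/AT5G62080": ([0, 344, 0, 0, 0], [0, 288, 0, 0, 0]),
}
for name, (c1, c2) in pairs.items():
    print(name, symmetry_class(c1, c2))
```

Under these assumed thresholds the three pairs fall in classes B, A and D respectively. A different rule for "differ" (for example a statistical test on the counts) could move pairs between classes, so the output illustrates the definitions only and does not reproduce the reported counts.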