Integration of data to uncover evolutionary trends and infer protein function: The tale of Rcs1
M. Madan Babu
MRC Laboratory of Molecular Biology, Cambridge

Overview of research

Evolution of biological systems
Evolution of networks within and across genomes
- Evolution of transcription factors
- Evolution of transcriptional networks
Nuc. Acids. Res (2003); Nature Genetics (2004); J Mol Biol (2006a)

Structure and function of biological systems
Structure and dynamics of transcriptional networks
J Mol Biol (2006b); J Mol Biol (2006c)
- Uncovering a distributed architecture in networks
- Methods to study network dynamics
Nature (2004)

Data integration, function prediction and classification
- Discovery of transcription factors in Plasmodium
- Discovery of novel DNA binding proteins
- Evolution of global regulatory hubs
Nuc. Acids. Res (2005); Cell Cycle (2006)

Rcs1 – regulator of cell size 1
[Micrographs: S. cerevisiae - wild type; S. cerevisiae - Rcs1 mutant]
- Mutant cells are twice the size of the parental strain
- The critical size for budding in the mutant is similarly increased
- Rcs1 binds specific DNA sequences
The following parameters, which were used to define cell size, were at least 2 standard deviations (2 σ) from the wild-type mean in the Rcs1 mutant:
Mother cell-size                   874    760
Contour length of mother cell      108    100
Long axis length of mother cell    36     33
Short axis length of mother cell   30     27
Roundness of mother cell           1.29   1.20
Micrographs and data from SCMD

Rcs1 is a global regulatory hub – Network analysis I
[Bar chart: Distribution of DNA binding domains in yeast transcription factors. Families shown: P53, Tigger, Dal82, Ime1, Tea, Abf1, Tig, AT-Hook, Ace1, Rcs1, Gcr1, LisH, HMG1, Mads, Myb, Apses, Hsf, Fkh, bHLH, Gata, Homeo, bZip, C2H2-Zn, C6-Fungal. Axis: No. of members]
Rcs1p and Aft2p are global regulatory hubs with an as yet uncharacterized DNA binding domain
[Figure: Transcriptional regulatory network in yeast, with the sub-network of Rcs1 and Aft2. Number of target genes regulated: Aft2p 123, 41, Rcs1p 314]
How did the paralogous hubs that regulate distinct sets of genes evolve?

Relationship to WRKY DNA binding domain – Sequence analysis I
- Search against the non-redundant database: Candida albicans (ascomycete), Yarrowia lipolytica (ascomycete), Ustilago maydis (basidiomycete), Cryptococcus sp (basidiomycetes), E. cuniculi (microsporidia), Giardia lamblia (diplomonad), Dictyostelium discoideum, Entamoeba histolytica
- Lineage-specific expansion in several fungi; the domain is also seen in lower eukaryotes
- Profiles + HMM of this region, searched against the non-redundant database: WRKY domain (Arabidopsis) + FAR-1 type transposase (Medicago truncatula)
- Globular region maps to WRKY DNA-binding domain

Confirmation of relationship to WRKY DBD – Sequence analysis II
- Rcs1 (S. cerevisiae) + WRKY DNA-binding domain from Arabidopsis WRKY4, searched against the non-redundant database: Gcm1 (Drosophila)
- WRKY DNA-binding domain maps to the same globular region
- Multiple sequence alignment of all globular domains, with JPRED/PHD secondary structure prediction (strands S1-S4)
- The order of secondary structure elements is similar to that of the WRKY DNA-binding domain and the GCM1 protein from mouse
- Homologs of the conserved globular domain constitute a novel family of the WRKY DNA-binding domain

Characterization of the globular domain – structural analysis I
[Figure: predicted secondary structure of the Rcs1 DBD (S1-S4) compared with the secondary structure of WRKY4 (S1-S4) and of GCM1 (S1-S4)]
Template structures:
- A. thaliana transcription factor (WRKY4: 1wj2; NMR structure)
- Mus musculus Glial Cell Missing - 1 (GCM-1: 1odh; X-ray structure)
Both WRKY and GCM1 have a similar network of stabilizing interactions

Characterization of the globular domain – structural analysis II
[Figure: strands S1-S4]
- The 4 residues involved in metal co-ordination and the 10 residues involved in key stabilizing hydrophobic interactions, which determine the path of the backbone in the four strands of the GCM1-WRKY domain, show a strong pattern of conservation.
- The core fold of the Rcs1 DBD will be similar to the WRKY-GCM1 domain and may bind DNA in a similar way

Classification of WRKY-GCM1 superfamily – Cladistic analysis I
[Figure: template structure (strands S1-S4, Zn2+ co-ordinated by C and H residues) annotated with the sequence features of each family]
Families: Classical WRKY (C), Insert containing version (I), HxC containing version (HxC), GCM domain (G), FLYWCH domain (F)
Sequence features:
- WRKY4 (C): WRKY motif in S1; short loop between S2 & S3
- Rcs1 (HxC): HxC instead of HxH; N-terminal helix; short insert between S2 & S3
- Conserved W in S2
- Far1 (I): N-terminal helix; conserved W in S4; large insert between S2 & S3
- Mdg (F)
- Gcm1 (G): insertion of Zn ribbon between S2 and S3

Domain context for the different families – network analysis I
[Figure: domain architectures in which each family occurs. Families: Classical WRKY (C), Insert containing version (I), HxC containing version (HxC), GCM domain (G), FLYWCH domain (F). Contexts shown: stand alone, tandem, Zn cluster, Zn knuckle, BED finger, POZ, SMBD, OTU protease, MULE Tpase, mobile element. Examples: WRKY4, Rcs1, At2g23500, 101.t00020, Far1, Mod (mdg), Gcm1. Labels: TF only, TF only, TF + TP]

Phyletic distribution – Comparative genome analysis I
[Figure: occurrence of the C, I, HxC, F and G versions as transcription factor or transposase in higher eukaryotes (Human, Fly, Worm), Fungi, lower eukaryotes (Entamoeba, Slime mould) and Plants]
- GCM1 and FLYWCH versions evolved from an insert containing version that is a transposase
- HxC and insert containing versions are seen as both transcription factors and as transposases
- The classical version of the WRKY evolved from an insert containing version that is a transposase
- explain that there have been multiple transitions from transposase to TFs in the fungal genomes
- explain how this could have happened by showing the snapshot of the breakup of selfish elements into two distinct products
- explain that the transposase can itself regulate its own gene expression

Outline of the presentation
Rcs1 and Aft2 have a distinct version of the WRKY type DNA binding domain
Sensitive sequence search reveals that
[Species labels: Oryza sativa (monocot), Arabidopsis thaliana (dicot), Medicago truncatula (dicot), Nicotiana tabacum (dicot)]

Structural equivalences of WRKY-GCM1 domain proteins with Bed and Zn finger
[Figure: topology diagrams of WRKY (1wj2), GCM-type WRKY (1odh), Classical Zn-finger (1m36) and Bed-finger (2ct5), showing strands S1-S4, helix H1 and the Zn2+ co-ordinating C and H residues]

Why Rcs1?
While systematically analyzing the genes that give rise to abnormal cell size, we and others noted that mutants of Rcs1 give abnormal cell shape. It was known to be an important transcription factor involved in cell size regulation.
- explain showing graphs and images
Independently, during the analysis of the TNET in yeast, we looked at the hubs and the DNA binding domains that were present in them. Interestingly, there were two hubs that did not have any known DNA binding domain identified in them, although the region that mediates DNA binding was known.
- explain showing the family relationship of the hubs
- only two members, and both are hubs
- how and when did they evolve?
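
The hub analysis mentioned in the notes above can be sketched in a few lines: count the distinct target genes of each transcription factor and flag the regulators with many targets. This is only an illustrative sketch. The toy edges, the placeholder names `g1`..`g5` and `TFX`, the function names and the `min_targets` cutoff are all assumptions, not the yeast network or the threshold used in the talk.

```python
from collections import defaultdict


def targets_by_regulator(edges):
    """Group (regulator, target) pairs into a dict: regulator -> set of targets."""
    targets = defaultdict(set)
    for regulator, target in edges:
        targets[regulator].add(target)
    return dict(targets)


def find_hubs(edges, min_targets):
    """Return regulators with at least `min_targets` distinct targets, sorted by name."""
    targets = targets_by_regulator(edges)
    return sorted(tf for tf, genes in targets.items() if len(genes) >= min_targets)


# Toy network: hypothetical edges, not real regulatory interactions.
toy_edges = [
    ("Rcs1", "g1"), ("Rcs1", "g2"), ("Rcs1", "g3"),
    ("Aft2", "g3"), ("Aft2", "g4"), ("Aft2", "g5"),
    ("TFX", "g1"),
]

hubs = find_hubs(toy_edges, min_targets=3)
by_tf = targets_by_regulator(toy_edges)
# Targets regulated by both paralogs (the overlap region of a Venn diagram).
shared_targets = by_tf["Rcs1"] & by_tf["Aft2"]
print(hubs, shared_targets)
```

The same set intersection gives the number of targets shared by two paralogous hubs, which is the kind of quantity summarised on the "Number of target genes regulated" figure.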
Standard search procedures using Pfam and other databases did not provide any clue about the domain. So we set out to characterize the DNA binding region of Rcs1p and its paralog Aft2p using sensitive sequence searches and other computational methods.
- show output from Pfam hits

WRKY DNA binding domain – Structure analysis I
Structural aspects of the DNA binding domain
- Explain the residues involved in metal chelation
- DNA contacting surface
- Inserts in the loops
- Stabilizing contacts involved

WRKY DNA binding domain – Structure analysis II
Structure comparisons identify several other known transcription factors, including the GCM protein in eukaryotes
- Explain the insert of a zinc ribbon in the loop
In fact, sequence comparison without the insert can pick up these WRKY proteins

Classification of WRKY domains – Cladistic analysis I
Multiple starting points identified all homologs in the different species. This allowed us to classify the sequences into different families, each with a specific feature suggesting a common evolutionary relationship, based on shared and derived features of the domains.
- List the 5 families and point to the features involved using a structure template

Phylogenetic distribution and domain architecture for the different families - I
Phyletic profiles of the different domains point to the possibility that these transcription factors could have evolved from transposases, with at least two distinct recruitments into transcription factors:
- In plants in one case
- At the base of the fungal lineage in the other case

Phylogenetic distribution and domain architecture for the different families - II
Comparative genomics using the fungal genomes provides the clue for the evolution of these TFs
- explain that there have been multiple transitions from transposase to TFs in the fungal genomes
- explain how this could have happened by showing the snapshot of the breakup of selfish elements into two distinct products
- explain that the transposase can itself regulate its own gene expression
- extensive recruitment of the transposase in the different fungal lineages
- multiple jumps within the fungal lineage
- a very recent duplication event in the order Saccharomycetales suggests that hubs could evolve rapidly
- Candida Rbf1 and other TFs independently duplicated and evolved as global regulators

Analysis of the gene expression data in plants
Since this happened in fungal genomes, we ask how these proteins behave in plants.
- show the gene expression patterns for the different subfamilies
We see two trends: one in which divergence has primarily occurred in expression rather than in protein sequence, and another in which proteins with the same expression pattern have different binding site residues.
- spatio-temporal changes in gene expression
- It is experimentally well known that the FLYWCH and the GCM proteins are developmentally important regulatory proteins. So in three lineages a transposase has been recruited to become a developmentally important global regulator.

Analysis of the gene expression data in plants
There are interesting traces in the gene expression patterns of the different WRKY containing proteins. TPases are expressed in the root and in the pollen, enhancing the possibility of rapidly expanding themselves during evolution.
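
The two trends described in the notes on the plant expression data can be illustrated with a small sketch that compares a pair of related proteins on two axes: similarity of expression profile and similarity of sequence. Everything here is an assumption for illustration: the use of Pearson correlation and percent identity, the 0.8 cutoffs, the function names, and the toy profiles and aligned strings. It is not the analysis pipeline used in the talk.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def identity(seq_a, seq_b):
    """Fraction of identical positions in two pre-aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(p == q for p, q in zip(seq_a, seq_b)) / len(seq_a)


def classify_pair(expr_a, expr_b, seq_a, seq_b, r_cut=0.8, id_cut=0.8):
    """Label a protein pair by which property has diverged (cutoffs are arbitrary)."""
    similar_expr = pearson(expr_a, expr_b) >= r_cut
    similar_seq = identity(seq_a, seq_b) >= id_cut
    if similar_seq and not similar_expr:
        return "expression divergence"
    if similar_expr and not similar_seq:
        return "sequence divergence"
    return "both similar" if similar_seq else "both diverged"


# Toy data: made-up expression values over four conditions and short aligned strings.
trend_1 = classify_pair([1, 2, 3, 4], [4, 3, 2, 1], "WRKYGQK", "WRKYGQK")
trend_2 = classify_pair([1, 2, 3, 4], [2, 4, 6, 8], "WRKYGQK", "WKKYGEK")
print(trend_1, trend_2)
```

In practice the sequence comparison would be restricted to the binding site residues of interest rather than the whole aligned region; the sketch only shows how the two kinds of divergence can be separated.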
Acknowledgements
Aravind group: L Aravind, S Balaji, Lakshminarayan Iyer

[Figure: domain architectures of WRKY-GCM1 domain proteins across lineages (Animals: Homo sapiens, Drosophila melanogaster, Caenorhabditis elegans; Fungi; Plants; Encephalitozoon cuniculi; Ciliates; Apicomplexa; Giardia lamblia; Entamoeba histolytica; Dictyostelium discoideum). Proteins shown include AFT2_Scer_6325054, FAR1_Atha_18414374, gcm_Dmel_17137116, hGCMa_Hsap_1769820, mod(mdg4)_Dmel_24648712, WRKY58_Atha_22330782 and 101.t00020_Ehis_67474280.
Family key: C = Classical WRKY; G = GCM-type WRKY; I = Insert-containing WRKY; F = FLYWCH-type WRKY; HxC = HxC-type WRKY.
Associated domains: MULE transposase, Plant specific Zn-cluster, Zinc knuckle, BED finger, SWIM domain, Plant-specific mobile domain, PHD finger, C2H2 finger, LRR, STAND ATPase, Isochorismatase, Plant specific N-all-beta, TIR domain, AT-hook, OTU, POZ]

Expression profiles of WRKY-GCM1 domain proteins in Arabidopsis
[Heat map: Gene expression profiles for the developmental stages in Arabidopsis thaliana (Root, Stem, Leaf, Apex, Flower, Floral organs, Seeds)]
WRKY proteins show tissue specific expression
[Heat map: Gene expression profiles for the light exposure conditions in Arabidopsis thaliana (Darkness, Continuous light, Pulse light)]
WRKY proteins show light specific expression
Rows in both heat maps: 60 WRKY domain containing proteins; 40 HxC type WRKY domain proteins; 15 Far1-type proteins; 5 WRKY domain proteins with TIR/LRR

Relationship between Rcs1p and Aft2p homologs
Multiple independent evolution of TFs from Transposons
[Figure: phylogenetic tree of Rcs1p and Aft2p homologs from Fungi, Plants, Animals and Entamoeba. Labelled clusters: Rcs1 Aft2p cluster (including RCS1 SCER 51830313 and AFT2 SCER 6325054 with their orthologs in other Saccharomyces species) and Rbf1 cluster (including RBF1P CALB 2498834). Other leaves include fungal sequences from Yarrowia lipolytica (Mutyl), Neurospora crassa, Chaetomium globosum, Gibberella zeae, Magnaporthe grisea, Aspergillus, Cryptococcus neoformans and Ustilago maydis; Entamoeba histolytica sequences; plant FAR1-type sequences (FAR1 ATHA 18414374 and related Arabidopsis proteins); and animal sequences (C20ORF164 HSAP 13929452, T24C4.2 CELE 17555262 and others)]

Transcriptional network involving Aft2p and Rcs1p
[Figure: Number of target genes regulated: Aft2p 123, 41, Rcs1p 314]

Conclusion
Integration of different types of experimental data allowed us to identify the DNA binding domain in Rcs1
- Sequence
- Structure
- Expression
- Interaction