MICROBIAL GENOME ANNOTATION
Loren Hauser, Miriam Land, Yun-Juan Chang, Frank Larimer, Doug Hyatt, Cynthia Jeffries

1 NEB Educational Support
http://www.neb.com/nebecomm/course_support.asp?

2 Why study Computational Biology and Bioinformatics?
DNA sequencing output is growing faster than Moore's law!
1 Illumina sequencing machine = 0.5 Tbp/week
There are hundreds of these and thousands of other sequencing machines around the world.
New sequencing technology will conceivably allow sequencing a human genome for less than $1K in less than 1 day!

3 Why study Medical Bioinformatics?
In the near future, most cancer diagnostics will involve DNA or RNA sequencing!
In the near future, every baby born in the developed world will have their genome sequenced.
Protecting privacy and your doctor's ability to use that information are the only real impediments!
Hospitals are using DNA sequencing to track antibiotic-resistant bacterial infections.

4 DOE Undergraduate Research in Microbial Genome Analysis and Functional Genomics
http://www.jgi.doe.gov/education

5 Why Study Microbial Genomes?
Large biological mass (50% of total)
Photosynthetic (Prochlorococcus)
Fix N2 gas to NH3 (Rhodopseudomonas)
NH3 to NO2 (Nitrosomonas)
Bioremediation (Shewanella, Burkholderia)
Pathogens, BW (Yersinia pestis - plague)
Food production (Lactobacillus)
CH4 production (Methanosarcina)
H2 production (Rhodopseudomonas)

6 Example of Current Microbial Genome Projects
UC Davis: FDA-funded 100K bacterial genomes project associated with food.
100K genomes / 5 years = 20K genomes per year; 20K per year / 200 days per year = 100 genomes/day!
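The throughput arithmetic on the 100K genomes slide can be checked with a few lines of Python. This is only a sketch: the function name and its parameters are hypothetical, and the 200 working days per year is the figure quoted on the slide, not a measured value.

```python
def required_throughput(total_genomes, project_years, working_days_per_year):
    """Return (genomes per year, genomes per working day) for a fixed-size project."""
    per_year = total_genomes / project_years
    per_day = per_year / working_days_per_year
    return per_year, per_day


# Figures taken from the slide: 100K genomes, 5 years, 200 working days/year.
per_year, per_day = required_throughput(100_000, 5, 200)
print(f"{per_year:.0f} genomes/year, {per_day:.0f} genomes/day")
```

Changing the number of working days shows how sensitive the daily target is to that assumption.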
7 Web Resources and Contact Information
http://genome.ornl.gov/microbial/
http://www.jgi.doe.gov/
http://genome.jgi-psf.org/
http://www.jcvi.org/
http://www.ncbi.nlm.nih.gov/
http://www.sanger.ac.uk/
http://www.ebi.ac.uk/
ftp://ftp.lsd.ornl.gov/pub/JGI (Artemis-ready files for each scaffold: feature table plus FASTA sequence file)
Contact: landml@ornl.gov; hauserlj@ornl.gov

8

9 Evolution of Sequencing Throughput
Sequencing Technology | Samples/run | bp/sample | runs/week | bp/week | year
Maxam and Gilbert | 1 | 100 | 5 | 500 | 1977
Manual Sanger | 5 | 400 | 5 | 10000 | 1985
Automated Sanger (96 lanes/gel) | 100 | 500 | 5 | 250000 | 1995
Automated Sanger (384 capillaries) | 400 | 600 | 10 | 2400000 | 2002
454 sequencing (new titanium) | 1,000,000 | 400 | 5 | 2E+09 | 2009
Solexa (Illumina) | 300,000,000 | 75 | 1 | 2.25E+10 | 2009
Solexa (Illumina) | 1,000,000,000 | 200 | 1 | 2.00E+12 | ?2010
PacBio realtime sequencing | 100,000,000 | 1000 | 10 | 1E+12 | ?2010

Sequenced Microbial Genomes
ARCHAEAL GENOMES: 159 FINISHED; 218 IN PROGRESS
BACTERIAL GENOMES: 3363 FINISHED; 11831 IN PROGRESS
ENVIRONMENTAL COMMUNITIES: > 50,000 samples (see MG-RAST)
as of Sept 6, 2012
http://www.expasy.ch/alinks.html
http://www.genomesonline.org
http://metagenomics.anl.gov/

11 Published Genomes
Nitrosomonas europaea - J. Bac. 185(9):2759-2773 (2003)
Prochlorococcus MED4 & MIT9313 - Nature 424:1042-1047 (2003)
Synechococcus WH8102 - Nature 424:1037-1042 (2003)
Rhodopseudomonas palustris - Nat. Biotech. 22(1):55-61 (2004)
Yersinia pseudotuberculosis - PNAS 101(22):13826-31 (2004)
Nitrobacter winogradskyi - Appl. Envir. Micro. 72(3):2050-63 (2006)
Nitrosococcus oceani - Appl. Envir. Micro. 72(9):6299-315 (2006)
Burkholderia xenovorans - PNAS 103(42):15280-7 (2006)
Thiomicrospira crunogena - PLoS Biology 4(12):e383 (2006)
Nitrosomonas eutropha C91 - Env. Micro. 9(12):2993-3007 (2007)
Sulfuromonas denitrificans - Appl. Envir. Micro. 74(4):1145-56 (2008)
Nitrosospira multiformis - Appl. Envir. Micro. 74(11):3559-72 (2008)
Nitrobacter hamburgensis - Appl. Envir. Micro. 74(9):2852-63 (2008)
Saccharophagus degradans - PLoS Genetics 4(5):e1000087 (2008)
R. palustris, 5 strain comparison - PNAS 105(47):18543-8 (2008)
L. rubarum and L. ferrodiazotrophum - Appl. Envir. Micro. (in press)

12 Basic Annotation Impacts
Design of oligonucleotide arrays
Design & prioritize protein expression constructs
Design & prioritize gene knockouts
Assessment of overall metabolic capacity
Database for proteomics
Allows visualization of whole genome

13 Additional Analysis Impacts
Revised functional assignments based on domain fusions, functional clustering, phylogenetic profile
Regulatory motif discovery
Operon and regulon discovery
Regulatory and protein association network discovery

14 Microbial Annotation Genome Pipeline
Flowchart boxes:
Scaffolds or contigs
Simple repeats
Prodigal
Complex Repeats
Model correction
tRNAs
Final Gene List
InterPro
PRIAM
Blast
COGs
TMHMM
SignalP
rRNA, Misc_RNAs
GC Content, GC skew
Function call
Web Pages
Feature table

15 Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm)
Unsupervised: Automatically learns the statistical properties of the genome.
Indifferent to GC Content: Prodigal performs well irrespective of the GC content of the organism.
Draft: Prodigal can train on multiple sequences, then analyze individual draft sequences.
Open Source: Prodigal is freely available under the GPL.
Reference: Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010 Mar 8;11(1):119. (Highly Accessed)

16 G+C Frame Plot Training
Takes all ORFs above a specified length in the genome.
Examines the G+C bias in each frame position of these ORFs.
Runs a dynamic programming pass using G+C frame bias as its coding scoring function to predict genes.
Takes those predicted genes and gathers dicodon usage statistics.
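The G+C frame bias step described above can be illustrated with a minimal sketch. This is not Prodigal's actual code: the function name, the `min_length` default, and the choice to skip ORFs whose length is not a multiple of three are all assumptions made for illustration. It only shows the idea of measuring G+C content separately at codon positions 1, 2 and 3 across long ORFs.

```python
def gc_frame_bias(orfs, min_length=90):
    """Return the G+C fraction at codon positions 1, 2 and 3 over all long ORFs.

    orfs: iterable of in-frame ORF nucleotide strings (start codon first).
    min_length: hypothetical length cutoff in bases; shorter ORFs are ignored.
    """
    gc_counts = [0, 0, 0]
    totals = [0, 0, 0]
    for orf in orfs:
        seq = orf.upper()
        # Skip short ORFs and anything that is not a whole number of codons.
        if len(seq) < min_length or len(seq) % 3 != 0:
            continue
        for i, base in enumerate(seq):
            pos = i % 3  # 0, 1, 2 correspond to codon positions 1, 2, 3
            totals[pos] += 1
            if base in "GC":
                gc_counts[pos] += 1
    return [g / t if t else 0.0 for g, t in zip(gc_counts, totals)]


# Toy ORF: ATG, thirty GCC codons, then a TAA stop (96 bases).
example_orf = "ATG" + "GCC" * 30 + "TAA"
bias = gc_frame_bias([example_orf])
print(bias)
```

In a real genome the three fractions differ in a way that depends on overall GC content, which is what makes the per-position bias usable as a first, training-free coding signal.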

17 Gene Prediction
Dicodon usage coding score
Length factor added to coding score (GC-content-dependent)
Coding/noncoding thresholds sharpened (starts downstream of starts with higher coding scores are penalized by the difference).
Dynamic programming to put genes together.
Bonuses for operon distances, larger bonus for -1/-4 overlaps.
Same-strand overlap allowed (up to 60 bases).
Opposite-strand overlap -->3'r 5'f<- allowed (up to 250 bases)

18 Start Site Scoring: Shine-Dalgarno Motif
Examines initially predicted genes and gathers statistics on the starts (RBS motifs, ATG vs GTG vs TTG frequency).
Moves starts based on these discoveries.
Gathers statistics on the new set of starts and repeats this process until convergence (5-10 iterations).
RBS motifs are based on the AGGAGG sequence: 3-6 base motifs, with one mismatch allowed in motifs of 5 bases or longer (e.g. GGTGG or AGCAG).
Runs a final dynamic programming pass with the start scoring function.

19 Start Site Scoring: Other Motifs
If Shine-Dalgarno scoring is strong, use it; this accounts for ~85% of genomes.
If Shine-Dalgarno scoring is weak, look for other motifs.
If a strong scoring motif is found, use it (example GGTG in A.
pernix)
If no strong scoring motif is found, use highest score of all found motifs (example: Crenarchaea, Tc and Tl start sites are the same, but internal operon genes use weak Shine-Dalgarno motifs)

20 Annotated Gene Prediction
21 Prodigal Scoring
22 Gene Prediction Problems - Pseudogenes
23 Pseudogenes - Internal deletion
24 Pseudogenes - Premature stop codon
25 Pseudogenes - N-terminal deletion
26 Pseudogenes - Transposon insertion
27 Pseudogenes - Multiple frameshifts
28 Pseudogenes - Premature Stop and Frameshift
29 Pseudogenes - Dead Start Codon
30
31 GENE PAGE
32
33
34

35 ORGANISM'S (PSYC) COGS LIST
Contig is Contig1 and Num Prot is blank (---) for every row. Category values and some descriptions are cut off in the source; cuts are marked with "...".
Gene | Group | COG | Gene Name | COG Description | Score | E-Value | Category
1 | I | COG1211 | IspD | 4-diphosphocytidyl-2-methyl-D-erithritol synthase | 86 | 9.00E-19 | Lipid metabolism
3 | E | COG0137 | ArgG | Argininosuccinate synthase | 565 | 1.00E-162 | Amino acid trans...
4 | S | COG1376 | ErfK | Uncharacterized protein conserved in bacteria | 61 | 4.00E-11 | Function unknow...
6 | K | COG0583 | LysR | Transcriptional regulator | 123 | 2.00E-29 | Transcription
8 | R | COG0628 | PerM | Predicted permease | 145 | 6.00E-36 | General function
10 | L | COG0593 | DnaA | ATPase involved in DNA replication initiation | 104 | 8.00E-24 | DNA replication
11 | | No COG | | | | |
14 | J | COG0172 | SerS | Seryl-tRNA synthetase | 557 | 1.00E-160 | Translation ribos...
15 | G | COG0021 | TktA | Transketolase | 992 | 0 | Carbohydrate tra...
16 | | No COG | | | | |
17 | L | COG0551 | TopA | Zn-finger domain associated with topoisomerase type I | 38 | 9.00E-04 | DNA replication
19 | L | COG0507 | RecD | ATP-dependent exoDNAse (exonuclease V) alpha subunit - h... | 61 | 3.00E-10 | DNA replication
20 | G | COG0057 | GapA | Glyceraldehyde-3-phosphate dehydrogenase/erythrose-4-pho... | 344 | 7.00E-96 | Carbohydrate tra...
23 | R | COG1451 | COG1451 | Predicted metal-dependent hydrolase | 104 | 3.00E-24 | General function
24 | P | COG0168 | TrkG | Trk-type K+ transport systems, membrane components | 181 | 1.00E-46 | Inorganic ion tra...
25 | P | COG0569 | TrkA | K+ transport systems, NAD-binding component | 125 | 4.00E-30 | Inorganic ion tra...
26 | | No COG | | | | |
27 | M | COG2885 | OmpA | Outer membrane protein and related peptidoglycan-associate... | 113 | 1.00E-26 | Cell envelope bio...
28 | | No COG | | | | |
29 | M | COG1538 | TolC | Outer membrane protein | 114 | 2.00E-26 | Cell envelope bio...
29 | R | COG1538 | TolC | Outer membrane protein | 114 | 2.00E-26 | Cell envelope bio...
30 | R | COG2274 | SunT | ABC-type bacteriocin/lantibiotic exporters, contain an N-termi... | 410 | 1.00E-115 | Defense mechan...
31 | R | COG1566 | EmrA | Multidrug resistance efflux pump | 82 | 7.00E-17 | Defense mechan...
33 | | No COG | | | | |
34 | | No COG | | | | |
35 | C | COG2010 | CccA | Cytochrome c, mono- and diheme variants | 38 | 3.00E-04 | Energy productio...
36 | R | COG3019 | COG3019 | Predicted metal-binding protein | 136 | 1.00E-33 | General function
37 | Q | COG2132 | SufI | Putative multicopper oxidases | 271 | 9.00E-74 | Secondary meta...
38 | P | COG3667 | PcoB | Uncharacterized protein involved in copper resistance | 148 | 6.00E-37 | Inorganic ion tra...
39 | S | COG3544 | COG3544 | Uncharacterized protein conserved in bacteria | 43 | 8.00E-06 | Function unknow...
40 | R | COG0491 | GloB | Zn-dependent hydrolases, including glyoxylases | 100 | 2.00E-22 | General function
41 | P | COG2217 | ZntA | Cation transport ATPase | 754 | 0 | Inorganic ion tra...
42 | R | COG1826 | TatA | Sec-independent protein secretion pathway components | 59 | 4.00E-11 | Intracellular traffi...
43 | | No COG | | | | |
45 | S | COG1937 | COG1937 | Uncharacterized protein conserved in bacteria | 49 | 5.00E-08 | Function unknow...
46 | G | COG2814 | AraJ | Arabinose efflux permease | 57 | 2.00E-09 | Carbohydrate tra...
47 | O | COG0435 | ECM4 | Predicted glutathione S-transferase | 507 | 1.00E-145 | Posttranslationa...
51 | J | COG0261 | RplU | Ribosomal protein L21 | 126 | 3.00E-31 | Translation ribos...
52 | J | COG0211 | RpmA | Ribosomal protein L27 | 119 | 4.00E-29 | Translation ribos...
53 | M | COG2834 | LolA | Outer membrane lipoprotein-sorting protein | 122 | 3.00E-29 | Cell envelope bio...
57 | S | COG4399 | COG4399 | Uncharacterized protein conserved in bacteria | 41 | 2.00E-04 | Function unknow...
58 | O | COG1138 | CcmF | Cytochrome c biogenesis factor | 589 | 1.00E-169 | Posttranslationa...
59 | O | COG0526 | TrxA | Thiol-disulfide isomerase and thioredoxins | 49 | 2.00E-07 | Posttranslationa...
59 | C | COG0526 | TrxA | Thiol-disulfide isomerase and thioredoxins | 49 | 2.00E-07 | Posttranslationa...
60 | O | COG3088 | CcmH | Uncharacterized protein involved in biosynthesis of c-type cyt... | 157 | 4.00E-40 | Posttranslationa...
61 | O | COG4235 | COG4235 | Cytochrome c biogenesis factor | 103 | 3.00E-23 | Posttranslationa...
62 | F | COG0563 | Adk | Adenylate kinase and related kinases | 168 | 3.00E-43 | Nucleotide trans...
64 | R | COG1949 | Orn | Oligoribonuclease (3'->5' exoribonuclease) | 293 | 7.00E-81 | RNA processing
65 | I | COG1502 | Cls | Phosphatidylserine/phosphatidylglycerophosphate/cardiolipin... | 230 | 1.00E-61 | Lipid metabolism
66 | R | COG0790 | COG0790 | FOG: TPR repeat, SEL1 subfamily | 56 | 4.00E-09 | General function
67 | M | COG1519 | KdtA | 3-deoxy-D-manno-octulosonic-acid transferase | 330 | 2.00E-91 | Cell envelope bio...
68 | S | COG1385 | COG1385 | Uncharacterized protein conserved in bacteria | 120 | 1.00E-28 | Function unknow...
69 | P | COG1840 | AfuA | ABC-type Fe3+ transport system, periplasmic component | 164 | 1.00E-41 | Inorganic ion tra...
70 | P | COG1178 | ThiP | ABC-type Fe3+ transport system, permease component | 287 | 1.00E-78 | Inorganic ion tra...
71 | E | COG3842 | PotA | ABC-type spermidine/putrescine transport systems, ATPase... | 318 | 5.00E-88 | Amino acid trans...
74 | L | COG0188 | GyrA | Type IIA topoisomerase (DNA gyrase/topo II, topoisomerase I... | 941 | 0 | DNA replication
76 | O | COG0625 | Gst | Glutathione S-transferase | 94 | 7.00E-21 | Posttranslationa...
77 | | No COG | | | | |
78 | R | COG0515 | SPS1 | Serine/threonine protein kinase | 83 | 2.00E-17 | General function
78 | T | COG0515 | SPS1 | Serine/threonine protein kinase | 83 | 2.00E-17 | General function

36 Taxonomic Distribution of Top KEGG BLAST Hits

37 Frequency distance distributions
Salgado et al. PNAS (2000) 97:6652 Fig. 2

38 Frequency distance distributions
Salgado et al. PNAS (2000) 97:6652 Fig. 3b

39 Branched Chain Amino Acid Transporter family
Organism | Class | ATPase COG0410 | ATPase COG0411 | Permease COG0559 | PBP COG0683
Nostoc punctiforme | Cyano | 3 | 3 | 6 | 4
Trichodesmium erythraeum | Cyano | 2 | 1 | 3 | 6
Helicobacter pylori J99 | epsilon | 0 | 0 | 0 | 0
Helicobacter pylori 26695 | epsilon | 0 | 0 | 0 | 0
Campylobacter jejuni subsp. jejuni NCTC 11168 | epsilon | 1 | 1 | 2 | 2
Geobacter metallidurans | delta | 1 | 1 | 2 | 1
Desulfovibrio desulfuricans | delta | 2 | 2 | 4 | 4
Escherichia coli K12 | gamma | 1 | 1 | 2 | 2
Escherichia coli O157:H7 EDL933 | gamma | 1 | 1 | 2 | 2
Buchnera sp. APS | gamma | 0 | 0 | 0 | 0
Pseudomonas aeruginosa, PAO1 | gamma | 3 | 3 | 7 | 4
Pseudomonas fluorescens | gamma | 3 | 3 | 5 | 3
Pseudomonas syringae | gamma | 7 | 4 | 8 | 5
Psychrobacter | gamma | 0 | 0 | 1 | 0
Vibrio cholerae O1 biovar eltor str. N16961 | gamma | 0 | 0 | 1 | 0
Yersinia pestis, CO92 | gamma | 2 | 2 | 4 | 2
Yersinia pseudotuberculosis | gamma | 2 | 2 | 4 | 2
Haemophilus influenzae Rd KW20 | gamma | 0 | 0 | 0 | 0
Pasteurella multocida subsp. multocida str. Pm70 | gamma | 0 | 0 | 0 | 0
Xylella fastidiosa (3 strains) | gamma | 0 | 0 | 0 | 0
Azotobacter vinelandii | gamma | 3 | 4 | 8 | 2
Psychrobacter | gamma | 0 | 0 | 0 | 0
Burkholderia fungorum | beta | 22 | 20 | 34 | 29
Burkholderia mallei | beta | 6 | 6 | 11 | 8
Burkholderia pseudomallei | beta | 7 | 7 | 13 | 10
Ralstonia metallidurans | beta | 9 | 8 | 16 | 12
Ralstonia eutropha | beta | 18 | 19 | 36 | 28
Nitrosomonas europaea | beta | 0 | 0 | 0 | 0
Neisseria meningitidis MC58 | beta | 0 | 0 | 0 | 0
Neisseria meningitidis Z2491 | beta | 0 | 0 | 0 | 0
Caulobacter crescentus | alpha | 0 | 0 | 0 | 0
Mesorhizobium loti | alpha | 7 | 7 | 16 | 10
Agrobacterium tumefaciens | alpha | 7 | 7 | 15 | 9
Bradyrhizobium japonicum | alpha | 27 | 26 | 50 | 59
Brucella melitensis | alpha | 7 | 7 | 12 | 13
Brucella suis | alpha | 6 | 6 | 12 | 11
Sinorhizobium meliloti | alpha | 5 | 5 | 12 | 8
Rickettsia conorii | alpha | 0 | 0 | 1 | 0
Rickettsia prowazekii | alpha | 0 | 0 | 1 | 0
Rhodobacter sphaeroides | alpha | 6 | 6 | 12 | 5
Rhodospirillum rubrum | alpha | 6 | 7 | 13 | 9
Rhodopseudomonas palustris | alpha | 20 | 20 | 40 | 38
Data source column (JGI or COG), in extracted order; some cells are blank, so row alignment is not recoverable: JGI JGI COG COG COG JGI JGI COG COG COG COG JGI JGI JGI COG COG JGI COG COG COG, JGI JGI JGI JGI JGI JGI JGI COG COG COG COG COG COG JGI JGI JGI

40 Probable Ancient Gene (Liv Operon)

41 Branched Chain Amino Acid Transporter family - Rhodopseudomonas palustris
Target ID | Description | Putative Ligand | Thermal Shift Assay Binding Ligand | Δ Tm °C for 1000 uM Ligand OR (100 uM) Ligand | Tm (°C) No Ligand
RPA0985 | putative branched-chain amino acid transport system substrate-binding protein | branched chain AAs | 4-Hydroxybenzoate, Benzoate, Salicylate, Benzaldehyde | 29.0, 13.5, 2.5, 2.0 | 56.5
RPA4029 | possible branched-chain amino acid ABC transport system substrate-binding protein | branched chain AAs | 4-Hydroxybenzoate, p-Coumarate | 17.0, 2.0 | 58.6
RPA4648 | possible ABC transporter binding protein component | spermidine/putrescine | p-Coumarate | 2.0 | 55.5
RPA1250 | amide-urea binding protein | branched chain AAs | Urea | 5.0 | 63.0
RPA1789 | putative branched-chain amino acid transport system substrate-binding protein | branched chain AAs | p-Coumarate | 7.0 | 67.0
RPA3669 | putative urea short-chain amide or branched-chain amino acid uptake ABC transporter periplasmic solute-binding protein precursor | branched chain AAs | Urea | 6.0 | 59.5
RPA3810 | putative periplasmic binding protein of ABC transporter | branched chain AAs | Ala, Gly, Ser, Met, Leu, Cys | 11.5, 6.5, 4.5, 2.5, 2.0, 2.0 | 77.5
RPA2043 | putative ABC transporter, periplasmic substrate-binding protein | nitrate/taurine | Malate | 4.0 | 52.5
RPA2628 | polar amino acid ABC transport substrate-binding protein, aapJ-2 | amino acids, prefers polar aas | Met, Cys, His | 10.0, 6.5, 3.5 | 63.0
RPA0668 | putative ABC transporter subunit, substrate-binding component | branched chain AAs | 4-Hydroxybenzoate, Salicylate, Benzaldehyde | 13.0, (6.0, 2.0) | 61.5
RPA1741 | possible branched-chain amino acid transport system substrate-binding protein | branched chain AAs | Met, Leu, Malate, Gly, Pro | 6.0, 3.0, 3.0, 2.0, 2.0 | 52.0
RPA2193 | putative ABC transporter, periplasmic binding protein, branched chain amino acids | branched chain AAs | Glutarate | 5.0 | 64.5
RPA3486 | putative branched-chain amino acid transport system substrate-binding protein | branched chain AAs | Glutarate | 3.0 | 44.5
RPA2499 | possible ABC transporter, periplasmic protein | nitrate/taurine or aliphatic sulfonates | Asn | 7.0 | 53.5

42 Example of Lateral Transfer

43 Transporter Gene Loss in Yersinia pestis
36 genes involved in transport from YPSE are nonfunctional in YPES
13 lost due to frameshifts
11 lost due to deletions
6 lost due to IS element insertions
4 (2 pair) lost due to recombination causing deletions and frameshifts
2 lost due to premature stop codons

44
45 Nostoc punctiforme Signal Transduction Histidine Kinases
46 Nostoc punctiforme Signal Transduction Histidine Kinases

47 Nostoc punctiforme Signal Transduction Histidine Kinases
Gene # | aa# | COG
R1448 | 374 | COG0642
R1449 | 444 | COG0642
R1550 | 595 | COG0642
R1597 | 1042 | COG4191
R1685 | 1559 | COG0642
R1759 | 706 | COG4251
R1760 | 595 | COG0642
R1778 | 451 | COG5002
R1798 | 1098 | COG0642
R1868 | 713 | COG0642
R2035 | 1080 | COG0642
R2209 | 430 | COG0642
R2262 | 657 | COG0642
R2263 | 740 | COG0642
R2268 | 709 | COG0642
R2271 | 504 | COG4191
R2272 | 1801 | COG3899
R2375 | 1211 | COG5278
R2408 | 928 | COG4585
R2421 | 421 | COG4585
R2485 | 530 | COG0642
R2901 | 629 | COG0642
R2903 | 1116 | COG4251
R2909 | 103 | COG4251
R3010 | 210 | COG0642
R3052 | 475 | COG2205
Remaining domain columns, as extracted (sparse cells, row alignment not recoverable): N-term. (TM) N-term. RRR 1 1 unk. (2) Other domain PAS/PAC GAF(PHY) 1 Chase/1 Hpt unk. (2) unk.(1) unk. (3) 1 HAMP + TM 1 4 (3) 3(1) 2 (1) 1 1 1 5 1 1 1 1 2 1 1 Prt. Kin. 1 Chase 1 Cache (sp) unk. (1) unk. (4) unk. 1 1 unk. (1) 1 1 1 3 1 1 2 (1) 1 HisKA HATPase 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1? 1 1 1 HisKA_3 1 HisKA_3 1 1 1 1 1 1 1 0 0.5 1 0 1 1 C-term.
RRR Operon structure K1448/K1449 K1448/K1449 2 1 1 1 1 1 2 RRR1757/RRR1758/K1759/K1760 RRR1757/RRR1758/K1759/K1760 K1778/WHTH1779 K-R1798/K-F1799/LuxR-F1800 K-R2035/cNMP-F2036 2262-9 2262-9 2262-9 K2271/PK&K2272 K2271/PK&K2272 3 LuxR2407/K2408 LuxR2420/K2421 1 K2901/RRR2902/K2903 K2901/RRR2902/K2903

48 Nostoc punctiforme Signal Transduction Histidine Kinases
Count | Description
169 | predicted genes total
12 | pseudogenes
3 | genes with sensors but no kinase domain (do these work with the genes with no sensor domains - not in the same operon)
154 | functional Signal Transduction Histidine Kinases
2 | with 2 kinase domains (fused genes? or a 1 gene cascade?)
1 | with an Adenylate Cyclase domain
12 | with a Ser/Thr Protein Kinase domain, a COG3899 domain, 1 or more GAF domains, and possibly other domains
3 | with Hpt domain
3 | with CBS domains
6 | with Chase domains
3 | with Cache domains
1 | with an Amino acid transporter as a sensor? domain
23 | with a N-terminal RRR domain
15 | with only a N-terminal RRR as a sensor? domain
66 | with 1 or more RRR domains (86 RRR domains)
46 | with 1 or more C-terminal RRR domains (i.e. Hybrid kinases)
61 | with 1 or more PAS/PAC domains (147 PAS/PAC domains total)
59 | with 1 or more GAF or Phytochrome domains (96 total - 38 phytochromes)
21 | with HAMP domains (34 total)
64 | with unknown N-terminal sensor domains
82 | with multiple N-terminal sensor domains
3 | with no sensor domain (do these work with the genes with no kinase domains - not in the same operon)
1 | with large C-terminal unknown domain
1 | with N-terminal RRR & WHTH (fused genes?)
1 | cNMP binding sensor domain
2 | with HisKA_2 type dimerization/autophosphorylation domains
5 | with HisKA_3 type dimerization/autophosphorylation domains
8 | putative operons with common (bidirectional) promoter
TM = Transmembrane alpha helical domain
RRR = response regulator receiver domain (Phospho accepting Asp containing domain)

49 Nostoc punctiforme Regulatory Proteins
570 Regulatory Proteins
201 Transcription/Elongation/Termination Factors
  14 Sigma Factors
    9 Cyanobacterial Sigma Factors
    0 Sigma-54 (RpoN)
    0 Sigma 32 (RpoH)
    2 Sigma 28 (Flagella/Sporulation)
    2 Sigma-24 (RpoE/FecI) (ECF subfamily)
    1 Unknown Sigma factor (ECF subfamily)
  17 Anti/Anti-Anti Sigma Factors
    1 Anti-Sigma regulatory factor (Ser/Thr protein kinase and phosphatase)
    8 Anti-Sigma-factor antagonist (STAS) domain protein
    1 Anti-Sigma-factor antagonist (STAS) and sugar transferase
    1 Predicted transmembrane transcriptional regulator (anti-sigma factor)
    5 Putative Anti-Sigma regulatory factor (Ser/Thr protein kinase)
    1 Sigma 54 modulation protein/ribosomal protein S30EA
  3 Termination/Antitermination Factors
    1 NusA antitermination factor
    1 NusB antitermination factor
    1 NusG antitermination factor
  0 Elongation Factors
    0 GreA/GreB family elongation factors
  167 Transcription factors
    3 Ferric uptake regulator (FUR) family
    1 Negative regulator of class I heat shock protein
    2 phage shock protein A, PspA
    1 Phosphate uptake regulator, PhoU
    1 Plasmid maintenance system antidote protein
    6 Predicted transcriptional regulator
    1 SOS-response transcriptional repressor, LexA
    1 Putative transcriptional activator, Baf
    1 Transcriptional Regulator, AbrB family
    5 Transcriptional Regulator, AraC family
    1 Transcriptional Regulator, AraC family with Methyltransferase activity
    5 Two Component Transcriptional Regulator, AraC family
Comments/Pseudogenes column, as extracted (row alignment not recoverable):
  2 sets of pseudogenes: pNPAR018 truncated by transposase; pNPAR022, 3, 4 ar
  1 set of pseudogenes: NpR2325/6
  S1 RNA binding domain:KH domain / RNA binding
  4 different COGs

50 Burkholderia xenovorans Regulatory Proteins
946 Regulatory Proteins
704 Transcription/Elongation/Termination Factors
  22 Sigma Factors
    4 Sigma 70 (RpoD)
    2 Sigma-54 (RpoN)
    2 Sigma 32 (RpoH)
    1 Sigma 28 (Flagella/Sporulation)
    12 Sigma-24 (RpoE/FecI) (ECF subfamily)
    1 Unknown Sigma factor (ECF subfamily)
  13 Anti/Anti-Anti Sigma Factors
    1 Anti Sigma-E protein, RseA, Burkholderiaceae specific
    1 Anti-Sigma regulatory factor (Ser/Thr protein kinase and phosphatase)
    1 Anti-Sigma(ECF) factor, ChrR
    2 Anti-Sigma-factor antagonist (STAS) domain protein
    4 Predicted transmembrane transcriptional regulator (anti-sigma factor)
    1 Putative Anti-Sigma regulatory factor (Ser/Thr protein kinase)
    1 Putative Anti-Sigma-28 factor, FlgM
    1 Putative Sigma E regulatory protein, MucB/RseB
    1 Sigma-54 modulation protein
  6 Termination/Antitermination Factors
    1 transcription termination factor Rho
    2 Response regulator receiver (CheY) and ANTAR domain protein
    1 NusA antitermination factor
    1 NusB antitermination factor
    1 NusG antitermination factor
  3 Elongation Factors
    3 GreA/GreB family elongation factors
  660 Transcription factors
    7 Cold-shock DNA-binding domain protein
    1 Possible Ferric uptake regulator (FUR) family
    2 Ferric uptake regulator (FUR) family
    1 Negative regulator of class I heat shock protein
    1 Negative transcriptional regulator
Comments column, as extracted (row alignment not recoverable):
  also called ribosomal protein S30AE
  Cold-shock DNA-binding domain (related to S1 RNA binding domain)
  ANTAR = RNA binding, anti-termination
  S1 RNA binding domain:KH domain / RNA binding

51 Regulatory Protein Identification Scheme
Column headers: Number, Category, Product Description, COG1, COG2, InterPro, Pfam, Smart, TIGR
Number | Category | Product Description | Evidence listed with the row (COG / InterPro / Pfam)
5 | Chemotaxis Signal Transduction | Possible Bacterial chemotaxis sensory transducer |
5 | Chemotaxis Signal Transduction | Bacterial chemotaxis sensory transducer |
5 | Chemotaxis Signal Transduction | Bacterial chemotaxis sensory transducer |
5 | Chemotaxis Signal Transduction | Bacterial chemotaxis sensory transducer, TarH (aspartate) sensor |
5 | Chemotaxis Signal Transduction | Bacterial chemotaxis sensory transducer, Pas/Pac sensor |
5 | Chemotaxis Signal Transduction | Bacterial chemotaxis sensory transducer, Cache sensor |
5 | Chemotaxis Signal Transduction | Bacterial chemotaxis sensory transducer, GAF sensor |
5 | Chemotaxis Signal Transduction | Bacterial chemotaxis sensory transducer, Phytochrome sensor |
5 | Chemotaxis Signal Transduction | Bacterial chemotaxis sensory transducer, Phytochrome sensor |
5 | Chemotaxis Signal Transduction | CheW protein | COG0835; CheW
5 | Chemotaxis Signal Transduction | CheW protein | IPR002545
5 | Chemotaxis Signal Transduction | Two component CheW protein | IPR002545 and IPR001789
5 | Chemotaxis Signal Transduction | Possible CheA Signal Transduction Histidine Kinases (STHK), weak homolog, no good domain identification | COG0643
5 | Chemotaxis Signal Transduction | Possible CheA Signal Transduction Histidine Kinases (STHK) | COG0643; HATPase_c
5 | Chemotaxis Signal Transduction | CheA Signal Transduction Histidine Kinases (STHK) | COG0643; IPR008207 and IPR003594
5 | Chemotaxis Signal Transduction | CheA Signal Transduction Histidine Kinases (STHK) | IPR002545 and IPR003594 and IPR004105
5 | Chemotaxis Signal Transduction | CheB methylesterase | COG2201; CheB_methylest
5 | Chemotaxis Signal Transduction | CheB methylesterase | IPR000673 and IPR001789; CheB_methylest
5 | Chemotaxis Signal Transduction | Two component CheB methylesterase | COG2201; IPR001789; CheB_methylest
5 | Chemotaxis Signal Transduction | MCP methyltransferase, CheR-type | COG1352; CheR
5 | Chemotaxis Signal Transduction | MCP methyltransferase, CheR-type | IPR000780; CheR
5 | Chemotaxis Signal Transduction | MCP methyltransferase, CheR-type with PAS/PAC sensor | COG1352; CheR
5 | Chemotaxis Signal Transduction | MCP methyltransferase/methylesterase, CheR/CheB with PAS/PAC sensor | COG1352; CheR and CheB_methylest
5 | Chemotaxis Signal Transduction | CheC, inhibitor of MCP methylation | COG1776
5 | Chemotaxis Signal Transduction | CheD, stimulates methylation of MCP proteins | COG1871
Evidence cells for the sensory transducer rows, as extracted (row alignment not recoverable): COG0840 COG0840 MCPsignal MA MCPsignal and Tar MCPsignal MCPsignal and Cache MCPsignal MCPsignal MA and TarH MA MA MA and GAF MA and GAF
Additional evidence cells, as extracted (row alignment not recoverable): IPR004089 COG0840 COG0840 COG0840 COG0840 COG0840 IPR001294 IPR001294 and IPR004089 sensory_box CheW CheW CheW HATPase_c HPT and CheW MeTrc MeTrc MeTrc MeTrc sensory_box sensory_box

52 Summary of automated transporter annotation - Zymomonas
317 Transporter Proteins
Count | TC class | Description
69 | 1 | Channels/Pores
82 | 2 | Electrochemical Potential-driven transporters
116 | 3 | Primary Active Transporters
2 | 4 | Group Translocators
2 | 5 | Transport Electron Carriers
14 | 8 | Accessory Factors Involved in Transport
29 | 9 | Incompletely Characterized Transport Systems

Count | TC subclass | Description
23 | 1.A | alpha-type channels
46 | 1.B | beta barrel porins
73 | 2.A | Porters (uniporters, symporters, antiporters)
9 | 2.C | Ion-gradient-driven energizers
103 | 3.A | P-P-bond-hydrolysis-driven transporters
2 | 3.B | Decarboxylation-driven transporters
13 | 3.D | Oxidoreduction-driven transporters
2 | 4.A | Phosphotransfer-driven group translocators
1 | 5.A | Transmembrane 2-Electron Transfer Carriers
1 | 5.B | Transmembrane 1-Electron Transfer Carriers
14 | 8.A | Auxiliary transport proteins
12 | 9.A | Recognized transporters of unknown biochemical mechanism
17 | 9.B | Putative uncharacterized transport proteins

53 Zymomonas transporters complete listing
GROUP 2.A.53 (Porters), 2 proteins
or0489 | 2.A.53 | sulfate transporter or Xanthine/uracil/vitamin C transporter
or1027 | 2.A.53 | carbonic anhydrase, sulfate transporter SulP family

GROUP 2.A.6 (Porters), 8 proteins
or0146 | 2.A.6.3/2 | putative lipooligosaccharide nodulation factor exporter, NolGHI, RND superfamily
or0252 | 2.A.6.2 | hydrophobe/amphiphile efflux-1 HAE1, RND superfamily
or0704 | 2.A.6.2 | acriflavin resistance protein, RND superfamily
or1290 | 2.A.6.2/3.A.1.122/8.A.1 | efflux transporter, RND family, MFP subunit
or1378 | 2.A.6.2 | acriflavin resistance protein, RND superfamily
or1379 | 2.A.6.2 | acriflavin resistance protein, RND superfamily
or1439 | 2.A.6.5/7 | hopanoid biosynthesis associated RND transporter like protein HpnN
or1719 | 2.A.6.4.1 | export membrane protein SecD, RND superfamily

GROUP 2.A.64 (Porters), 3 proteins
or1107 | 2.A.64.1.1 | twin-arginine translocation protein TatC
or1108 | 2.A.64.1.1 | twin-arginine translocation protein TatB
or1109 | 2.A.64.1.1 | twin-arginine translocation protein TatA

GROUP 2.A.66 (Porters), 5 proteins
or0190 | 2.A.66.1 | multi antimicrobial extrusion protein MatE
or0202 | 2.A.66.2 | polysaccharide biosynthesis protein
or1191 | 2.A.66.2 | polysaccharide biosynthesis protein
or1303 | 2.A.66.2 | polysaccharide biosynthesis protein
or1478 | 2.A.66.4.1 | virulence factor MviN family

GROUP 2.A.69 (Porters), 2 proteins
or0625 | 2.A.69.2./1 | predicted transporter, putative auxin efflux carrier component
or0626 | 2.A.69.2./1 | predicted transporter, putative auxin efflux carrier component

54 Transcriptome Analysis Pipeline: RNA sequences to GRN
Flowchart boxes:
Collect RNAseq data
Map reads to genomes
Predict operons in silico
Compare operon determinations (genome coordinates)
Improve algorithm
Determine orthologs with OrthoMCL
Determine orthologous operons
Calculate reads/bp
Display frequency plot
Cluster analysis of gene expression changes
Determine operons from frequency plot
Align orthologous promoters
Determine TISs with 5' RACE
Determine TFBS from alignments
Cluster analysis from gene expression arrays
Predict TFBS in silico
GRN = genetic regulatory network

Dynamic range and sensitivity
New gene, wrong start, riboswitch
Small Regulatory RNA ???
Differential gene expression
Operon with Internal Promoter

60 Long Term Vision
Develop TPing SOPs, and an automated analysis pipeline.
Initially produce TPs and preliminary GRNs for all important DOE microbial genomes (i.e. BESC), and eventually all DOE microbial genomes.
Incorporate the TP analysis pipeline into ORNL's automated microbial annotation pipeline, and eventually into IMG and GenBank files.
Add additional experimental methods to improve the GRN determinations.
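
The "Calculate reads/bp" step of the transcriptome analysis pipeline can be sketched as follows. This is an illustrative simplification, not the pipeline's actual code: the function name, the input layout (a list of mapped read start positions and a dictionary of 1-based inclusive gene coordinates), and the rule of assigning a read by its start position alone are all assumptions.

```python
def reads_per_bp(read_starts, features):
    """Return mapped reads per base pair for each feature.

    read_starts: list of 1-based start positions of mapped reads (hypothetical input).
    features: dict mapping a feature name to (start, end), 1-based inclusive.
    """
    density = {}
    for name, (start, end) in features.items():
        length = end - start + 1
        # A read is counted for a feature if its start falls inside the feature.
        n_reads = sum(1 for pos in read_starts if start <= pos <= end)
        density[name] = n_reads / length
    return density


# Toy example: three reads start inside geneA, one inside geneB.
toy_density = reads_per_bp(
    [5, 10, 15, 120],
    {"geneA": (1, 20), "geneB": (101, 200)},
)
print(toy_density)
```

Normalizing by feature length is what lets genes of different sizes be compared on the same frequency plot; adjacent genes with similar reads/bp are candidates for the same operon.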